Can you please elaborate a bit more on the model architecture and what you tried with respect to transfer learning?
Did you use an ImageNet architecture, e.g. VGG, and retrain it from scratch, or a custom architecture? Did you try chopping off the last 1/2/3 layers of a pretrained model and fine-tuning?
Bonus points: 1. How much better were your results training from scratch vs. fine-tuning? 2. How long did it take to train your model, and on what hardware?
:)
We ended up with a custom architecture trained from scratch, due to runtime constraints more than accuracy concerns (inference runs on phones, so we have to be efficient with CPU + memory). That said, it also ended up being the most accurate model we could build in the time we had. (With more time/resources I have no doubt I could have achieved better accuracy with a heavier model!)
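To give a sense of the size class (this is just an illustrative PyTorch sketch, not our actual architecture; the layer widths and class count are made up), something like this captures the idea of a small from-scratch CNN that fits a phone's CPU + memory budget:

    import torch
    import torch.nn as nn

    class SmallNet(nn.Module):
        """Hypothetical tiny CNN, only to illustrate the size class."""
        def __init__(self, num_classes=10):              # class count is made up
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),                  # global average pool keeps the head tiny
            )
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, x):
            x = self.features(x)
            return self.classifier(torch.flatten(x, 1))

    model = SmallNet()
    print(sum(p.numel() for p in model.parameters()))     # ~24k params vs ~138M for VGG16

A few strided convs plus global average pooling keeps the parameter count in the tens of thousands, versus ~138M for something like VGG16, which is the main reason a big ImageNet architecture wasn't a great fit on-device.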
Training the final model took about 80 hours on a single Nvidia GTX 980 Ti (the best thing I could hook up to my MacBook Pro at the time). That's for 240 epochs (150k images per epoch), run in 3 learning-rate annealing phases, each phase consisting of a handful of CLR (cyclical learning rate) cycles.
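If it helps, the general shape of that schedule looks roughly like the sketch below (using PyTorch's CyclicLR purely as an example; the batch size, learning-rate ranges, and cycle/phase lengths are illustrative placeholders, not my actual values):

    import torch
    from torch.optim.lr_scheduler import CyclicLR

    model = torch.nn.Linear(10, 2)                       # stand-in model, not the real one
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    steps_per_epoch = 150_000 // 256                     # assuming a batch size of 256
    phases = [(1e-2, 1e-1), (1e-3, 1e-2), (1e-4, 1e-3)]  # (base_lr, max_lr), lowered each annealing phase

    for base_lr, max_lr in phases:
        scheduler = CyclicLR(optimizer, base_lr=base_lr, max_lr=max_lr,
                             step_size_up=2 * steps_per_epoch)  # half a CLR cycle spans ~2 epochs
        for _ in range(4 * steps_per_epoch):             # a couple of CLR cycles per phase
            # real training would do a forward pass and loss.backward() before these two lines
            optimizer.step()
            scheduler.step()

The idea is just that the learning rate oscillates within each phase and the whole range gets annealed downward from phase to phase.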
I'll answer in more detail in the full blog post; it's a bit complicated to explain in a comment. I'll have charts & figures for y'all :)
Thanks for sharing all the tech details too; it's been great to read. I'm even more amazed to see it as a real app, which I didn't expect!
Edit: Just read your bio. Now it makes sense!