The folks on the iNat team who work on the iNaturalist Computer Vision Model and the iNaturalist Geomodel spend a lot of time thinking about how accurate the Machine Learning (ML) models are, and how we can improve them. It's a particular challenge since our computer vision training dataset grows by over a thousand taxa and by over a million photos every month.
We tend to visualize accuracy using charts like you'll see below, and we use these charts to validate our models internally: are they ready for prime time? Is this month's model good enough for release?
Sometimes our accuracy numbers would drift from month to month, and recently we set out to explore why that might be.
Some of this could be due to test set sampling differences - if we test using batch A of photos for January and batch B for February, then we can expect to get slightly different results. It's also a little tricky to compare models directly to each other since their taxonomies are different. The February model knows about taxa that the January model didn't, for example.
Another theoretical reason might be some combination of ML training run differences: random initialization, optimizer, and variance in batch selection. In our experience this doesn't look like a big source of model accuracy difference from month to month but we'll keep an eye on it.
Another reason why our models might show different accuracy numbers from month to month is training set sampling error. We previously described the transfer learning strategy that we are using to train these monthly models. Models share the same foundational knowledge, but we use new photos each month to fine-tune the base model. For commonly observed taxa, we don't train on every image we have. Instead, we select at most 1,000 photos for training. For a taxon like Western Honey Bee, which our users have uploaded more than 400,000 observations of, every month we select a different fraction of photos to train on. Some months we might get better or worse photos than others, and that might result in better or worse performance.
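To make that sampling step concrete, here's a minimal sketch of capping training photos per taxon. It assumes a simple dataframe of photos with a taxon_id column; this is an illustration, not our actual export pipeline:

```python
import pandas as pd

MAX_PHOTOS_PER_TAXON = 1000  # the per-taxon cap described above

def sample_training_photos(photos: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Pick at most MAX_PHOTOS_PER_TAXON photos per taxon, at random.

    `photos` is assumed to have a `taxon_id` column. Using a different
    `seed` each month gives a different sample for heavily observed taxa
    like Western Honey Bee, which is the source of the sampling variance
    discussed here.
    """
    return (
        photos.groupby("taxon_id", group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), MAX_PHOTOS_PER_TAXON),
                                  random_state=seed))
    )
```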
In order to start exploring this last reason, when we made the export for the 2.11 model, we also made an alternate version of the export with the same taxonomy but selecting a different batch of photos. Then we trained an alternate model on this alternate export, and we did some comparison.
For each of the major groupings shown in the chart below, we selected 1,000 random RG observations not seen by the model during the training process and compared the species ID'd by the community to the model's top species suggestion. If the model agreed with the community, we considered the model correct, while if the model disagreed, we considered it incorrect.
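In code, that check is essentially a top-1 accuracy calculation. Here's a rough sketch; the data structures and the predict_species function are assumptions for illustration, not our evaluation code:

```python
from typing import Callable, Sequence

def top1_accuracy(observations: Sequence[dict],
                  predict_species: Callable[[dict], str]) -> float:
    """Fraction of observations where the model's top suggestion matches
    the community's species-level ID.

    Each observation dict is assumed to carry a `community_taxon` field;
    `predict_species` stands in for running the model on the observation's
    photo (and optionally its location, for the combined scores).
    """
    correct = sum(
        1 for obs in observations
        if predict_species(obs) == obs["community_taxon"]
    )
    return correct / len(observations)
```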
Here's a chart comparing the accuracy of the two models. The y-axis shows top1 accuracy as a percentage, the x-axis shows major clade or geographical groupings, and the colors represent the different models compared against each other. The *-vision colors are computer vision model accuracy only, and the *-combined colors show accuracy when combining the computer vision model with our geo model.
(Note: I'm red-green colorblind and I have some difficulty distinguishing the 2.11-alternate-vision and 2.11-alternate-combined colors, but they are in the same order in each major grouping, so I can still interpret the chart. If anyone is struggling to interpret the chart, please let me know and I'll see if I can make a more accessible variant.)
For someone who hasn't seen one of these charts before, I'd like to point out a few interesting interpretations or conclusions. First, combining computer vision and geo modeling performance helps a lot, sometimes as much as 20%. Second, we can see our dataset bias in this result: our models know a lot more about taxa in North America and Europe than they do about taxa in South America or Africa. This mirrors our observer community and our dataset of photos to train on. When we have more images from an area, we have better performance. There are some other interesting questions to explore about why our models perform better or worse on some taxonomic groupings like herps or mollusks compared to birds or plants.
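This post doesn't go into exactly how the geo model scores are folded into the vision scores, but as a toy illustration of why combining helps, imagine re-weighting each vision score by a geographic plausibility score. This scheme is an assumption for illustration only, not our production scoring rule:

```python
def combine_vision_and_geo(vision_scores: dict[str, float],
                           geo_scores: dict[str, float]) -> dict[str, float]:
    """Toy combination: scale each taxon's vision score by a geographic
    plausibility score in [0, 1], then renormalize. Taxa that look right
    but are out of range get pushed down the suggestion list.
    """
    combined = {
        taxon: vision_scores[taxon] * geo_scores.get(taxon, 0.0)
        for taxon in vision_scores
    }
    total = sum(combined.values()) or 1.0
    return {taxon: score / total for taxon, score in combined.items()}
```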
However, in the context of this alternate-model experiment, the takeaway is that two models trained on the same taxonomy but on different photos will show a small degree of variance in both computer vision and combined performance on the same test set.
Here's a look at how the 2.11-alternate model improved (or didn't) when compared directly against the 2.11 model:
We can see from this that we can probably expect less than a 2% variance from model run to model run based on training set sampling alone.
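If you want to run this kind of comparison yourself, the per-grouping deltas behind a chart like the one above reduce to a simple subtraction. Here's a sketch with hypothetical numbers, not our actual results:

```python
def accuracy_deltas(baseline: dict[str, float],
                    alternate: dict[str, float]) -> dict[str, float]:
    """Per-grouping change in top-1 accuracy (percentage points) of the
    alternate model relative to the baseline it is compared against."""
    return {group: alternate[group] - baseline[group] for group in baseline}

# Hypothetical accuracies, just to show the shape of the comparison:
baseline = {"Aves": 92.0, "Mollusca": 78.0}
alternate = {"Aves": 91.5, "Mollusca": 79.5}
print(accuracy_deltas(baseline, alternate))  # {'Aves': -0.5, 'Mollusca': 1.5}
```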
From this, we can do a better job of characterizing our model accuracy month-to-month: are things improving, getting worse, staying the same? It looks to us like our models are staying about the same in average accuracy while adding 1,000 new taxa a month. We think that's a pretty great result.
Another great way we can use this experiment is to judge how much a change to model training or technology is really improving model accuracy. If we try out a supposedly better piece of technology to train a future model, but we see less than a 2% accuracy bump, then we should probably be a little suspicious.
We're excited to share the results here with you, and we plan to share accuracy metrics about our models when we release them from now on.