Some thoughts on ML accuracy

The folks on the iNat team who work on the iNaturalist Computer Vision Model and the iNaturalist Geomodel spend a lot of time thinking about how accurate the Machine Learning (ML) models are, and how we can improve them. It's a particular challenge since our computer vision training dataset grows by over a thousand taxa and by over a million photos every month.

We tend to visualize accuracy using charts like you'll see below, and we use these charts to validate our models internally: are they ready for prime time? Is this month's model good enough for release?

Sometimes our accuracy numbers would drift from month to month, and recently we set out to explore why that might be.

Some of this could be due to test set sampling differences - if we test using batch A of photos for January and batch B for February, then we can expect to get slightly different results. It's also a little tricky to compare models directly to each other since the taxonomies are different: the February model knows about taxa that the January model didn't, for example.

Another theoretical reason might be some combination of ML training run differences: random initialization, optimizer, and variance in batch selection. In our experience this doesn't look like a big source of model accuracy difference from month to month but we'll keep an eye on it.

Another reason why our models might show different accuracy numbers from month to month is training set sampling error. We previously described the transfer learning strategy that we are using to train these monthly models. Models share the same foundational knowledge, but we use new photos each month to fine-tune the base model. For commonly observed taxa, we don't train on every image we have. Instead, we select at most 1,000 photos for training. For a taxon like Western Honey Bee, of which our users have uploaded more than 400,000 observations, every month we select a different fraction of photos to train on. Some months we might get better or worse photos than others, and that might result in better or worse performance.
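If it helps to picture that per-taxon cap in code, here's a minimal sketch of the idea. The 1,000-photo cap comes from the description above; the pandas DataFrame layout, the `taxon_id` column name, and the seed handling are illustrative assumptions, not our actual export pipeline.

```python
import pandas as pd

# The cap comes from the post; everything else in this sketch is illustrative.
MAX_PHOTOS_PER_TAXON = 1_000

def sample_training_photos(photos: pd.DataFrame, seed: int) -> pd.DataFrame:
    """Select at most MAX_PHOTOS_PER_TAXON photos per taxon.

    For heavily observed taxa (e.g. Western Honey Bee), a different seed
    yields a different subset of photos each month - that's the training
    set sampling variance the experiment below tries to measure.
    """
    return (
        photos.groupby("taxon_id", group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), MAX_PHOTOS_PER_TAXON),
                                  random_state=seed))
    )

# Two exports with the same taxonomy but different photo batches:
# export_a = sample_training_photos(photos, seed=1)  # one batch
# export_b = sample_training_photos(photos, seed=2)  # a different batch
```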

In order to start exploring this last reason, when we made the export for the 2.11 model, we also made an alternate version of the export with the same taxonomy but a different batch of photos. Then we trained an alternate model on this alternate export and ran some comparisons.

For each of the major groupings shown in the chart below, we selected 1,000 random RG observations not seen by the model during the training process and compared the species ID'd by the community to the species the model suggested. If the model agreed with the community, we considered the model correct; if it disagreed, we considered it incorrect.
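Spelled out in code, that test boils down to something like the sketch below. The `predict_species` callable and the observation field names are placeholders standing in for our model and data, not our actual evaluation harness.

```python
import random

def top1_accuracy(observations, predict_species, sample_size=1_000, seed=42):
    """Estimate top-1 accuracy for one grouping (clade or region).

    observations: list of dicts holding a photo and the community's
    species-level ID (field names are illustrative).
    predict_species: stand-in for the model's single best suggestion.
    """
    rng = random.Random(seed)
    sample = rng.sample(observations, min(sample_size, len(observations)))
    correct = sum(
        1 for obs in sample
        if predict_species(obs["photo"]) == obs["community_species_id"]
    )
    return correct / len(sample)
```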

Here's a chart comparing the accuracy of the two models. The y-axis shows top1 accuracy as a percentage, the x-axis shows major clade or geographical groupings, and the colors represent the different models compared against each other. The *-vision colors show computer vision model accuracy only, and the *-combined colors show accuracy when combining the computer vision model with our geo model.

(Note: I'm red-green colorblind and I have some difficulty distinguishing the 2.11-alternate-vision and 2.11-alternate-combined colors, but they are in the same order in each major grouping, so I can still interpret the chart. If anyone is struggling to interpret the chart, please let me know and I'll see if I can make a more accessible variant.)

For someone who hasn't seen one of these charts before, I'd like to point out a few interesting interpretations or conclusions. First, combining computer vision and geo modeling helps a lot, sometimes by as much as 20%. Second, we can see our dataset bias in this result: our models know a lot more about taxa in North America and Europe than they do about taxa in South America or Africa. This mirrors our observer community and our dataset of photos to train on. When we have more images from an area, we have better performance. There are some other interesting questions to explore about why our models perform better or worse on some taxonomic groupings like herps or mollusks compared to birds or plants.
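We haven't spelled out the combination rule in this post, so treat the snippet below as just one plausible way to fold a geo prior into vision scores (multiply and renormalize), not a description of our production code; the array names are illustrative.

```python
import numpy as np

def combine_vision_and_geo(vision_scores: np.ndarray,
                           geo_scores: np.ndarray) -> np.ndarray:
    """Reweight per-taxon vision scores by a geo prior and renormalize.

    vision_scores: per-taxon scores from the vision model for one photo.
    geo_scores: per-taxon scores for how expected each taxon is at the
    observation's location, aligned on the same taxon axis.
    """
    combined = vision_scores * geo_scores
    total = combined.sum()
    return combined / total if total > 0 else vision_scores

# top1_taxon = taxon_ids[np.argmax(combine_vision_and_geo(v, g))]
```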

However, in the context of understanding this alternate experiment, the takeaway is that two models trained on the same taxonomy but different photos will show a small degree of variance in both computer vision and combined performance on the same test set.

Here's a look at how the 2.11-alternate model improved (or didn't) when compared directly against the 2.11 model:

We can see from this that we can probably expect less than a 2% variance from model run to model run based on sampling.

From this, we can do a better job of characterizing our model accuracy month-to-month: are things improving, getting worse, staying the same? It looks to us like our models are staying about the same in average accuracy while adding 1,000 new taxa a month. We think that's a pretty great result.

Another great way we can use this experiment is to judge how much a change to model training or technology is really improving model accuracy. If we try out a supposedly better piece of technology to train a future model, but we see less than a 2% accuracy bump, then we should probably be a little suspicious.
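As a trivial illustration of that sanity check, a comparison script might look something like the sketch below; the ~2-point noise floor comes from the experiment above, while the dict layout and group names are placeholders.

```python
# Rough sanity check: does a candidate model's accuracy gain clear the
# ~2-point run-to-run sampling variance measured above?
SAMPLING_NOISE_PCT = 2.0

def flag_real_improvements(baseline: dict, candidate: dict) -> dict:
    """Per-group accuracy deltas (in percentage points), with a flag for
    whether each delta exceeds the sampling-noise floor."""
    report = {}
    for group, base_acc in baseline.items():
        delta = candidate[group] - base_acc
        report[group] = {"delta": delta,
                         "beats_noise": delta > SAMPLING_NOISE_PCT}
    return report
```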

We're excited to share the results here with you, and we plan to share accuracy metrics about our models when we release them from now on.

Posted on March 7, 2024 04:26 PM by alexshepard

Comments

Interesting to read, thanks for sharing.

Posted by rudolphous about 2 months ago

Am I missing something - what does ML stand for?

Posted by jdmore about 2 months ago

@jdmore sorry for the oversight - ML here stands for Machine Learning. I've updated the text.

Posted by alexshepard about 2 months ago

Thanks, makes perfect sense now!

Posted by jdmore about 2 months ago

Great to know that sampling variation alone doesn't account for a huge amount of error variation even across taxa!

One interesting application/extension of this might be to do a sensitivity analysis for number of training photos needed - that is, running different taxa with different numbers of photos and seeing if there are differences in the optimal number of photos/accuracy tradeoff. Is there value for some taxa in including more photos/taxon where a substantial improvement in accuracy is seen? Conversely, maybe some taxa need fewer photos for accurate training and reducing the training set for these could lead to quicker training (assuming the model doesn't require perfectly balanced numbers of training photos for all taxa).

Posted by cthawley about 2 months ago

Excellent, I'm looking forward to this data being released monthly.

If you're going to release more metrics it would be super cool to make a data viewer page where you could look at the model accuracy on any arbitrary taxon included in the model

Posted by kevinfaccenda about 2 months ago

Great results and write up! I agree with Kevin that it’d be interesting to make a dashboard or viewer page to look at any taxon

Posted by sm356 about 2 months ago

Thank you for the insights.

Posted by hedaja about 2 months ago

https://www.inaturalist.org/observations/201954548
I find CV a useful tool. Here it took me to Tribe.
Then I use the map for Seen Nearby to find a possible species.
And this is for a sp with only 3 obs on iNat!!
Accurate? Wait and see.

Another K sp each month - makes it worthwhile to dig thru old broader IDs to give CV a fresh chance. Especially for us in Africa - we move from NO Idea! to Maybe this?

Posted by dianastuder about 2 months ago

Nice article thank you! A few comments and questions:

a. It should not be a question of "better or worse photos", but rather about how similar the sample of photos is to any subsequent submission for analysis. And it is not just about quality of images, or even the angle of view etc., but can also include any kind of correlated association. E.g. anyone with much experience of submitting more challenging / rarer species may notice how the model often latches onto some common types of background being associated with certain species (e.g. because they are usually photographed sitting on a leaf, or against a stony background, or in water* etc.). This point is also relevant to the discussion on dataset bias and accuracy as picked up in point 4 below.
[*freshwater plants are often identified as being marine – presumably something that could be improved upon via the geographical grouping work?]

b. "The y-axis shows top1 accuracy as a percentage.."
What is this supposed to mean? Presumably it is not referring to some top 1% of 1,000 observations, which would only leave you with 10 data points for each bar. The numbers look rather more like the overall percentage of each batch of 1,000 that was deemed by the community to be accurate?

c. Yes, the faint colours are difficult to discriminate. But we can still distinguish them easily enough by their consistent ordering. Maybe it would be clearer to split it out into two separate charts? I.e. one to show the difference between "vision" and "combined" and a separate one to show 2.11 vs. 2.11-alternate (or even just skip this latter one, since this conclusion is probably more easily appreciated from the final chart in any case). It is confusing to the reader to say "Here's a chart showing the two models" when referencing a chart showing four different datasets.

d. The dataset bias is presumably also heavily impacted by what proportion of each sample set is likely to come from more commonly encountered and photographed subjects. E.g. in my experience the model tends to perform rather poorly on plants (and noticeably worse than say insects, mammals or reptiles), but presumably scores highly here due to how large a proportion of total plant observations are of the same most commonly encountered species.

e. y-axis is not labelled or clarified in the second plot. Presumably these are percentage point changes? Again, what is meant by "top1%" is not clear.

f. The variation from run to run is presumably dependent upon the number of photos included in each training set. I would expect that as the number of photos is decreased, the variation in accuracy from run to run would increase, and vice-versa.

g. Have you ever looked at the impact of (low match scoring) outliers in the training set, or considered whether there is any way to better exclude or integrate them into the model (e.g. with a second pass training after filtering for outliers)? I wonder if there are some kinds of outlier that are better excluded, or others that we might wish to prioritise as being under-represented (e.g. there are not many images of the underside of moths...) Presumably once you're up to 1,000 images, the impact of outliers in most cases is smoothed out to being negligible, but...

h. What about the impact of different presented forms of the same organism in the training set? There can be significant sexual dimorphism, numerous distinctly different life stages as with many insects (gall, larva, pupa, nymph etc.), or representations or evidence for an organism residing in totally different modalities – scats, tracks, discarded parts like feathers, signs of feeding and other behaviour etc. Combining all of these, I imagine for some species there could be a significant variation among even 1,000 images in the relative proportions of all these different possible representations.
The fact that we have annotations to indicate these different representations then raises the question of whether there is much benefit to be had in splitting them out into separate training sets (and thus the model would gain the ability to distinguish e.g. male from female for that species), or whether the ML algorithm is better left to its own "black box" interpretations in taking care of them all as one. I guess the answer is likely to depend on the species in question, and particularly on whether there are enough images available to make sub-categories a statistically viable proposition.

Just some food for thought ☺

P.S. I had to switch from using a numerical to an alphabetical list in the above as otherwise the system automatically applies some broken formatting algorithm which insists that there should only be paragraph spacing between the first two entries of a numerical list. The numerous attempts I made to fix it with additional spacing, or by duplicating the numbers as "1. 1.", "2. 2.", etc. either had no effect or made things even worse XD

Posted by bsteer about 2 months ago

Interesting. I have wondered for some time if an explanation exists for the vision model very frequently identifying many orchid species as the terrestrial orchid Spathoglottis plicata, even when they look quite different. I have identified more than two thousand Spathoglottis plicata records and encountered many errors with other Spathoglottis species and horticultural cultivars, which makes sense and is not unexpected, the general shape of the flower being similar. In that case, maybe the dataset used for training the model already contains a significant level of mixed species and cultivars. But much more surprising is that it is also not uncommon for very different species with structural differences in the flower (including epiphytes such as Dendrobium or Phalaenopsis, and horticultural hybrids) to be suggested in place of simply "orchids" or one of its tribes. As S. plicata has, for an orchid species, very little variability (only 2 major types in the world), I would be interested to know if there is an explanation for why it is the name proposed for so many orchid records left unidentified by the observer.

Posted by chacled about 2 months ago

@alexshepard Have you ever compared a computer vision driven systematic classification approach (using traditional classifiers to determine features that would occur in traditional identification keys) to a machine learning driven, systematics-naive approach (which I assume you use now)? I would be very interested to understand how those compare. Furthermore, in a small subset of specific cases that may allow for some potentially useful developments: (1) a capability for machine-generated identification keys, (2) a capability for machine-generated unknown organism descriptions, and (3) the timely issuing of guidance to observers as to which features to collect in order to better support an identification at the species level (e.g. leaf, bark, flower, etc.).

Of these potential outcomes, #2 might be the most exciting as it could potentially help to bridge the gap in vocabulary between the scientific community and the general public. A learning mode with mouseover-driven concept definitions might help there too.

Posted by pratyeka about 2 months ago

Pertaining to examples of photos where the CV makes very inaccurate suggestions, or suggests widely unrelated taxa, I've noticed that the CV seems not to assess depth and/or size/proportional size well and seems to focus more on common color patterns (of the entire photo, including backgrounds). E.g., a close-up macro photo of a butterfly's wing or spots on a beetle can get CV suggestions of various unrelated, much larger animals. It seems that the CV may have trouble with many (but not all) macro photos in general, even though those are often the highest quality photos from an ordinary standpoint. There are a few CV/ML-related things that it would still be ideal to clarify:

How many different versions of visual-recognition-based ID suggestions are there? E.g., using the desktop version, are there differences between using CV on a fully opened observation vs. using Visually Similar in Compare or Suggestions (whether on a fully opened obs. or when viewing an obs. from Identify mode)? There seem to be; e.g., they can give different suggestions. If so, why are these different, and wouldn't it be ideal for them all to use the same system, the most accurate one? A less clear question is whether the CV differs between the desktop version, iOS app, and/or Android app. Again, if there are any differences, it would seem ideal to use the same system, the most accurate one.

Posted by bdagley 24 days ago

Finally, we do know that the Seek app's CV differs from all of the others at least in not reflecting Nearby / Predicted / Expected Species, which itself greatly lowers its accuracy. The admins stated, what seems like 2 or more years ago now, that they plan to fix the Seek CV so it reflects Nearby, but it hasn't been fixed yet. I've never used Seek, but it would be helpful if possible to add larger disclaimers to Seek explaining that users should consider on their own whether a species is in range, because Seek will show out-of-range ID suggestions.

A similar issue, even on the desktop version, is that once in Compare/Suggestions, you can filter by location and taxon, but when you then also select the Visually Similar filter, it seems to no longer take "Nearby" suggestions into account; it does this without notifying users, which can mislead them. It would be ideal to fix this, analogous to the Seek issue, or at least to add disclaimers that Nearby is no longer taken into account.

Finally, in any version, e.g. the desktop version, it would be ideal to detect cases where the CV has no "We're Pretty Sure It's _" suggestion and shows a list of very unrelated taxa, and to add a disclaimer at the top saying something like "We're very uncertain, so consider these suggestions with caution." Or possibly, in cases like that, to show no suggestions at all, due to the high uncertainty. I'm referring to cases where, e.g., an insect photo gets suggestions of trees, birds, fungi, mammals, etc. In short, the fact that multiple very unrelated taxa that also differ greatly in size, morphology, and sometimes even color pattern are suggested side by side should itself indicate that the visual recognition system is having high difficulty recognizing the image and so shouldn't be used. A third possibility to address cases like that would be to show only very broad suggestions, e.g., the closest parent taxon that all of the individual suggestions are members of.

Posted by bdagley 24 days ago

@bdagley @alexshepard I have also noticed the same issue with tight macros/microscopy/partial organism images. The ML pipeline could presumably be improved with a pre-classification stage so that, for example, tree species identification could be separately trained on leaves, bark, flowers and seeds or combinations thereof. This would remove undue crossover in this subset of cases and should be readily achieved in most cases.

Posted by pratyeka 24 days ago
