New Vision Model Training Started

We've started training on a new model, which will be our first model update since July 2021. Here's what you need to know.

It’s bigger

iNaturalist data continues to grow. This time around we went from 38,000 to 47,000 taxa, and from 21 million to 25 million training photos.

We’ve sped up training time again

iNaturalist has new computer vision hardware! We have two more NVIDIA RTX 8000 GPUs, again granted to iNaturalist by NVIDIA. Based on early experiments, three GPUs seem to train about twice as fast as a single GPU in flat-out training speed. We also have a new computer vision server to house these GPUs, which has 4x the RAM, a much faster CPU, and really fast disks (at this scale, reading photos from disk and writing data back to disk is a limiting factor).

This training run is starting with the last checkpoint from the previous training run, rather than starting from the standard ImageNet weights like we did for the previous training run. Basically, this training run gets a head start in understanding what kinds of visual features are important for making iNaturalist suggestions.

We changed a few things about how we generate training data

Hybrid species are no longer included. The previous training run was the first time where we had a significant amount of training images for some avian hybrid taxa (mallard hybrids, for example), and including them as training categories really confused the model. Despite mallard being the most commonly observed species on iNaturalist, the most recent model had a hard time getting mallard suggestions right, struggling to tell the difference between mallards and the various mallard hybrids. Excluding hybrid species from the training set should keep the computer vision model on the task of trying to distinguish between visually distinct taxa.
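A minimal sketch of this kind of exclusion, for illustration only. The `Taxon` record, its `rank` field, and the `"hybrid"` rank value are assumptions made for the example, not a description of the actual iNaturalist export code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    taxon_id: int
    name: str
    rank: str  # hypothetical field, e.g. "species", "genus", "hybrid"


# Assumption: hybrids are identifiable by a rank value.
EXCLUDED_RANKS = {"hybrid"}


def training_taxa(taxa):
    """Return only the taxa that stay in the training set."""
    return [t for t in taxa if t.rank not in EXCLUDED_RANKS]


candidates = [
    Taxon(1, "Anas platyrhynchos", "species"),
    Taxon(2, "Anas platyrhynchos × rubripes", "hybrid"),
    Taxon(3, "Turdus migratorius", "species"),
]
kept = training_taxa(candidates)
print([t.name for t in kept])
```

In a setup like this, photos of an excluded hybrid simply never become a training category, so the model is never asked to separate the hybrid from its parent species.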

The number of photos from an individual observation that could be included in the training set has been capped to a max of 5 photos per observation. Previously, in very rare instances, a single observation with hundreds of nearly identical photos could have dominated the training data for a single taxon, potentially causing the model to learn visual features from just those photos, to the detriment of generalizing well to others’ photos.
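A minimal sketch of a per-observation cap, for illustration only. The post does not say which 5 photos are kept, so keeping the first 5 in input order is an assumption, as are the function name and the `(observation_id, photo_id)` input shape.

```python
from collections import defaultdict

MAX_PHOTOS_PER_OBSERVATION = 5  # the cap stated in the post


def cap_photos(photos, max_per_observation=MAX_PHOTOS_PER_OBSERVATION):
    """Keep at most `max_per_observation` photos for each observation.

    `photos` is an iterable of (observation_id, photo_id) pairs.
    Assumption: the first photos encountered are the ones kept.
    """
    kept = []
    counts = defaultdict(int)
    for observation_id, photo_id in photos:
        if counts[observation_id] < max_per_observation:
            counts[observation_id] += 1
            kept.append((observation_id, photo_id))
    return kept


# One observation with hundreds of photos, one with only three.
photos = [("obs_a", i) for i in range(300)] + [("obs_b", i) for i in range(3)]
capped = cap_photos(photos)
print(len(capped))
```

Here the 300-photo observation contributes 5 photos and the 3-photo observation is untouched, so no single observation can dominate the training data for a taxon.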

We changed one thing about how we train the model

Label smoothing is back in our training config. Label smoothing sets the “true” training labels to “softer” values like 0.9 instead of “harder” values like 1.0. Basically, when the model is shown a photo to learn from, we’re now saying “we’re very confident that it’s species X” instead of saying “we’re 100% convinced that it’s species X.” It’s designed to reduce overconfidence in model scores. This is something that we’ve done with some models in the past, but this configuration got lost in the transition for the previous training run.
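A minimal sketch of label smoothing in plain Python, for illustration only. It uses the common formulation in which a fraction `epsilon` of the probability is spread evenly over all classes. The value `epsilon=0.1` is an assumption, chosen to match the "values like 0.9" wording above; the actual training config is not given in this post.

```python
def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Return a smoothed target distribution for one training example.

    A hard label puts 1.0 on the true class and 0.0 elsewhere.
    Here `epsilon` of the mass is spread uniformly over all classes.
    """
    off_value = epsilon / num_classes
    on_value = 1.0 - epsilon + off_value
    labels = [off_value] * num_classes
    labels[true_index] = on_value
    return labels


labels = smooth_labels(true_index=2, num_classes=5)
print(labels)
```

The smoothed targets still sum to 1, but the true class no longer gets the full 1.0. With tens of thousands of classes, the true-class value is very close to `1 - epsilon`.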

When will it be ready?

This new model will take a few months to train, and then a few weeks to test / validate before we decide it’s ready for deployment. The increase in training images is slowing us down, but the new hardware is speeding us up. Starting from a checkpoint from our previously trained model also reduces the amount of training work that needs to be done.

Future Work?

The main priority is to get to our stated goal of training 2 models a year.

We're trying to be more transparent about when and how we train new models, so we’ll be working on changes for that. This post is a start, but you can expect a dashboard with more details and charts coming soon.

We have a grant from Amazon to explore ways to improve how we export and train our models, and we’ll be working on that before the end of the year.

Finally, we now have two computer vision systems, one dedicated to production training runs like this one, and another which we can use to run experiments to improve future training runs and explore other machine learning-based features for iNaturalist.

Posted on October 21, 2021 09:33 PM by alexshepard


Thanks for the update. Most of this stuff is over my field biologist head, but I sure appreciate the results!

Posted by rogerbirkhead about 2 years ago

ooh neat!

Posted by charlie about 2 years ago

The graphs seem to be a bit small. I love graphs if I am able to read them. Is this model 7?
May 2017 Model 1, 2-20 photos per species
Aug 2017 Model 2, 40 photos per species
Jan 2018 Model 3
Feb 2019 Model 4
Sep 2019 Model 5, 1000 photos per species
Mar 2020 Model 6, ~21,000 species and ~2,500 genera
July 2021 Model 7 (up from 38,000)
Oct-Dec 2021 Model 7 (up from 38,000)
April 2022 Model 9
Aug 2022 Model 10.0.1 (includes 60,000 taxa)
Sept 2022 Model 10.0.2 (+5,000 species; includes 65,000 taxa, up from 60,000)
Leaf model: if it is difficult to tell the children apart, does the AI/CV suggest the parent?

Figure: iNat Computer Vision Training (Number of Images per Training)

In the tree that our vision system sees, genera can be leaves if we don’t have enough photos to train any of the child species but we do have enough photos to train at the genus level.

Posted by ahospers about 2 years ago

All of that sounds very exciting, @alexshepard, and I look forward to the output!

May machine learning understand how organisms recognize themselves ;-)

Posted by jakob about 2 years ago

Sounds good although I am a bit worried about the hybrid part. There are hybrid plant species which are very common, some even more common than the parent species. I can see that they could confuse the model but does that mean that they won't be suggested at all in the future?

Posted by pastabaum about 2 years ago

Thanks for the update. It is interesting to hear the changes, though I don't understand all the jargon.

It may be implied in the text somewhere, but what was the cutoff date for being included in the new model? Just wondering if changes on a genus I've been working on will be in this model or the next one.

Also, as someone mentioned above, the graphs are very small. In fact, I can't read the text on them at all even when trying to zoom in.

Posted by rymcdaniel about 2 years ago

Removing hybrid species is not an ideal option. There are a lot of "one-off" hybrids but many are defined species and are iconic and common in several parts of the world. Cutting these from the training just limits the number of taxa, and this may include a good number of commonly seen species (mostly plants). The AI should (rightfully) have trouble with extremely similar species, and it confuses similar insects as much as it may confuse a mallard and a mallard hybrid. This feels like an unjustified issue.

When I was travelling to countries where mallard hybrids were common, it rightfully suggested those options and often it was correct. I've never had the reverse problem (suggesting hybrids where they aren't expected).

It needs to be accepted that the model will not always be perfect, and that users still need to make the right educated call. Worst case, other identifiers help correct it. But removing hybrids from the AI is not the answer.

Posted by silversea_starsong about 2 years ago

I have to agree with @silversea_starsong and @pastabaum. We should accept that the computer vision isn't going to be right 100% of the time. I don't think removing taxa from it is gonna solve much other than making the computer vision more certain under some circumstances. We should be encouraging users to use it as a tool with their best judgement in mind rather than relying on it and basing their IDs off it entirely.

Posted by swampass about 2 years ago

The cap of 5 photos per observation seems maybe good if there really are observations that have a ridiculous number of photos. If limited to 5, it seems maybe an analysis of those photos should be done to make sure the more distinct ones are retained or representatives of the most similar are left out. Possibly a similar problem is that I've noticed certain individual plants are photographed a lot, which probably causes training problems as well. For example, there are literally only four plants of M. fremontii at the following location and that is about 25% of all observations of that species.
A geographic weighting of photos could be useful in those situations but I'm not sure if that wouldn't have its own issues. That said, having photos from the same spot multiple times of the year is probably a very good thing.

Posted by keirmorse about 2 years ago

I second the suggestion not to remove hybrid species.

Posted by bdagley about 2 years ago

I also agree with others that removing common hybrids is not ideal. If they are common and the computer vision model is confused, then that just reflects the reality of the situation. I can see how it could muck things up though. I assume you have or can run analyses to see what the effect is of including and excluding hybrids. I'd be curious to see the results of that.

Are you including subspecies and varieties yet? Seems like that should happen as so many are just as or more distinct than some species are from each other.

Posted by keirmorse about 2 years ago

Apologies to folks who will miss hybrids in this run, but we want to make sure we understand our basic assumptions/goals for the model, which are that if there’s a lot of data (e.g. mallards) we should either be suggesting the right species with high accuracy or rolling up to a correct but coarser common ancestor. This isn’t happening with some taxa like mallards in the current model, and we think figuring out what’s going on by focusing on non-hybrids this run will help us get back on track.

Posted by loarie about 2 years ago

Exciting update. Is there a good reference where we could learn more of the technical details undergirding the model?

I'm sympathetic to removing hybrids. My assumption is that the cv will list both hybrid parent groups as high-scorers, and that it'll be easy for experts to pick out the hybrids from the parent groups -- so this seems like a good way to get major lift from the computer vision while reducing the confusion on straight-up species.

I wonder if species complexes (i.e. can't be visually distinguished) is a converse example? I wonder if clades might have a "cv-eligible" flag that is flipped off by default for hybrids and on by default for species, but overridden for common hybrids and for species in complexes.

Posted by schizoform about 2 years ago

Not every scenario is statistically consistent (i.e., where adding more data by definition will help you converge on the true answer). Sometimes you need to change the types of data and analysis to tease out those tricky "positively misleading" anomaly zone bits and get your model to converge toward the right peaks. With that in mind:
-- I'd guess that iterative model training would have better luck with the hybrids, and it's something people have been asking about for years. Asking a one-size-fits-all model to tell a mallard from a crocodile but also a mallard from a mallard hybrid in one fell swoop will always be a tough ask.
-- Also, any chance your models are planning to draw from valuable data pools like user verified false positives? For every species there's a list of other taxa commonly confused with X, and not using that to inform your priors seems like a waste of good information.

Posted by tristanmcknight about 2 years ago

Great work! I echo the concerns of other people, about excluding hybrids. I can think of several hybrid plants that are very commonly observed, if mainly in cultivation and relatively easy to distinguish (e.g. Bauhinia × blakeana, Acer × freemanii, Aesculus × carnea), and it would be a pain to correct them without CV nudging people towards the correct answer. Of course, I defer to the experts.

Posted by someplant about 2 years ago

It's less about missing hybrids and more that it's cutting out a portion of commonly-encountered species, which reduces the use of the AI. It's saved my ass a few times especially when dealing with waif or cultivated plants that are naturalizing, and I was not familiar with the species before.

Unfortunately it means some groups like Viola x wittrockiana are going to become plagued with misIDs again, something that took me and a few others a very long time to correct. That species is misidentified regularly as one of the two parents, which are separate entities and should be kept separate in data.

Posted by silversea_starsong about 2 years ago

Don't suppose there's data that shows which hybrids are most commonly selected with the CV ID, and then reach research grade? That might help evaluate which taxa are going to be most affected by this change going forward.

Posted by silversea_starsong about 2 years ago

These tools could be developed in collaboration with leading experts to enhance human capabilities but in my experience they too often are a shortcut that bypasses or ignores existing human capacity and furthermore neglects biogeography (without deep consideration and incorporation of this no such system will work).

There need to be a lot more negative constraints regarding which taxa should not be identified by the automated tools (to prevent overgeneralizing and routine misapplication of names when this should not be done) and to ensure that all suggested identifications are within the plausible range for the species. Relevant taxonomic and geographic authority files exist but are not utilized properly (one part of the marginalization of the collections-based taxonomic expert).

Not clear why the technology cannot be used already for truly useful things such as blocking all attempts to report taxa from outside plausible ranges? For example, I am constantly having to correct Apis cerana (Asian Honey Bee) to Apis mellifera. A useful system should know that only the latter species can occur in the New World, Africa, and Europe.

Posted by johnascher about 2 years ago

Thanks @apseregin :)

Posted by alexshepard about 2 years ago

Hi y'all, I have a few responses:

re: hybrids - As Scott said, we need to get the basics right first. The vision system is just that, a visual classifier. It can only distinguish things based on their visual differences. Differentiating between mallard & mallard hybrid is to me a different kind of visual task than differentiating between two species of thrush for example. I'm not sure if I'm alone in thinking this, but to me a mallard hybrid is simply not as distinct, visually or conceptually, from a mallard, when compared to the differences between two other species in another genus, or even two species in Genus Anas. Distinguishing mallards from mallard hybrids visually is not an easy task, and mallards are the most commonly encountered organism on the site. As our model grows and we ask it to predict more and more taxa based on more and more training images, we need to make sure we're getting the basics right before we expect the model to understand subtleties.

re: biogeography - In March the team released changes to our suggestions UI to exclude suggestions of taxa that do not occur nearby by default. (See for more info). Users now have to choose "show suggestions that do not occur nearby" in order to even see non-nearby suggestions. Hopefully that should have resulted in significant improvements in the cases mentioned by John. We're still looking at approaches to improve in this area, both from the UI & algorithm side of things as well as the modeling side, but we don't have anything to share yet.

re: a "cv-eligible" flag - this is a good suggestion, and I believe something similar has been suggested in the forums as well. We've mostly been operating on faith that these models could learn to distinguish almost anything given enough images, but we're slowly learning some of the ways that our faith might have been misplaced. Removing hybrids is an admittedly blunt first step towards correcting this, and I think we'll be exploring an approach like you describe next year. Not sure when it'll make it into the site.

re: "iterative training" - We are starting from a previously trained checkpoint this time, so I believe our models are being trained iteratively. Unless you mean something different?

re: "one-size-fits-all model" - yeah, this is interesting. On the one hand, all the literature I've read suggests that these large visual models do contain enough complexity to model this many distinct classes, with both fine- and coarse-grained classes represented. The most recent example that to me supports this is Abnar et al.'s paper Exploring the Limits of Large-Scale Pre-training. On the other hand, like you, my intuition (and that of others on the team) struggles with this. We will most likely explore an experiment in this area with our AWS grant later this year. If you have a paper, experiment, or code that you can point to that expands on your suggestion, I'd love to read it. Thinking ahead, I'm also curious about how to balance any improvement in overall accuracy with the increased cost of running & deploying multiple models in production and on device in places like Seek.

re: "using false positives to inform priors" - I'm not sure how to apply this to a model architecture like a visual classifier. Can you expand at all?


Posted by alexshepard about 2 years ago

A few more thoughts:

The wording accompanying different kinds of CV suggestions could be updated, or better addressed in new-user tutorials, to explain CV limitations and to caution against relying on CV alone (or on any superficial "matching" of images).

Out of range IDs somehow still occur, despite "nearby" and all-locations options (which may be problematic for those using without checking range). It would be useful if someday, more range info. was added for species (for CV or otherwise), e.g. locations where recorded. To know not just that a species isn't "nearby" a given within-country location, but also doesn't occur in the country. It may then be possible to exclude suggesting species that haven't yet been recorded or caution that they may be out of range (although range expansions would also need to be accommodated).

Re: hybrids, if efforts truly wanted to improve accuracy they may also incentivize/improve policy to improve photo quality, since I assume poor photos also make a negative contribution to CV. Hybrids seem an unlikely choice to target first as a problem. Also as some noted, "nearby" in some cases allows hybrids to be correctly suggested (so CV knowing hybrids doesn't only rely on there needing to be visual differences).

Having said this, I am looking forward to the updates and CV continuing to improve.

Posted by bdagley about 2 years ago

Are these trained models available for download? They could potentially be very useful for academic research. I would personally love to be able to adapt the model for some semi-automated annotation of coral reef transect photos.

Posted by rmcminds about 2 years ago

They are not, but I believe a pre-selected set of data is available.

Posted by astra_the_dragon about 2 years ago

I just noticed something else interesting: CV suggestions seem to also incorporate host flowers of specialist pollinators. CV suggests the eastern squash bee here, based on the flower. In this case the insect isn't a bee so CV was wrong. Still this may be useful for photos of the corresponding bee and flower. I assume CV detects what features are similar or differ between many obs. when training to suggest in this way, "seeing" entire images instead of picking out "insect," "flower," etc.

Posted by bdagley about 2 years ago

"been operating on faith that these models could learn to distinguish almost anything given enough images"

This is one of many cases where it surely would have been useful to consult those with relevant expertise from the outset.

A very large proportion of observations can be identified reliably if geography is taken into account but cannot be identified reliably from images alone. Consider an American Crow from California. Visually these usually cannot be distinguished from Fish Crow or Northwestern Crow and, globally, there are likely other species such as Carrion Crow that are too similar to routinely separate from lower quality images alone.

There are too many such cases to count and these are very well known to the experts.

For bees consider Halictus ligatus sensu lato. These can reliably be called H. poeyi in Florida and Halictus ligatus (sensu stricto) in the Western USA but in much of the SE USA it is not safe to separate them visually.

Given the nearly endless and very well documented cases where species are easy to separate by range but impossible or nearly so to differentiate visually it is not clear to me how the importance of accounting for biogeography was not obvious to all concerned from the very beginning?

Posted by johnascher about 2 years ago

@johnascher :

First, thanks for your significant and continuous effort on the iNat platform. We're all in your debt.

Second, do you have any concerns about emphasizing reliance on biogeography in the current environment of accelerating climate change? I've come to appreciate how much geography constrains IDs, but also how quickly geographical constraints are shifting and how important it is to track them.

Posted by schizoform about 2 years ago

Hi @johnascher

I made the "operating on faith" comment in the context of a cv-eligible flag, and was not talking about geofrequency there.

The first vision model ever released on iNaturalist included a geofrequency component. It has been tweaked several times over the years, most recently in March. But we have never had a belief or faith that a CV model could make great suggestions in all contexts without a geospatial component, and my comment about "operating on faith" was not meant to imply that.

If you're trying to say that we're still not doing a good enough job at it, then I agree, which is why I said earlier that we're still working on it.

Best wishes,

Posted by alexshepard about 2 years ago

Are you still working on getting the most recent model into Seek? According to the July 13 announcement the latest model went into the iNaturalist API server but Seek did not have it yet.

Posted by joergmlpts about 2 years ago

@joergmlpts Yes, we're still working on that - Seek isn't using the latest model yet.

Posted by tiwane about 2 years ago
