Experiments to estimate the accuracy of iNaturalist observations

One of iNaturalist's core goals is generating high-quality biodiversity data to advance science and conservation. We are launching some experiments to better understand the accuracy of these data. Here’s how they will work:


Step 1 Generate the sample

We draw a random sample of observations from the iNaturalist database.


Step 2 Find potential validators and distribute sample

We choose potential validators and distribute the sample among them, considering their past activity identifying observations on iNaturalist (more details in the FAQ below). We assign the same observation to multiple validators to increase the odds that a large fraction of the sample will be reviewed.
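
For the curious, here is a minimal sketch of how such an assignment could work - hypothetical data structures in Python, not the actual iNaturalist code; the 5 copies and the per-validator cap come from the numbers discussed later in this post:

```python
import random
from collections import defaultdict

def assign_sample(observations, qualified, copies=5, max_per_validator=100):
    """Greedily assign each observation to up to `copies` qualified validators.

    observations: observation ids in the random sample
    qualified: dict mapping observation id -> list of validator ids considered
               qualified for that observation's taxon (see the FAQ below)
    Returns a dict mapping validator id -> that validator's subsample.
    """
    subsamples = defaultdict(list)
    for obs in observations:
        # only consider validators with room left in their subsample
        candidates = [v for v in qualified.get(obs, [])
                      if len(subsamples[v]) < max_per_validator]
        random.shuffle(candidates)
        for v in candidates[:copies]:
            subsamples[v].append(obs)
    return subsamples
```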


Step 3 Contact potential validators with subsamples, instructions, and deadlines

We send emails to each validator with a link to their subsample loaded in the iNaturalist Identify tool, instructions to identify each observation as best they can, and a deadline after which we will use the new identifications to assess the accuracy of the sample.


Step 4 Validators add new identifications to their subsamples

The instructions are for validators to add the finest identification they can to each observation. We’ve included the instructions in the FAQ below if you’re curious about the details. We know this means that some observations that are already Research Grade might get a flurry of redundant confirming identifications.


Step 5 Assess accuracy by comparing validator identifications to the previous identifications

The top-level statistic we are aiming to estimate is Accuracy (the percent of the sample that is correctly identified). We will do this by assuming that the new identifications added by validators are accurate and comparing them to the observation taxon (more details in the FAQ below) to classify observations as correct, incorrect, or uncertain. We use these classifications to calculate high and low estimates of accuracy.
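
To make the arithmetic concrete, here is a sketch of those high and low estimates in Python (the rule - the low estimate counts uncertain observations as incorrect, the high estimate counts them as correct - is spelled out in the progress updates in the comments below):

```python
def accuracy_bounds(n_correct, n_incorrect, n_uncertain):
    """Low and high accuracy estimates from validator classifications.

    The low bound counts every uncertain observation as incorrect;
    the high bound counts every uncertain observation as correct.
    """
    n = n_correct + n_incorrect + n_uncertain
    return n_correct / n, (n_correct + n_uncertain) / n

# With 840 correct, 48 incorrect, and 112 uncertain out of 1,000
# (figures from one of the progress updates below):
# accuracy_bounds(840, 48, 112) -> (0.84, 0.952)
```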

If the sample size is large enough, we may be able to understand variation in accuracy by dividing up the sample by geography, taxonomic group, quality (research grade etc.), and other characteristics.

Our first experiment

We’ll be piloting this protocol with our first experiment later this month (Experiment 1). We’ve already generated the sample (Step 1) and selected potential validators (Step 2). We plan on emailing the potential validators on January 17th (Step 3) with a deadline of January 31 to give validators two weeks to identify their subsamples (Step 4) before we share the results the first week of February (Step 5).

For this first experiment we generated a modest-sized sample of just 1,000 observations. We distributed it among 1,219 potential validators, attempting to assign each observation to at least 5 validators to increase the chance that it will be reviewed. Here are some characteristics of the observations in the sample from Experiment 1:

Thank you so much in advance if we contact you as a potential validator and you choose to participate. We couldn’t do this experiment without your help and expertise!

Frequently Asked Questions

How exactly are you selecting potential validators?
If an identifier has made at least 3 improving identifications on a taxon, we consider them qualified to validate that taxon. Improving identifications are the first suggestion of a taxon that the community subsequently agrees with.

For example, Identifier 1 adds an ID of Butterflies to an observation. If Identifier 2 later adds a leading ID of Red Admiral, Identifier 1's ID of Butterflies becomes an improving ID. If Identifier 3 later adds a supporting ID of Red Admiral, Identifier 2's ID of Red Admiral becomes an improving ID.

Note that we count both Identifiers 1 and 2 as having 1 improving ID on Butterflies (since Red Admiral is within Butterflies). Only Identifier 2 has an improving ID on Red Admiral.
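
In code form, the qualification check might look like this minimal sketch (a hypothetical data model, not the actual implementation) - an improving ID counts toward its own taxon and every coarser ancestor:

```python
from collections import Counter

def qualified_validators(improving_ids, ancestors, taxon, min_count=3):
    """Identifiers with at least `min_count` improving IDs within `taxon`.

    improving_ids: list of (identifier, id_taxon) pairs, one per improving ID
    ancestors: dict mapping each taxon to the set of its ancestors, itself
               included - e.g. ancestors['Red Admiral'] contains 'Butterflies'
    """
    counts = Counter(identifier
                     for identifier, id_taxon in improving_ids
                     if taxon in ancestors[id_taxon])
    return {who for who, n in counts.items() if n >= min_count}
```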

How will I know if I was selected to be a validator?
You’ll receive an email from iNaturalist titled “Will you help us estimate the accuracy of iNaturalist observations?”

How large are the samples you’re sending to validators?
It varies. For Experiment 1, many validators are only being sent a single observation. No validator is being sent more than 100 observations.

What if I can’t identify an observation in my sample?
Please add the finest identification you can based on the evidence in the observation. Even if it’s ‘Birds’ or even ‘Life’, that’s ok. We won’t learn anything from non-disagreeing identifications that are coarser than the observation taxon, but that’s ok too. The only thing that will really hurt our assumptions is an incorrect identification.

What if an observation in my subset has no photo or there are other issues like missing locations?
We've excluded observations without media (photos or sounds), or with "no" votes on the "Evidence of organism" data quality flag, from the subsamples. Observations with other data quality issues, like missing locations, may be included. Please do your best to identify them despite the issues.

I don’t want to add confirming identifications to observations that are already research grade.
We realize that this can be undesirable - e.g. some identifiers like to preserve their reputation for not “piling on” etc. But we need new identifications on all observations to estimate accuracy, so we would appreciate your help with the assessment by adding a new identification in these cases. If it helps, feel free to mention that you’re participating in this experiment and to link to this blog post in your identification remarks.

What happens if multiple validators of the same observation give different results?
If the validators all agree (i.e. their identifications are of the same taxon, or one is a coarser non-disagreement with another), we will choose the finest taxon as the “correct” answer. If validators disagree (i.e. their identifications are on different branches, or one is a coarser disagreement with another), we will investigate these conflicts on a case-by-case basis to decide how to proceed. We’re hoping these conflicts will be rare.
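
Here is a sketch of that agreement logic (assuming each taxon carries its ancestor set; explicit coarser disagreements aren't modeled here and would be flagged for the case-by-case review mentioned above):

```python
def consensus(validator_taxa, ancestors):
    """Combine one observation's validator IDs into a single answer.

    validator_taxa: non-empty list of taxa identified by the validators
    ancestors: dict mapping each taxon to the set of its ancestors (incl. itself)
    Returns the finest taxon when all IDs sit on one branch, or None when
    IDs land on different branches (a conflict to investigate by hand).
    """
    finest = validator_taxa[0]
    for taxon in validator_taxa[1:]:
        if finest in ancestors[taxon]:        # this ID is finer: adopt it
            finest = taxon
        elif taxon not in ancestors[finest]:  # different branches: conflict
            return None
        # otherwise it's a coarser non-disagreement; keep `finest`
    return finest
```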

What if I’m 100% sure of the genus but only 90% sure of the species? Which should I identify it as?
Unfortunately, there’s a degree of subjectivity involved in an identifier's particular comfort level. For example, some identifiers need to be able to see specific characters to feel confident enough to add a species-level identification. Others are more comfortable using things like location as a constraint (e.g. “I can’t positively rule out other members of this genus based on the photo, but there’s only one species that occurs here”). Please identify as finely as you feel comfortable, but if you need a rule of thumb, add the finest identification that you think is 99.99% correct. In other words, if you think there’s less than a 0.01% chance that the out-of-range look-alike could have hitchhiked to the location, then it's fine to choose the in-range species even if you can’t see the diagnostic character.

What assumptions are you making when you estimate accuracy from this experiment?
We’re assuming that the sample is representative of the entire iNaturalist dataset and that validators do not add incorrect identifications. The larger the sample size, the stronger the first assumption becomes. We've only selected validators with at least 3 improving identifications for the respective taxon, but that doesn't mean they never misidentify that taxon. We've added redundancy by attempting to select 5 validators for each sample, but some samples have no qualified validators and we know we won't get a 100% response rate.

What happens if you don’t get much of the sample reviewed, either because no one participates or because no one can add correct identifications?
If we can’t get an observation in the sample reviewed, or the validators can’t add an identification as fine as the previous observation taxon, we will code it as "Uncertain". We aren’t making any assumptions about uncertain observations, but they do increase the uncertainty in our estimates. The worst-case scenario is that we have so many Uncertain observations that the bounds on our accuracy estimates are too broad to be useful (e.g. accuracy estimates with a low of 50% and a high of 100%).
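
Plugging that worst case into the accuracy_bounds sketch from Step 5 above:

```python
# 500 correct, 0 incorrect, 500 uncertain out of 1,000:
# accuracy_bounds(500, 0, 500) -> (0.5, 1.0), i.e. a low of 50% and a high of 100%
```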

What if I’ve already identified an observation in my subsample?
Please review your old identification and if it is still relevant you can skip it. If you no longer think that your older identification is correct, please add a new identification.

What if your sample size isn’t large enough to get robust estimates?
That’s possible. We won’t know until we get a sense of how much participation we get, the proportion of Uncertain observations we’re left with, and how the community responds to this pilot. If the response is good, we can increase the sample size in future experiments. This will likely be necessary if we want robust estimates for under-represented regions and/or taxa (e.g. African fish).

How will we get to see the results of the experiment?
We’ll post a report summarizing the results of the experiment to the iNaturalist blog a week after the experiment deadline. We'll comment here with a link when it's up.

What instructions are you sending to validators?
We’ve copied them below in case you’re curious (note: instructions will have subsamples tailored to each validator. The 10 here are just meant to serve as an example):


Dear loarie,

Will you help us with a study to estimate the accuracy of iNaturalist observations?


  1. Identify all observations in the link below as finely as you can, even if they are Research Grade, and even if your finest ID is at a higher taxonomic level (even kingdom).

  2. If you see the “Potential Disagreement” popup…

    • and you are confident it’s mis-ID’d, click the orange button,

    • or if you’re uncertain beyond your ID, click the green button



  3. Do this by 2024-01-31

Here is the subset of 10 observations that we think you can identify based on your activity on iNat. Please add the finest ID you can to each of the observations before 2024-01-31.

We’ll calculate accuracy by comparing your ID to the Observation Taxon. You can skip observations where you’ve previously added an ID if that ID is still relevant. For more on how we will count agreements and disagreements, keep reading.

IDs equal to (A) or finer than (B) the Observation Taxon will be counted as Agreements.

IDs on different branches (C) or coarser than the Observation Taxon where you choose “No, but…” to the “Potential Disagreement” dialog (D) will be counted as Disagreements.

IDs coarser than the Observation Taxon where you choose “I don’t know but…” to the “Potential Disagreement” dialog (E) will be counted as Uncertain.

We’re so grateful for your help as an identifier on iNaturalist, and thank you very much for participating in this study. Please read this blog post to answer frequently asked questions about this experiment.

With gratitude,

The iNaturalist Team

Posted on January 17, 2024 07:00 AM by loarie

Comments

Sounds interesting.

Posted by sedgequeen 4 months ago

Note that in the instructions the link for the 10 observations is tailor-made for @loarie - please wait for your own letter - with your own set of observations - before getting excited and starting to identify them.

Posted by tonyrebelo 4 months ago

Urgently necessary when looking at Tracheophyta, especially for common and "simple" species.

Posted by chbluete 4 months ago

Great there is work being done on this!

"If an identifier had made at least 3 improving identifications on a taxon, we considered them qualified to validate that taxon"
At first glance this seems like a confusingly low bar - why so low? ... and if this is countered by the use of 5 people, are you taking the top 5 for each taxon or are they chosen at random from the pool?

Posted by sbushes 4 months ago

Since hard disagreement provokes iNat to Ancestor Disagreement, some of us prefer to use the 'Uncertain' option.
Altho we are quite certain that it is NOT the previous ID.

Posted by dianastuder 4 months ago

Good point tonyrebelo - I edited to clarify:
“We’ve copied them below in case you’re curious (note: instructions will have subsamples tailored to each validator. The 10 here are just meant to serve as an example)“

Posted by loarie 4 months ago

I'm very interested in the outcomes of this study. Sounds great!

Posted by wdvanhem 4 months ago

Initiative highly welcomed. I can't wait to see the results.

Posted by chacled 4 months ago

Exciting! Am also very curious about the results

Posted by ajott 4 months ago

Wonderful! I’d love to participate!

Posted by tmessick 4 months ago

Have you thought about doing this by regions to get even more accuracy (i.e., subsampling from countries or subcontinents)?

I think it could be very useful because tropical regions have lower observation/species rates than subtropical regions, and it could bias the subsamples analyzed towards temperate biodiversity.

Also, I'd love to participate too.

Posted by julianbiol 4 months ago

It's great that you're working to understand the accuracy of iNat IDs. I'm looking forward to the results.

Posted by lynnharper 4 months ago

Glad this is happening! I'm also curious if you'll be able to say anything about how iNat accumulates identifications over time through this design. For example, what will be the differences for observations randomly selected for validators from, say, 2014 vs 2024? I've found the proportion of research grade observations increases with each previous year (i.e. 2014 has a higher % of RG vs 2024), and am curious about the degree to which it's time passing and more identifier effort accumulating, and the degree to which a bigger, more mature iNat is observing more taxa that are challenging to identify by pictures.

Posted by muir 4 months ago

That's super nice!!! Congrats to the team!

Posted by amarzee 4 months ago

I found it interesting that the experiment sample consists mostly of RG obs even with the big backlog of Needs ID observations on iNat, but I guess it's related to the observations being from recent years.

Also, I really like the new illustrative images that you're using!

Posted by roysh 3 months ago

This is awesome! Estimating the accuracy of iNaturalist observations is really important and I'm so glad that this is working towards that. I'd love to participate and help in any way that I can. I'm also very interested in the results.

Posted by cs16-levi 3 months ago

I always wondered what improving identification means. Thank you for the very clear explanation.

Posted by trscavo 3 months ago

Great! To improve research-grade quality on iNat, observers that can't ID their own observation at once should not be allowed to confirm a genus or species ID given by somebody else. This is the way most wrong research-grade IDs happen. Maybe it is possible in this study to evaluate the percentage of these cases. I would guess 80% to 90% of the incorrect research-grade IDs at species level.

Posted by gernotkunz 3 months ago

@gernotkunz - I for one often only get round to identifying my observations a few days after uploading them. Especially the more difficult ones. Very often they are identified before I get around to them.
And often one just needs a nudge to be able to make an ID. If one is told the genus (and an ID to species does that), it is often easy to ID to species, and agree if needs be.
It is true that it is far easier to confirm that an ID is incorrect than to verify that an ID is correct.
But there are many alternative options for improving ResearchGrade quality on iNat to that of preventing observers from confirming an ID on their own observations.

Posted by tonyrebelo 3 months ago

Yes, that is a problem, I agree. Observers will probably always find ways to reach research grade, if they want to ;-)!

Posted by gernotkunz 3 months ago

Please sign me up for this!

Posted by lenrely 3 months ago

Also, thank you to everybody who has expressed interest in participating. The way we selected candidate validators for Experiment 1 was just based on trying to find the smallest set of people with 3 improving IDs for the relevant taxa needed to cover the sample, ideally with 5 IDers per sample. So we're not choosing based on willingness to participate, just based on an algorithm that tries to meet those criteria.

I'm a bit disappointed that I wasn't selected as a validator for Experiment 1. Maybe I need to branch out from just IDing crabs...

But if we don't get a good response from this approach, we might need to focus more on recruiting from among people who are proactively willing to participate. So thanks so much for your interest!

Posted by loarie 3 months ago

Nice! I like it! I just reviewed my set of 9 observations

Posted by arielflorentino 3 months ago

Great experiment! I just got my set of five observations to validate. I do have a couple of suggestions for improvement of the experimental design. 1) One of my observations is casual because it has no locality information. I don't think casual observations should be included in this experiment, because few identifiers are knowledgeable on a worldwide scale, which adds significant uncertainty to the experiment. Most people have strong geographical preferences when they do their identifications, and those should be taken into consideration when selecting samples. 2) One of my observations is from India. I am based in Canada, and I never identify observations of the sample taxon from India because this is far outside my area of expertise. The initial ID is at family level, which I am easily able to confirm, but I am still wondering whether this sampling approach will provide very meaningful results. I am still glad, though, that this study is being done. The experimental design just needs a little tweaking in my opinion.

Posted by matthias22 3 months ago

@loarie yes exactly! I see this pattern in most cases where wrong IDs get research grade. You only need one "non-expert" who thinks he/she can ID to species level and the observer to confirm. In most cases "iNat-beginners", but also people that want to give their research grade observations a push. I must admit, in some cases I do that too, but only if 1. I trust the expertise of the ID'ing person completely and I don't believe that somebody else could have the expertise to confirm the ID, or 2. I sampled the specimen and gave it to a specialist for genital verification. In that case it is not possible for another specialist to verify the correctness of the ID. But of course, in both cases the ID can still be wrong, though not likely.

Posted by gernotkunz 3 months ago

Welp, let me be the first to ask, "What do I do with an observation like this?" https://www.inaturalist.org/observations/10056133

Posted by stevejones 3 months ago

Would be good to publish the results of the experiment ;-)!

Posted by gernotkunz 3 months ago

@stevejones I would filter the area for Papaveraceae and see if there are close up pics of the flowering species. Then post the link in the comments.

Posted by gernotkunz 3 months ago

The problem is I know the species well (and its subspecies) but there is no way to determine a target organism, or to ID anything in the frame as the organism named.

Posted by stevejones 3 months ago

I think adding a coarser non-disagreeing ID (at Angiosperms etc.) is the way to go - it will be coded as 'Uncertain'.

Posted by loarie 3 months ago

Thanks, Scott, sounds good; will do.

Posted by stevejones 3 months ago

I received 12 observations to identify and I am wondering if we are allowed to make an effort to confirm an identification using online resources, or only based on what we know?

Posted by joemdo 3 months ago

@joemdo - please use any resource you can. For this experiment, we're interested in whether the Sample is correct or not, not the skill of the Validators - though that's interesting too!

Posted by loarie 3 months ago

Interesting... one of the observations selected for me to review was an RG observation with 5 of the same IDs. That taxon belongs to a VERY tricky group whose taxonomy is still very uncertain, so the community taxon was a reasonable one at the time all of the IDs were made (2 or more years ago). Since then, a new taxon (a complex) was created, which made those IDs less appropriate. A couple of the IDers are active and might change their IDs as a result of me adding mine. This brings up a couple of points:
1) I don't see how this would be possible, but it'd be interesting to have some way of seeing how many observations are "inaccurate" due to new information or taxonomic changes since the IDs were made. These factors may be making the final data look superficially worse than it really is. With the passage of time, the number of these now-inaccurate IDs will inevitably increase, so tracking this over time would be really interesting.
2) This is better suited for a different experimental design, but tracking the changes in IDs after a dissenting ID has been added would also be interesting. One of the features of science is that it is willing to update its beliefs with new evidence, so having some sort of measure for the iNat community's willingness to update their IDs would essentially measure the "self-correctingness" of the iNat community.

EDIT: Just in the time it took me to compose this comment, someone already changed their ID and therefore changed the community taxon to the complex :)

Posted by davidenrique 3 months ago

I guess the experiment wasn't set up to choose samples from a verifier's geographic area of expertise. It's difficult for me to know whether IDs are correct for observations from way outside the area I'm familiar with. I mean, I guess I could study up on the fauna of other areas, but who has time for that?

Posted by rcavasin 3 months ago

I and another iNatter were asked to review a poor photo. We both agreed that it could not be identified down to species - perhaps not even to Life. The original poster deleted the entry after our comments. I'm not sure if that was the intent of the experiment.

Posted by richardlitt 3 months ago

@rcavasin - just add the finest ID you can, even if it's a coarse ID. @richardlitt, that's an interesting scenario where the observer deletes the obs - I hadn't thought about that. That one will probably need to be coded as Uncertain.

Posted by loarie 3 months ago

Ok I got my 4 suggestions
One I thought I’d already ID’d but I hadn’t
This one is definitely in this genus https://www.inaturalist.org/observations/118523513
I do a lot of cacti observations and the IDers want subspecies

Posted by ck2az 3 months ago

Done. You gave me really easy ones. (Not objecting.)

Posted by sedgequeen 3 months ago

With cacti I’m comfortable doing an ID to species
But we definitely have some hardcore IDers on here that insist on subspecies
I love desert flora and fauna and there is definitely many others on here that love it too
😊😊😊😊

Posted by ck2az 3 months ago

Thanks everyone for the great response so far! At this time, 261 validators have responded (21.0% of the people we contacted) and validated 680 samples (68.0%). That gives us current accuracy bounds of:
Accuracy (lower): 0.598 +/- 0.0152
Accuracy (higher): 0.981 +/- 0.0043
Hopefully once more validators respond we'll be able to reduce the number of unvalidated & uncertain samples which will narrow those accuracy bounds to something more informative. We appreciate the experiment design feedback as well. If this overall infrastructure works for running these kinds of experiments, there's lots of tweaks we can make to the design.

Posted by loarie 3 months ago

We received only 2 in our subsample and they were both moths which we almost never identify. We work on caterpillars with k8. This wasn't very interesting for us

Posted by thebals 3 months ago

@thebals - that caterpillar life-history nuance is really interesting. I agree, if a person is very skilled at ID'ing one life stage but not others, selecting observations to validate based on taxon alone isn't going to produce great matches. Definitely good to think about for future tweaks

Posted by loarie 3 months ago

Got my 3. Interesting survey.

Posted by rangerpuffin 3 months ago

I have one observation for me to review where there are already a number of identifications to species level. I could confirm it to genus level, but do not have the knowledge to confirm the species. I have therefore not just automatically agreed to the identification.

Posted by katebraunsd 3 months ago

Oh, I've discovered the letter in my junk mail. Surely, I am on board!

Posted by apseregin 3 months ago

261 validators have responded (21.0% of the people we contacted)

I would prefer to be contacted via my iNat Inbox. Then you will get an immediate response. (I haven't even opened my email yet today.)

I wonder - for the next batch you could compare email response vs inbox response.

Posted by dianastuder 3 months ago

I received 18 mushrooms for validation, most of them from another continent. Normally I validate nearby species, but I did my best in this case. Maybe it would be interesting to do another experiment based on more nearby species. Just helping with ideas. Curious about the outcome of this experiment.

Posted by rudolphous 3 months ago

I'm willing to help and participate as well if you change the way you select people

Posted by amarzee 3 months ago

@amarzee What is your preferred way of selecting people?

Posted by rudolphous 3 months ago

@katebraunsd per the instructions, do the best YOU can do (in a reasonable time). If something's already at RG and you only know it's an animal, then just ID it as an animal (non-conflicting, obviously) so they have stats to work with

Posted by reiner 3 months ago

One of my pair is at RG with an ID from the scientist who did a PhD on the taxonomy of that family last year. It is a blurry green picture - the best I could do would be dicot. Even without 'hard disagreement' that feels like a foolish ID from me. I left it as is.

PS went back and added Difficult Dicot. Blurry green stuff rules.

PPS this 5 year old comment is already there - and NOT from me ... another one to piss off the fab man

Posted by dianastuder 3 months ago

@dianastuder they want the statistics; rather than skipping that one, you could put it in as dicots and link back to this post as an explanation

Posted by reiner 3 months ago

@dianastuder - your Aspalathus example (sorry, it was in my set too: https://www.inaturalist.org/observations/23721842 ) is in some regards similar to the Poppy example https://www.inaturalist.org/observations/10056133
To an expert in the group, or to a local fundi who knows all the local species well, there are only one or a few options that it can be. People less familiar with the local taxa or the area would be either less certain or utterly bewildered. The big problem though is connecting the experts with this sort of observation - these are not the sorts of observations that specialists normally bother with.
What I find really irksome are those users who, because they are unable to make an ID, unilaterally declare that further ID is impossible and post a disagreement with everyone who dares to know more than they do.

Posted by tonyrebelo 3 months ago

For a good clear picture which shows field marks, I would @mention a taxon specialist.
But. Sorry. Not for blurry green stuff. I hate to ID as dicot ... (an almost acronym for idiot - I C dicot :~((

Posted by dianastuder 3 months ago

I have no idea how you would code this, but what if the study were blind - that is, the selected validators are not shown anyone else's ID at all?

Posted by arboretum_amy 3 months ago

@arboretum_amy I was thinking the same thing, since several of my "random" test observations have now been verified by multiple experts. I think that introduces some bias.

Posted by wdvanhem 3 months ago

But. iNat wants to compare ours to previous IDs - for this experiment.

Posted by dianastuder 3 months ago

My concerns have been addressed several times above (not answered) - I was asked to ID observations in areas where I never ID anything and would not normally try, with very different floras, although the existing IDs were of genera I have done a lot with. I think this really biases the results - I don't know the variations there of my known species, or the local species in the genera, or look-alike alternatives. I called one a dicot, put another in the genus. This experiment includes approaches I don't use, so it isn't really testing IDs by someone with my approach. And it is upsetting.

Posted by patswain 3 months ago

@patswain I agree. It would be better to take location into account when assigning identifiers.

Posted by trscavo 3 months ago

I agree. I had 6 observations, 2 from Europe where I never identify. Both were of species that are alien invasives in the Cape, which I am happy to ID in the Cape, but I have no idea of the many very similar species and even genera in Europe. Another was of an easy-to-ID bird species in the Cape, but from Malawi, where there are several similar species that I am unfamiliar with and normally steer well clear of. But rather than upsetting, I enjoyed the challenge, although I could not resolve the Cichorids.
I would suggest that verifiers should be drawn from a local, rather than global, identifier pool.

Posted by tonyrebelo 3 months ago

Just a quick update: at this time 441 validators have responded (36.0%) and validated 896 samples (90.0%). Our accuracy bounds overall are now:
Accuracy (lower): 0.798 +/- 0.0133 [where we count uncertain as incorrect]
Accuracy (higher): 0.956 +/- 0.0062 [where we count uncertain as correct]
Let's give more time for validators who haven't yet responded to do so; then there are some things we can try to get more discussion on obs that have been reviewed but remain uncertain. Also, please keep the great feedback for future experiments coming.

Posted by loarie 3 months ago

@loarie I love this experiment. Can you please explain the Accuracy number (e.g. 0.798 +/- 0.0133) as if I were a fifth grader? Looking at the numbers you just posted in the update, is it correct to say that iNaturalist observations are accurately identified 80-96% of the time? With an added caveat about counting uncertain IDs as incorrect. How would you frame it in plain language? No expectations that you will explain here in the comments, but it would be good to have it in the follow-up blog post on the experiment's findings.

Posted by muir 3 months ago

Also, since we already have 2 sets of numbers - how about a table at the bottom of the blog post to compare the moving targets (with plain-language footnotes - better quality IDs, or not?)

Response from 21%
36%
??

Posted by dianastuder 3 months ago

Remember this study only includes RG observations. In some taxa, the Needs ID are significantly worse.

Perhaps opting out should be a possibility, for cases that are outside one's geographic area. Contact you, get replaced.

Posted by sedgequeen 3 months ago

@sedgequeen I think it's ~50% research grade - see the bar chart at the top titled Quality Grade. Probably the sample has the same distributions as all iNat data, and therefore conclusions on this subset should be in line with all data. Maybe we are seen as the best validators, but that might not be the case for this dataset since we validate outside our normal regions :-)

Posted by rudolphous 3 months ago

Sorry. Apparently I was wrong - again.

Posted by sedgequeen 3 months ago

I thought it was interesting to be asked to ID several observations that were well outside of my usual region of concentration (plus many more observations in my usual region). I used iNat's own taxa pages to determine, first, if the species I thought it was had even been observed at the Research Grade level in that place, and second, what other species I could confuse it with from that place. That's certainly more work than I usually put into my day-to-day identifying, but it was a useful exercise in expanding my knowledge of the range of certain species and their look-alikes.

Posted by lynnharper 3 months ago

@reiner Putting an ID at genus level when there are already a number of agreeing species-level identifications does not seem to be helpful for anyone.

Posted by katebraunsd 3 months ago

Can we have a cut-off date, when we can go back and remove our IDs, which were added to serve this experiment? (If we want to - mine feel either superfluous at best, or challenging the previous finer ID)

Posted by dianastuder 3 months ago

I also got several that were located outside the region where I usually ID. For some of them (honeybees), this didn't really matter. For others, I think my ID is probably more reflective of the knowledge of the IDer than the correctness of the current ID.

I agree that different life stages (e.g. larva vs. imago) and types of media (photos vs. audio) are other areas where having successfully ID'd a particular taxon doesn't automatically translate to being able to ID all types of observations of that taxon.

I don't know if there is a way to distinguish in the results between people adding a higher ID because the evidence is unclear, and people adding a higher ID because they aren't the right person to ID the observation - in the latter case, I think under normal circumstances most users would mark as "reviewed" or the equivalent rather than adding an ID, so an option to abstain instead of adding an ID might better reflect that in this experiment.

Posted by spiphany 3 months ago

Glad that there is a focus on determining the accuracy of what's brought forth on the platform. This largely depends upon the subset of qualified individuals and taxa. Interested to see what occurs here.

Posted by kyleprice1 3 months ago

Excellent initiative by iNaturalist to do this. I was assigned several plants from outside my region and do not personally think that is a problem. Looking forward to the results.

Posted by owenclarkin 3 months ago

I also agree that DMs through the iNat inbox would be good for the next round, perhaps in addition to emails.

Posted by kevinfaccenda 3 months ago

Very cool project. Excited to see the results. So far they are better than expected, but this may be because of the types of observations I focus on (often casual grade).

I went and ID'd my sample of 2 observations, both of which were fairly typical for the type of observation that I normally ID. But I have a question based on the above FAQs, which I didn't read until after I submitted my IDs.

Is it better to use my typical ID confidence strategy or aim for the 99.99% certainty described in the post? I would say I normally aim for about a 90% minimum level of accuracy, meaning that for the IDs I am most uncertain of, there may be about a 1 in 10 chance that my ID is incorrect. This is because many of the observations I ID are neglected, poor photos, obscure taxa, etc., and otherwise no one will ID them. Because I ID a wide variety of species, I also like to learn if my ID heuristic is incorrect, which I won't learn if I never go out on a limb.

In addition, 99.99% is an extremely high level of confidence, which I doubt many IDers are using as a lower bound. This means only about 1 in 10,000 IDs will be wrong - perhaps fewer if we assume that some observations will have a higher than minimum level of confidence. I have about 20,000 IDs, so this suggests that I would only have about 2 observations ID'd incorrectly out of all of those. Seems unlikely for me, and honestly for most other experts I've interacted with on iNaturalist too. I have both been corrected by and have corrected some of the most knowledgeable people on here. It is bound to happen as we cannot be perfect.

One of the observations was a single photo that didn't clearly show the traits that I typically use for ID on that taxon. It was also missing its location, so overall it's probably close to that 90% threshold I usually use. Should I go back and submit a coarser ID that I think I can be 99.99% confident of? Or should I stick to my probably correct ID even though there's a notable chance it could be wrong?

Posted by alexbinck 3 months ago

I got the email, but the link to the 3 personalized observations doesn't work.

Posted by naturesebas 3 months ago

Hi folks,

We're up to 491 responding validators (40.0%) and 919 validated samples (92.0%). The proportions of the entire sample are now:
0.823 correct, 0.131 uncertain, 0.046 incorrect

Hopefully as more validators respond, the uncertain fraction will continue to shrink.

For just the research grade subset, the proportion correct just crossed 0.9:
0.901 correct, 0.077 uncertain, 0.022 incorrect

For comparison, the proportion correct of research grade plants from the southeast US was assessed in this recent paper to be 0.84.

Here are the proportions for some other facets of this sample (the horizontal dotted line is 0.9; the missing iconic taxon name group labels, in order, are Amphibia, Arachnida, Mammalia, Plantae):

We're really happy with how this experiment is going so far. Thanks to everyone for validating your samples. Also remember only 40% of validators have responded and we gave people until Jan 31 to respond, so hopefully there are still a lot of new validations coming in. It would be great if we could eliminate as much of the Uncertain group as possible - remember the Uncertain proportion comes from samples with no validation yet (81 samples) and samples whose only validations are coarser non-disagreements (40 samples). In theory, more skilled validators responding should reduce both of these sources of uncertainty.

Posted by loarie 3 months ago

Curious how a situation where the observation was Needs ID at family and is now RG at genus will count? This happened with one of the aphids I was sent.

Posted by egordon88 3 months ago

Note: the announcement popup on user dashboards contains a relatively important typo! "Assess" has a total of four s's, not three.

Posted by natev 3 months ago

egordon88, it’s all compared to the sample at the time it was sampled, so that would be counted as Needs ID.
natev, thanks, fixed

Posted by loarie 3 months ago

I don't know whether to be flattered or insulted that half my samples were Human. I don't think I have any special expertise in identifying humans!

Posted by vireyajacquard 3 months ago

Very excited that you're conducting these experiments and you have the testing infrastructure to run them. Chapeau! I hope someone in the CSCW / CHI community is paying attention and offering to partner for more research.

Posted by radrat 3 months ago

This is really exciting. Thanks for the great work, and thanks to all the identifiers who are participating or generally adding identifications

Posted by hedaja 3 months ago

Looking forward to seeing the final results of this experiment. Has every observation in the sample been assigned the same number of validators? Or do some have more, others fewer? I noticed on my set that some observations have quickly accumulated a bunch of agreeing IDs at species level ("easy" to ID), while others are getting much fewer responses, and one is getting about equal amounts of species and non-disagreeing genus IDs. There's probably data in here to estimate community confidence of RG IDs based on how many IDers weren't confident enough to call the species. This might go beyond the scope of this experiment but I imagine there could be several shades of green in the graph indicating "correct with high confidence" vs. "correct with low confidence."

Posted by annkatrinrose 3 months ago

Is a human going to go through and read all the comments and comments-in-IDs that have been added to the observations over the course of the experiment? There are several caveats on the observation I was assigned. I hope they are not common enough to make a link to the observation generally interesting, but I do want to make sure that whoever's doing the data analysis sees them.

Also, if anyone's keeping track, I am very glad I got assigned a caterpillar and not an adult, because I would not have been nearly as useful on the latter

Posted by bugbaer 3 months ago

Africa leads for correct IDs - wow!

Posted by dianastuder 3 months ago

Fungi having the most incorrect IDs seems accurate.

Posted by lothlin 3 months ago

As a matter of interest, would it be possible to determine from the data set the proportion of identifications that improved, versus the proportion that degraded (of course the majority would presumably remain the same) over the duration of the experiment (as community ID)? Although given that 70% of the experimental observations were species rank, relatively few of these could improve (to subspecies or var.). I am mindful that the experiment has probably alerted many users and identifiers to reviews of their identifications and that many have responded to these IDs.
I am particularly surprised about the fungal results: a 90% accuracy seems unbelievable - but if many of these are at higher taxonomic ranks, then that might explain the result.

Posted by tonyrebelo 3 months ago

I'd be curious to have a look at the entire set of fungal observations to see what was included; some species are in fact really easy to ID so if these well understood species were a high percentage of the total fungi species, that percentage might be true.

Posted by lothlin 3 months ago

(after the experiment: cannot interfere with the experiment now ...)

Posted by tonyrebelo 3 months ago

I did not get an email even though I have almost 100,000 IDs and 50,000 observations. Was there a minimum criterion for getting an email?

Posted by yayemaster 3 months ago

(of course, after the experiment :D)

Posted by lothlin 3 months ago

Research Grade observations apparently have an accuracy rate of at least 90% and possibly up to 95% if uncertain identifications are also considered, which is impressive. This accuracy rate is higher than that of many large scientific collection datasets that are on GBIF because these cannot be fully curated to an optimal level. It will be interesting to perform and publish this analysis later on.

Posted by danielcahen 3 months ago

@yayemaster maybe your expertise wasn't covered in the sample set this time? Like no email to me = probably no Vinca pics in there, maybe not Houstonia either? ;)

Posted by lotteryd 3 months ago

It is really interesting. Honestly, I was a little surprised when I received an email about this.
Thank you for your actions helping nature, iNat administration!

Posted by makarii_loskutov 3 months ago

Very interesting experiment, I applaud the initiative! That said, I'm a little confused by my assignments and the methodology here, namely that I was assigned 2 aphid observations, one in Australia, one in Austria.

It's a hugely diverse family of nearly 200,000 species, so it's a little bit of a mystery to me that I could get assigned this group when I don't generally identify aphids.

Maybe that was part of the methodology to be unbiased and select identifiers who weren't necessarily experts, but showed some ability to pick out correct IDs ('improving' based on the above notes). My constructive criticism would be that it might make more sense to have some sort of threshold on this in the future, maybe something like >50-100 improving IDs on a given taxon, looking at finer taxonomic levels, and (as many above have already mentioned) geographically bounding by areas the identifier participates in (90% of IDs at least on the same continent?).

I'm probably just not fully understanding the study design, and even if I'm not, no study is perfect (so many variables to consider here!!), and I think this is a good start.

Last issue BTW, you may want to emphasize that identifiers will be contacted via the email they registered under, and not through the iNat messaging system, since I expected to be contacted through this system for a day or two, and then realized later it might be in an email I don't check too regularly.

Posted by yerbasanta 3 months ago

@yerbasanta - you bring up good points, and others in this thread have mentioned similar things.

The methodology for finding validators was pretty simplistic: we just looked for people who had made >3 improving IDs within the observation taxon. For example, you have >3 improving IDs within Aphididae
https://www.inaturalist.org/identifications?user_id=436984&taxon_id=52381&category=improving
so according to this criterion you were eligible to be assigned any observation at Aphididae. We don't expect you to be able to improve the ID to species, but we hope this past activity means you can recognize family Aphididae and offer supporting IDs at the family level if it's correct, or disagreements if it's a different family.

One thing we haven't talked about is that besides the accuracy of the iNat dataset (proportion correct), we're interested in measuring precision (how finely identified the dataset is). We could have a 100% accurate dataset if everything was ID'd at State of Matter Life, but it would be very imprecise. Accuracy and precision trade off - so we'd like to track both.

It's not exactly clear what the best way to measure precision is. What we're doing as part of this experiment is counting the number of leaf descendants for a taxon entered into iNat (e.g. Family Aphididae has 2082 leaf descendants on iNat) and measuring precision as 1 / the number of descendants. An observation at the species level with no descendants would have a precision of 1 / 1 = 1, and Aphididae would have 1 / 2082 = 0.0005. If you measure precision this way, the mean precision of the sample is 0.7317. Some issues with this approach are that Aphididae probably has many species not entered into iNat, so 2082 is probably low. Also, we're not including ssp in the precision calculation because so few people use them to ID relative to how many have been entered in the taxonomy (similar rationale to why Research Grade kicks in at the species level and not the ssp level).
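
In code form, that metric is simply (a sketch):

```python
def id_precision(n_leaf_descendants):
    # Precision of an ID at a taxon with n leaf descendants on iNat.
    # A species with no descendants scores 1 / 1 = 1; Family Aphididae,
    # with 2082 leaf descendants, scores 1 / 2082 = 0.0005.
    return 1.0 / max(n_leaf_descendants, 1)

# The mean precision of the sample is the average of id_precision over all
# observations - reported above as 0.7317 for this experiment.
```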

Re: geographic issues - I agree that while a person can identify a taxon in one location, they might not be able to identify it in another location because they don't know the other alternatives. For example, I feel confident identifying Smooth-handed Ghost Crab in Australia but not in Southeast Asia, where there are many other look-alikes. Since we're not using location at the moment, we expect validators to just enter the finest ID they can in these situations (e.g. a non-disagreeing ID at the Ghost Crab genus), but it would be great to find better ways to match candidate validators with samples they can help with.

Posted by loarie 3 months ago

Also a quick update on progress. We've currently had 547 validators respond (44.0%) and validate 938 samples (94.0%). We're realizing that it may have been a mistake to contact validators via email rather than messages, since many candidate validators have been active on the site since we contacted them but for whatever reason haven't opened the email we sent them - maybe it got stuck in a spam filter etc. But because we set the deadline at Jan 31, we don't want to make assumptions and be too aggressive trying to contact them again.

But moving forward we'll probably try using messages instead of emails. We piloted this today by messaging validators who did respond but who can still help with more unreviewed samples (based on the >3 improving ID criterion) that we didn't originally send them. This group could review 20 of the 35 observations that remain unreviewed.

Thanks for everyone's patience as we work out the kinks for how best to orchestrate these experiments. We're learning a lot!

total sample:
0.84 correct
0.112 uncertain
0.048 incorrect

RG sample subset:
0.916 correct
0.06 uncertain
0.024 incorrect

Posted by loarie 3 months ago

@loarie thanks for replying to my post with more details on the study design, since I was curious. I'm amazed that fish IDs appear to be identified 100% correctly! Some labels are missing, I think, in the 3rd graph; I'd be interested to know which was 'plants' and how that went.

Also, I don't think it would be considered 'aggressive' to send a follow-up message on iNat if you explained the purpose and kept it brief with a reminder to check their registered email (and possibly spam folders) just in case that was the reason for non-response.

I would have appreciated a reminder or clarification message here (since I was a little confused about how contact might happen and skimmed the wrong email at first). I obviously can't speak for everyone, though - who knows what someone else might consider aggressive or pestering?

Posted by yerbasanta 3 months ago

I received the email and want to participate, but when I click on the link it doesn't work. I don't understand why.

Posted by amanithor 3 months ago

Why is Africa leading for correct IDs?
Relatively fewer obs and many taxon specialists working on those IDs?

'the remaining 35 observations that remain unreviewed'
Can we have a Round 2 after your cutoff date? We could all tackle the residue.

Posted by dianastuder 3 months ago

@yerbasanta as per previous graphs:
missing iconic taxon name group labels in order are -
Amphibia, Arachnida, Mammalia, Plantae

Posted by tonyrebelo 3 months ago

I'm surprised that apparently 100% of arachnid observations were deemed correct so far. I understand many spiders can't be ID'd to species without dissection/microscopy. I'm wondering how many of those are confirmations of higher taxonomic levels vs. species IDs. I'm also surprised by the apparently high amount of uncertainty among amphibians and the wrong reptile IDs. Our local herpers are usually killing it with IDs on those, making birds and herps the two groups receiving the fastest IDs, at least locally. Fungi having the highest proportion of wrong IDs is not surprising to me.

Posted by annkatrinrose 3 months ago

A lot of the things that have been surprising people are just due to the fact that randomness is clumpy and sample sizes are quite small. If you look at the graphs in the original journal post, you'll see that there were.... what.... some 30 observations of arachnids? It's hard to determine the exact number from the graphs, but there were even fewer ray-finned fish. The number of herp observations is also really small. Even if they were all species-level IDs, it wouldn't be super surprising if they all just happened to randomly be very common and easily identified species.

Same goes for the African sample. To me it also looks to be in the vicinity of 30 observations. Weirdness should be expected with such tiny samples. We will need to wait for full-scale experiments in order to be able to draw ANY conclusions about the smaller sample groups.

Posted by davidenrique 3 months ago

I wonder if we should not be using the terms "Confirmed" and "Queried" rather than "Correct" and "Incorrect"? "Uncertain" is OK.
Somehow, to my mind, "Correct" and "Incorrect" imply an expertise and independent assessment (both in terms of participants and specimens) somewhat beyond the scope of this experiment.
Then it would be unnecessary to use terms such as "deemed correct."
@annkatrinrose: My gut response to your query about herps was that the IDs are often local, whereas this experiment drew from international observations. But on second thought, that is true of all taxa, not just herps. Although if invertebrates were only identified to higher taxonomic ranks, then it is easier to agree than for herps, which may be identified to species by local identifiers but only to family or genus by the non-local identifiers selected in this experiment?
Can't wait to see the detailed results ... only 10 more days to go ...

Posted by tonyrebelo 3 months ago

It would be interesting to perform another experiment with only fungi to understand better which taxa, years, or continents are wrongly identified. A wild idea would be to increase the threshold for these hard groups before reaching research grade. Or give certain users a higher/lower vote count based on historical correctness.

Posted by rudolphous 3 months ago

@rudolphous it's every fungus that isn't well known but has a well-known look-alike - see the fact that there are 13k mushrooms IDed as Amanita bisporigera (only 2.2k research grade though) even though there are multiple other white sect. Phalloideae mushrooms in the same range. Meanwhile, there are 89 observations (61 RG) of Amanita suballiacea, but the range map of those RG specimens (27 of which have been confirmed by DNA sequencing) is about as broad as that of A. bisporigera. The disparity, in actuality, probably isn't as wide.

https://www.inaturalist.org/observations?place_id=any&subview=map&taxon_id=926048&verifiable=any&view=species&field:DNA%20Barcode%20ITS= this link shows every mushroom from Amanita sect. Phalloideae that has an attached ITS sequence on iNat. Just based on this very rough look, I would expect A. amerivirosa and A. magnilevaris to also have more observations than they do.

But for the most part, all the east coast pure white sect. Phalloideae shrooms just get shoved into A. bisporigera.

Posted by lothlin 3 months ago

I'm sad now; my link doesn't work and I can't participate. @loarie sent me the link by message and it still doesn't work

Posted by amanithor 3 months ago

@lothlin Thanks for the information.
@amanithor Maybe another browser, computer, or device works? You can also try to open a tab in incognito mode if you are using Firefox or Chrome and follow the link here. The link contains a few observation IDs. Maybe you can search for the observations yourself based on the observation numbers and add IDs instead of using the link?

Posted by rudolphous 3 months ago

Excellent, following!

Posted by jasonrgrant 3 months ago

iNat has some really knowledgeable and caring curators. I thank them from here.
I believe that those making identifications on iNat must have a certain competence.
The use of wikis in the identification and distribution references of some taxa undermines reliability.
It can be seen that some incorrect identifications are not corrected even when objections are raised with justified and consistent reasons.
Greetings to everyone.

Posted by kavurtdagli 3 months ago

Yesterday we messaged a reminder to candidate validators who didn't respond to the original email and now have a 62.0% response rate and 98% of the sample validated. So far the results for the whole sample are:
86% correct
9% uncertain
5% incorrect

RG subset:
94% correct
4% uncertain
2% incorrect

We're excited to share the final results next week. Thanks again to everyone who participated in making this experiment possible.

Posted by loarie 3 months ago

I've now responded, whoops, didn't see it in my email...

Interesting idea that I'm certainly a fan of and wouldn't mind doing in the future!

Posted by radbackedsalamander 3 months ago

Great to know about this project. I will participate thoroughly, thanks for informing!

That will surely fine-tune a lot of identifications, especially in Brazil (even if things funnel out to family or order) - great initiative!

Although I would suggest sending the next steps via message because I (and I think most of us?) rarely check e-mail when it comes to iNat ID comments or other things 😬.

Posted by gianlluca_au 3 months ago

Agreed... And I think mine might have gone to spam as well.

Posted by radbackedsalamander 3 months ago

@loarie "many candidate validators have been active on the site since we contacted them but for whatever reason haven't opened the email we sent them - maybe it got stuck in a spam filter etc"

Just sharing that I did receive an invitation to participate via the on-site messaging system but, after thoroughly checking my email archives and spam box, I can confidently say that I never received the original email or any reminder. There may be an issue with your mailer; using the on-site messaging system seems like a much better option to maximize the response rate.

Also looking forward to learning more about the study design. As someone who identifies almost exclusively birds, I was surprised to receive an invitation to review a plant observation.

Posted by radrat 3 months ago

@radrat - we added you, and a few others who commented on this post and expressed interest in participating but weren't selected by the process that generates and assigns the sample, as additional validators after the fact. That's why you didn't receive the original email.

We've now had 844 validators (69%) respond, with 96% of the sample validated and an average of 4 validators per sample.

Once the deadline is passed on the 31st we'll share a new feature for exploring these results. Thanks again for all the participation!

Posted by loarie 3 months ago

@loarie thanks for the clarification and additional updates!

Posted by radrat 3 months ago

Then you need many more experts from all countries...

Posted by ozgurkocak 3 months ago

I didn't receive the original email. Perhaps it got stuck in my spam folder. I'll have a look. Meanwhile, I'll look over the observations you suggested for me.

Posted by beartracker 3 months ago

@loarie
OK, the original email is not in my spam folder, not in my junk folder, and not in my Inbox. Sorry. I never got it. Is there a link to the particular observations you'd like us to look over?

Posted by beartracker 3 months ago

I would love to see a similar project to assess the accuracy of animal track observations on iNaturalist!

Posted by beartracker 3 months ago

I would like to participate too, this round or the next one :-)

Posted by misumeta 3 months ago

A bias may result from iNat vocabulary, at least in French:
After a first ID has been suggested, users are expected to ACCEPT the ID. Accepting doesn't require skills; if I'm not skilled, I will naturally accept all suggestions.
It would be better to use the word CONFIRM, since confirming an ID requires skills.
In French, "confirmer" instead of "accepter".

And a notice could appear when the mouse reaches the word CONFIRM: "only confirm if you are skilled"

Posted by berzou 3 months ago

Here is an example of the most common kind of wrong ID at research grade (in my opinion... this should be evaluated):

https://www.inaturalist.org/observations/188999484

First a wrong ID by a "specialist" who lacks the expertise (at least in that region), and then the blind confirmation by the photographer.

It should be discussed how this can be avoided.

Posted by gernotkunz 3 months ago

One way to avoid it would be to actually post the correct identification, rather than merely leave a comment.
A few notes on how to tell the difference would also help.

Posted by tonyrebelo 3 months ago

or push the ID back to Complex - with the comment
https://www.inaturalist.org/taxa/1045199-Leiobunum-rupestre

Posted by dianastuder 3 months ago

We just posted the results of the experiment here. Thanks everyone for participating - we're excited to continue the conversation there!

Posted by loarie 3 months ago

@tonyrebelo the first time I didn't, just to demonstrate what the main problem is.
Complex is a nice solution, yes!

Posted by gernotkunz 3 months ago
