We estimate the accuracy of Research Grade observations to be 95%!

Thank you to everyone who participated in our first ever Observation Accuracy Experiment that we launched 2 weeks ago. From this experiment, we estimate the accuracy of Research Grade observations to be 95%. Keep in mind that these results are drawn from a relatively small sample size, but this is the first quantitative accuracy estimate we've had. You can explore all the results and click through to the sample here. We're very excited by these results and eager to build on them with more experiments!

As explained in the methods section, we generated a sample of 1,000 observations on January 16 and selected 1,232 candidate validators who each had at least 3 improving Identifications for the corresponding observation taxon. By today’s January 31 deadline, 887 (72%) of the candidate validators participated, validating 96% of the sample. On average each observation was validated by 4 validators. From these we calculated Accuracy (correct, incorrect and uncertain) and Precision.
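
As a concrete illustration of how validations roll up into these numbers, here is a minimal Python sketch (not our actual analysis code; the relation categories and the conflict rule are simplifications of the scenarios described later in this post):

    from collections import Counter

    def classify_observation(validations):
        """validations: one entry per validator ID, describing how it relates to the
        observation taxon: "agrees" (same taxon or a descendant), "disagrees"
        (a disagreeing ID higher up the tree or on another branch), or "uncertain"
        (a non-disagreeing ID higher up the tree)."""
        if not validations:
            return "uncertain"  # no qualified candidate validators, or none responded
        outcome = {"agrees": "correct", "disagrees": "incorrect", "uncertain": "uncertain"}
        results = {outcome[v] for v in validations}
        # conflicting validations are coded as uncertain
        return results.pop() if len(results) == 1 else "uncertain"

    def accuracy_summary(sample):
        """sample: a list of per-observation validation lists."""
        counts = Counter(classify_observation(v) for v in sample)
        return {k: 100 * counts[k] / len(sample) for k in ("correct", "uncertain", "incorrect")}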

Exploring the results

For the Research Grade subset, 95% were Correct, 3% were Uncertain and 2% were Incorrect. The average Precision was 99%.

By clicking on other tabs you can see the Verifiable subset (Research Grade + Needs ID) had a lower accuracy of 90% Correct, 5% Uncertain and 5% Incorrect. The average Precision was 75% with fewer observations sitting at rank species compared to the Research Grade subset. For the entire sample (including Casual observations) the accuracy was 87% correct, 8% uncertain and 5% incorrect. The average Precision was 73%.

The "Accuracy Results by Subset" section allows you to see the results grouped in different ways, for example by Continent. These bars are clickable, allowing you, for example, to see the 6.25% of Research Grade observations from Asia that were assessed to be incorrect. Remember that characteristics such as quality grade, continent, etc. are from when the sample was generated and may have since changed (e.g. if someone marked what was a Research Grade observation as captive/cultivated, thereby making it Casual).

The blue button allows you to toggle between frequency and percent to see the sample sizes involved. For example, 6.25% is equivalent to 3 incorrect Research Grade observations from Asia.

Next steps

The results from these experiments are valuable in helping us all develop a shared understanding of what’s driving the accuracy and precision of the iNaturalist dataset and what changes we should make to improve it. We’re excited to continue doing experiments like this on a monthly basis. We want to learn from you what worked or didn’t work for you as part of the validator process, ways we could improve the experiments, and thoughts on opportunities suggested by these results to improve accuracy and precision.

Thank you so much for all your work creating this unique iNaturalist dataset by adding observations and identifications, and for your participation in making experiments like this possible!

Incorrect and Uncertain Research Grade observations

We wanted to end this post by digging into the incorrect and uncertain Research Grade observations assessed by this experiment in more detail.

Incorrect


Of the 13 Research Grade observations that were incorrect, 8 were what we’re calling the “Other Branch” scenario, in which validators selected a taxon on a different branch (2) from the observation taxon (1).


3 were what we’re calling the “Shouldn’t rule out Y” scenario in which validators thought the observation taxon (1) was too precise and that it should be rolled back by adding a disagreeing ID higher up on the tree (2) because other alternatives shouldn’t be ruled out.

2 were what we’re calling the “Multiple species” scenario, where it's unclear what the subject of the observation is (in these cases because multiple photos of different organisms were included in a single observation). The observation was Research Grade at the taxon shown in the first photo, but the norm in these scenarios is to roll the observation taxon back to the common ancestor of all the taxa associated with the various subjects.

Uncertain

Of the 14 Research Grade observations that validators classified as Uncertain, 3 were because we were unable to find candidate validators or to get responses from them.


7 were what we’re calling the “Uncertain beyond Z” scenario in which validators thought the observation taxon (1) could be correct but were uncertain about alternatives so added a non-disagreeing ID higher up on the tree (2).

4 were situations where more than one validation conflicted which we code as Uncertain. In 3 of these cases, one validator went with what we’re calling the “Likely X” scenario by adding an agreeing ID (2) to the observation taxon (1). The other validator went with the “Shouldn’t rule out Y” scenario described above.


In the 4th case, one validator went with “Likely X” while another went with “Other Branch” by choosing a hybrid between the observation taxon and a sibling species. This shows that there is some subjectivity in validation, particularly in judging how precise observations should be.

Our rule of thumb is that validators should base their decision on a ~99% risk threshold (e.g. if you think there’s a less than ~1% chance that it could be an alternative like Painted Lady or West Coast Lady you should select “Likely X”, if the probability of an alternative is greater than that you should select “Shouldn’t rule out Y”, and if you are uncertain about those probabilities you should select “Uncertain beyond Z”), but we realize that judging these probabilities will always be subjective.
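
To illustrate the rule of thumb (the ~1% threshold and the scenario names are the ones above; the function itself is just a hypothetical sketch, not a tool we provide):

    def validator_choice(p_alternative, threshold=0.01):
        """p_alternative: the validator's estimated probability that the organism is
        actually one of the alternatives (e.g. Painted Lady or West Coast Lady),
        or None if they can't judge the probabilities at all."""
        if p_alternative is None:
            return "Uncertain beyond Z"   # add a non-disagreeing ID higher up the tree
        if p_alternative < threshold:
            return "Likely X"             # add an agreeing ID to the observation taxon
        return "Shouldn't rule out Y"     # add a disagreeing ID higher up the tree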

We’re hoping that as we run more experiments, we can start developing shared archetypes for common situations that lead to incorrect or difficult-to-assess observations to help us design improvements to reduce them (e.g. the Multiple Species scenarios). We know only a handful of potential scenarios were represented in this small sample, and are eager to see what other common scenarios come up in future experiments.




Posted on January 31, 2024 10:25 PM by loarie

Comments

It's great to see the results of this experiment! And it's really quite encouraging that RG accuracy is as high as 95% across a wide range of taxa.

I'm very glad also to see the mention that "we can start developing shared archetypes for common situations that lead to incorrect or difficult-to-assess observations to help us design improvements to reduce them (e.g. the Multiple Species scenario)".

There were some pretty good suggestions in this forum thread about ways to adjust the UI to reduce the multiple species problem both up-front and retroactively. I'd be very interested to see whether research supports making some of these changes.

Edited to add: In fact, it looks like I already proposed an experiment that would generate data to inform the design of fixes for the multiple species problem.

Posted by rupertclayton 3 months ago

Wow! Thanks for the report!

Posted by connlindajo 3 months ago

Sample size for Africa was only 20 observations, so that result does not mean anything. It would be good to do this again with a relevant sample size.

Posted by traianbertau 3 months ago

@traianbertau, I agree that the size of the Research Grade Africa subsample was small (20 / 1000), but it's representative of the entire iNat dataset:
sample RG Africa / sample RG = 20 / 534 ≈ 3.7%, which is close to total RG Africa / total RG = 3,656,174 / 107,711,982 ≈ 3.4%
What do you think would be a sufficiently large sample size to get meaningful estimates from Africa? We could either increase the entire sample (e.g. to get an RG Africa sample of 1k we could repeat this design with a global sample of 50k, which would increase validator load by 50x), or we could do an Africa-specific substudy sampling just 1k obs from the continent.
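
For reference, the scaling arithmetic behind that 50k figure, as a rough sketch (the helper function is just illustrative):

    def required_global_sample(target_subset_size, subset_fraction):
        """Expected global sample size needed for a random subset (e.g. RG Africa)
        to reach a target size, assuming the subset fraction stays the same."""
        return round(target_subset_size / subset_fraction)

    # RG Africa was 20 of the 1,000 sampled obs (2%), so an expected RG Africa
    # subsample of ~1,000 needs a global sample of roughly 50,000.
    print(required_global_sample(1000, 20 / 1000))  # -> 50000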

Posted by loarie 3 months ago

Could the validators see the original IDs, or was all of that obscured? In my opinion all prior IDs should have been obscured. I think this would remove any chance of the validator just agreeing with the IDs as already presented.

Posted by jason_miller 3 months ago

Yeah, I agree with @jason_miller. I feel like some significant bias was introduced by giving an obvious choice to go to. Making this a blind study would greatly help in reducing that bias. Also, along these lines would be to take away CV on those observations. Anything we can do to limit implicit bias would help greatly.

Also, I'm sure someone will bring it up later as well, but the observations to be IDed by each identifier were chosen by taxon alone. For example, this meant that I, someone who really only knows NA taxa, was left to identify taxa in places such as Russia. If/when another round is run for this experiment, it would make sense to take locality into account when choosing identifiers. Maybe we could do it as we are currently doing it, but add on the caveat that observations will only be given to people if they are within a 100km radius of the observation.

Posted by eric-schmitty 3 months ago

@jason_miller, I agree that making sure IDers can't peek at previous IDs is an important criterion for selecting validators. As described in the methods, we used only improving IDs as an indicator of skill, which are the first IDs to propose a taxon - so I think we're on the same page in that validator selection wasn't biased in this way.

However, once we chose qualified validators and sent them their sample they were able to see past IDs on the samples they were validating. I agree that if our methodology wanted to simultaneously assess observation accuracy and validator skill it would be important to design a blind experiment. But in our methodology we were only interested in assessing observation accuracy.

(But I acknowledge that at least 3 improving IDs is not a perfect model for skill at identifying a taxon. A validator could have met this criterion just by luck (let's say they ID'd 1,000 unknown observations with 'honey bee' and by chance got 3 of them correct) or maybe used crutches like CV suggestions. The only window we have from this experiment on validator skill is from the 3 situations where validators proposed taxa from alternate branches (1 of the validators must be wrong). But I agree 3/1000 might be low, and one should design an experiment where validators can't peek at each other's answers if one wanted to assess validator skill.)

Posted by loarie 3 months ago

@eric-schmitty I agree with you that geographic expertise is important, because to be an effective validator one needs to know not just the taxon, but all the alternatives. As I mentioned in the thread here, I feel confident identifying Smooth Handed Ghost Crab in Eastern Australia where there are just 2 alternatives, but not in Southeast Asia where there are more like 6. But while this may have led to more Uncertains (me adding an ID of Genus Ghost Crabs to obs of Smooth Handed Ghost Crab from Southeast Asia), it shouldn't have led to validators misclassifying obs as Correct or Incorrect.

An important detail here is that by reporting the accuracy as the 95% Correct statistic we are effectively treating all the Uncertain obs as Incorrect, which is almost certainly not the case, so it's a conservative estimate of accuracy. If we instead assumed the 3% of RG Uncertain obs were all Correct, the RG accuracy would be 95+3=98%. The real estimate is probably somewhere in between these 95-98% bounds.
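
In code form, those bounds just come from treating the Uncertain portion as all Incorrect (lower bound) or all Correct (upper bound); a trivial sketch using the RG percentages above:

    def accuracy_bounds(pct_correct, pct_uncertain):
        """Lower bound counts Uncertain as Incorrect; upper bound counts it as Correct."""
        return pct_correct, pct_correct + pct_uncertain

    low, high = accuracy_bounds(95, 3)
    print(f"RG accuracy is probably between {low}% and {high}%")  # between 95% and 98%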

I'm very happy the Uncertain portion was so low (3%) from such a great response that we got from validators. I was worried this might be much higher.

Posted by loarie 3 months ago

@loarie

I don't know what sample size would be required. But the problem is rather the selected species. Out of the 12 represented animal species, 9 are unmistakable ones that very likely have never been misidentified at Research Grade to species level (subspecies may be rather tricky perhaps). Of the 3 somewhat difficult animals, one (the Pedioplanis lizard) turned out to be a possible misidentification. Would it be possible to draw a sample of difficult or at least not 100% unmistakable species? I am sure there are plenty of RG misidentifications for Africa when it comes to invertebrates or plants. I do mostly insect IDs and find misidentified ones pretty much every day.

Posted by traianbertau 3 months ago

I feel like this is a really important thing to gather more data on, especially with a larger dataset. In particular, if I'm reading this right, then there was only one fungus considered incorrectly identified but it still accounts for 16.6% of all fungal observations studied, which indicates a serious need for a wider analysis, although of course that does take much more effort.
Very interested to see where this goes, if there will be any future studies done, etc.

Posted by zachary_kalafer 3 months ago

@traianbertau great points. We did track rare vs common species as Taxon Observations Count (i.e. how many times the taxon has been observed in total). And there is a trend where obs of rarer taxa with lower obs counts have lower % correct, which supports your theory (though that might be driven more by increasing Uncertain rather than increasing Incorrect).

But the issue remains that because most iNat obs are of more commonly observed species (median of 16,140 obs count) the subset of the sample of rare taxa with <1000 obs is already very low (132 / 1000). And if we wanted to focus in on subsets of subsets (e.g. research grade & <1000 obs & from africa) the sample size gets very small very quickly (3 / 1000).

@zachary_kalafer, correct there were 55 Fungi obs in the whole sample (of 1000) and 6 in the RG subset (of 534) which reflects the overall rarity of fungi in the overall iNat dataset.

I think both of you raise important points, which is a reminder that this study was meant to give an estimate of the accuracy of the iNat dataset overall. The accuracy estimates aren't likely to be biased, but it's important to recognize that iNat itself is very biased towards North America, Plants & Insects, common taxa with > 10k obs, etc. (e.g. I focus on decapods and there were only 2 in this entire sample, which again reflects their relative rarity on iNat)

We can use this methodology to get estimates on subsets of obs (e.g. just obs of fungi, or just RG obs of rare African taxa) but to perform them we'd either have to scale up the sample size for the database-wide design so we can still have meaningful sample sizes for niche subsets. Or we'd have to do experiments focused just on particular subsets (e.g. a sample of just 1000 African Insect obs). Based on the great response from validators for this experiment v0.1, I think we could push the sample size up from 1k to 10k and still keep the once-a-month schedule we're planning without too much burden on validators, but I'm curious to know what people think. A sample of size 100k might be pushing it? Are there subsets that just aren't that interesting like Casual obs that we would be wise to drop from the sample?

Posted by loarie 3 months ago

Great result!

Posted by rangerpuffin 3 months ago

Very interesting and encouraging findings. Were there other criteria in selecting validator candidates beyond the three improving identifications? Was there subsampling of identifiers in addition to a subsampling of observations?

Posted by alex_abair 3 months ago

@alex_abair, the only criterion for selecting validators was that they'd previously made at least 3 improving IDs of the respective taxon. We contacted validators trying to get up to 5 validators for each sample. We ended up with an average of 4 validations per sample observation, but some observations had no validations, either because there were no qualified candidate validators (e.g. for this one as shown here) or because none of the candidate validators responded (e.g. for this one). What do you mean by subsampling identifiers?

Posted by loarie 3 months ago

As a general principle I do see the value in blind reviewing and agree that it typically produces robust results by avoiding some of the biases inherent in seeing previous IDs and who made them. But I also think that the importance of blind reviewing can be overemphasised. When a taxonomist/expert reviews specimens in a museum or herbarium, no one expects them to do so blindly; they have access to existing det slips and information regarding previous IDs and IDers on those specimens. So I don't necessarily see non-blind reviews being a major concern here either.

Posted by thebeachcomber 3 months ago

"what we’re calling the “Multiple species” scenario where it's unclear what the subject of the observations is (in this case because multiple photos of different organisms were included in a single observation). The observation was Research Grade at the taxon shown in the first photo, but the norm in these scenarios is to roll the observation taxon back to the common ancestor of all the taxa associated with the various subjects."
I thought and hoped that the identification would rely on the first photo, deliberately selected by the observer as such. The point is that further photos may be added to tell more about the organism and its environment, which can be very useful for science and the public. For instance, if you have a mixed forest formed by different tree species, or a multi-species butterfly puddle, it is useful to crop the first photo to a single species but to add general photos to show the circumstances and environment (together with which other trees this one grows, or with which other species some butterfly may congregate and with which not). I always used this approach bearing in mind such additional scientific goals and would be disappointed if these observations were rolled up to e.g. 'vascular plants' or 'Papilionoidea' some day.

Posted by oleg_kosterin 3 months ago

@oleg_kosterin the situation you describe is a bit different. If all of the photos in an observation share at least one species depicted in all of them, then there is no issue, and it’s completely fine for other taxa to also appear in those photos, as long as that focus species is shared among images

The situation described in the blog post is one where users combine photos of totally different organisms, often also from different times and places, such that each photo is entirely independent of/unrelated to the others

All of the photos in an iNat observation should depict the same organism at the same time and place, but then other taxa can also appear in these photos

Posted by thebeachcomber 3 months ago

@thebeachcomber, thanks for your explanation; that's a big relief. I am sorry I misunderstood the above.

Posted by oleg_kosterin 3 months ago

@loarie By subsampling validator candidates, I meant did you only reach out to a subset of the identifiers who met the validator criteria? 1,232 was a surprising number to me. I figured there would have been a much greater number of people who met the criteria.

Posted by alex_abair 3 months ago

Interesting re: Multiple Species. I agree there are a couple of related issues going on here that we might be able to address with a single fix or might have to treat differently. Just thinking out loud, I feel like there are at least 3 related situations I've labeled:

The example mentioned in this post would be "A. Multiple photos representing different subjects". People have proposed a DQA flag that kicks these into casual similar to "Location is Accurate" if, as @thebeachcomber defines it, the subject isn't in every photo.

But there are also examples of "B. Ambiguous subject" where it's not clear what the subject is. And even if the observer implies it via their ID, as in this example, these kinds of photos can confuse the CV model. People have proposed bounding boxes/cropping tools as a way of addressing this.

Thirdly, there's the "C. Subject not in every photo" case that I'm guilty of here, where (unlike A) the photos all relate to the subject but it's not in every photo. One option is not to allow photos that the subject is not in, but that's a slippery slope with things like evidence of the animal (prints etc.), and as @oleg_kosterin points out there's value in habitat shots etc., but they do confuse the CV.

Bounding boxes could in theory help with all 3 of these (e.g. in A and C the first photos would have a bounding box around the frog/shrimp and the remaining photos would have no bounding box as a signal that the subject isn't present in them; for B the bounding box would be around 1 of the 2 possible subjects). But there are lots of ways we could solve this.
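
To picture the bounding-box idea, here is a purely hypothetical per-photo annotation structure (not a planned iNaturalist schema), where a photo without a box is a context/habitat shot that the CV could skip:

    # Hypothetical annotations for a multi-photo observation (Scenario C).
    observation = {
        "taxon": "Smooth Handed Ghost Crab",
        "photos": [
            {"id": 1, "box": {"x": 120, "y": 80, "w": 340, "h": 260}},  # the subject
            {"id": 2, "box": None},  # burrow / habitat shot, subject not visible
            {"id": 3, "box": None},  # beach context shot
        ],
    }

    # Only photos with a box would be used as CV training examples.
    cv_training_photos = [p for p in observation["photos"] if p["box"] is not None]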

@alex_abair, we started with all possible validators (based on the 3-improving-ID criterion) and then randomly dropped as many as we could to shorten the list of candidates, as long as that didn't drop any observation below the 5-validator (or fewer) redundancy goal. That's how we whittled it down to 1,232.

Posted by loarie 3 months ago

@loarie, thanks for the consideration. Bounding boxes would be a solution, adding a little time to submission but not much. However, it is difficult to imagine how to deal with those millions of observations already uploaded which would not have boxes yet. (Some people may add them retroactively, but this would be a slow process.) Maybe the AI could be instructed to pay attention to the boxes only in observations uploaded after the date they are introduced? And would they be compulsory for plain cases like a butterfly sitting flat on the ground?

Posted by oleg_kosterin 3 months ago

Great results!

Posted by beartracker 3 months ago

It would be cool to do an experiment for species that are not easy to ID with blind reviewing (no CV, no previous identifications) and a control group with CV and previous IDs available (for the same sample) and then compare the outcomes in order to estimate the confirmation bias.

Also, re next rounds with larger samples: I would not mind identifying a larger set, preferably when it is from the region I am familiar with. (If the majority of the set is easy species, it can be done within a few minutes anyway.)

Re "rare" or difficult-to-ID species/taxa: they are not the same thing; some rarer species are super easy to ID and some common taxa are tricky.

Posted by traianbertau 3 months ago

Amazing work! I do think that scaling this up would help. It doesn't even have to be by a ton; 10k observations and double the number of identifiers would probably be plenty!

Posted by astra_the_dragon 3 months ago

I think that specifically looking at IDs for taxa with (relatively) fewer observations and/or a lower percentage of research grade observations might provide more insight into iNat's ID processes and how well the community ID works not just for "easy" taxa but also for ones that require a bit more expertise. Commonly observed taxa will also tend to be taxa that are well-known and reasonably feasible to ID, so there will be a larger pool of users looking at these observations and correcting mistakes. The real test is how well it works when there are more limited numbers of IDers and any mistakes are at risk of getting perpetuated by injudicious agrees or popular misinformation about a taxon.

I realize that finding multiple users who can validate IDs for less common taxa may be a challenge, particularly if you try to match them with a geographical area of expertise, but I am just throwing out some thoughts here.

Posted by spiphany 3 months ago

Habitat shots

If we could annotate single photos - it would be clear to CV and humans.
CV is obviously using habitat to say - pictures of this plant / landscape are obs of This taxon - which may or may not be true. (I remember a CV seal, halfway up an inland mountain, and not in rehab)

Posted by dianastuder 3 months ago

Since some comments here and previously are about the question of blind identification and bias from previous identifications being present, I have a point to make here. I think the validation identifiers that were "forced" to step out of their comfort zone and had to ID taxa and/or observations in regions unfamiliar to them were heavily influenced by either using CV or following previous identifications.
I have a good example: the observation is a honey bee from Livingstone (SW Zambia) and was identified to subspecies level as Apis mellifera ssp. scutellata.
I made my identification to species level as Apis mellifera but did not disagree with the "Is the evidence provided enough to confirm this is African Honey Bee Apis mellifera ssp. scutellata?" question. Three identifiers after me just agreed with the subspecies-level ID, obviously not questioning the first two IDs.

I had actually (not long ago) studied the question of subspecies of Apis mellifera in Africa and therefore knew that bees from NE Namibia, SW Zambia and the Zambezi valley can't be identified to a subspecies; this area is a zone of introgression between A. m. scutellata and A. m. adansonii.
(My "wisdom" comes from a PhD thesis available for download here: Radloff, S. 1996. Multivariate analysis of selected honeybee populations in Africa
https://commons.ru.ac.za/vital/access/manager/Repository/vital:5734/SOURCEPDF?site_name=Rhodes+University)
Obviously none of the other identifiers was aware of this. And this is when the confirmation bias sets in - you just agree without actually considering that you do not know how to identify this taxon.

I assume that similar identifications could have come about on other observations during this experiment when users had to identify things they do not properly know. Therefore, in some cases the confirmation bias may have led to "wrong" validations, and the iNat RG observation accuracy could be much lower than 95% correct?

Next rounds please with blind identifications, or only identifications within the identifier's field of expertise.

Posted by traianbertau 3 months ago

I don't know why everyone is so happy about 95%; it's not high accuracy for species ID, even accepting how it was estimated. It means that there are something like 5 million incorrect records in GBIF transferred from iNat. It only proves that there should be stricter requirements for "RG": at the very least 3 IDs instead of 2, and/or the observer's ID should not be counted towards "RG", as most users blindly agree with any ID they receive.

Posted by igor117 3 months ago

FYI one of the “Shouldn’t rule out Y” set was actually a multispecies issue.

Posted by lotteryd 3 months ago

@loarie noting typo in the first paragraph: "but this [is the first] quantitative accuracy estimate we've had"

Posted by zdanko 3 months ago

I'm still really curious about which fungi species were included - I'll be honest, the accuracy seems high to me, but the dataset could have been skewed by the huge amount of common, easily IDed fungi that are posted and make it to species with no issue.

Posted by lothlin 3 months ago

I agree that it needs to be done with a significantly larger sample size. That would also enable more breakdown into different taxa. However, this was a very successful first experiment overall I think. It's good to see that there was such a good response from verifiers, and it suggests that a more thorough experiment in similar style is viable.

Posted by matthewvosper 3 months ago

Very nice experiment and writeup. Congrats!

Posted by radrat 3 months ago

@traianbertau it would be good to hear more from the validators; I left a comment here. I suspect the discrepancy comes more from subjective differences in assessing risk rather than group think, but it would be interesting to hear their reasoning. But I agree with you: if that obs is likely A. m. adansonii (e.g. more than ~1%, or it's just really uncertain how likely alternatives are) and validators added IDs of A. m. scutellata, that's a validator skill issue (or an instruction clarity issue) that there are several things we could do to improve.

@igor117, I agree that 95% isn't necessarily anything to celebrate. But I think it's probably higher than most sources to GBIF; for example, this recent study found only 76% correct for herbarium data (vs 84% for iNat RG obs) from Southeastern US plants. What do you think a good accuracy threshold for sources to GBIF should be?

@lotteryd & @zdanko - good catches, thanks fixed both of those!

@lothlin, you can explore which fungi species were included by clicking on the Fungi bars on the Iconic Taxon Name graph here (e.g. correct, uncertain, & incorrect)

Posted by loarie 3 months ago

AH! I feel silly I didn't realize the bars were click-through. Thank you for correcting my stupidity.

Posted by lothlin 3 months ago

Nice job on the experiment! I especially appreciate the transparent approach you've taken the entire time. Hope it's just the first of an ongoing process & that issues that come up (e.g. the multiple photos problem(s)) can be addressed.

Posted by matthias55 3 months ago

Amazing results, but I would really like to see this replicated with more specific subsets of observations. I'd imagine something like Fungi, with both its inherent challenges and the larger number of inexperienced amateurs, would likely have lower accuracy than something like Aves. It'd be cool to collaborate with experts in those fields to do this same study and try to figure out how accuracy varies across clades. Maybe even try to identify particularly problematic clades.

Posted by common_snowball 3 months ago

@loarie: Your breakdown of three different types of multispecies problem is very helpful. The (many) discussions of these issues in the forum often stumble because these different problems get confused. I believe that currently issue A, "Multiple photos representing different subjects", is the most frequent and hopefully the most tractable.

It is definitely worth trying to address the issue at the point an observation is created. Anecdotally, it seems clear that most observations with issue A were created via the mobile app by inexperienced users choosing a bunch of photos they would like to ID. I'm hoping the new mobile app will make it easier for observers to avoid this mistake. Some simple logic to compare timestamps and/or locations between photos could catch a lot of these instances and allow the app to trigger a prompt to help the user fix the situation.
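
A minimal sketch of the kind of upload-time check described above (the thresholds and field names are assumptions, and real photos would supply EXIF timestamps and GPS coordinates):

    from math import radians, sin, cos, asin, sqrt

    def km_between(a, b):
        """Great-circle (haversine) distance in km between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
        h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(h))

    def looks_like_multiple_subjects(photos, max_hours=12, max_km=1.0):
        """photos: list of dicts like {"time": <datetime>, "latlon": (lat, lon)}.
        Returns True if any pair of photos is far apart in time or space, a hint
        that the draft observation may combine unrelated subjects and the app
        should prompt the user before upload."""
        for i, p in enumerate(photos):
            for q in photos[i + 1:]:
                hours_apart = abs((p["time"] - q["time"]).total_seconds()) / 3600
                if hours_apart > max_hours or km_between(p["latlon"], q["latlon"]) > max_km:
                    return True
        return False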

But we also have this problem with some small portion of the 190 million existing observations. There are two main motivations to fix issue A in existing observations: to improve the experience for those observers still active on iNat, and to improve the quality of iNat data (with consequent improvements to CV, data shared with GBIF, etc.) Adding a DQA flag for "Multiple photos representing different subjects/species" would be a good starting point for both goals. Like other DQA flags, it would make the observation casual and should require a majority vote, to prevent abuse. With that in place, identifiers have a clear way to signal that the observation has photos of different species. This can then be used to trigger a notification to the observer (if still active) and guide them through the process of splitting the observations.

Having that flag in place would also allow (but not require) iNat to take a later policy decision on some automated approach to splitting/focusing these "Issue A" observations. I believe there are ways to do that while respecting the user's ownership of their data and their geoprivacy (e.g. via obscured locations derived from individual photo metadata), but that seems like a decision that's a long way down the road.

Issues B and C are also real problems, but appear to occur with much lower frequency. For Issue C, a simple per-image checkbox for "habitat/informational material" that excluded the image from CV might be a good approach; this could be exposed for identifiers to vote on in order to improve the quality of historical data.

Posted by rupertclayton 3 months ago

I agree that it's good that many community members were interested and involved in this experiment, although I don't consider the results entirely accurate, for reasons others gave and additional reasons. This has also similarly been true of past attempts to estimate RG accuracy. In each case, the actual accuracy remains unknown but was overestimated in the results. The statement "I think it's probably higher than most sources to GBIF" also is unlikely, since the largest contributors to GBIF before iNaturalist grew to become the largest contributor were natural history museum specimen collections, e.g., of insects, birds, bats, etc., which overall would be expected to have higher accuracy.

Also, the similar website BugGuide, which also periodically sends (all of) its records to GBIF, is expected to have higher accuracy because it uses website standards and features that do more to promote accuracy. Also, the fact that each time iNaturalist RG estimations are published many identifiers provide corrections to the design and results suggests that maybe those opinions or consultation would ideally be considered before designing and conducting these experiments. I also remain taken aback that we're treating multispecies observations as a problem, given that the guidelines don't even allow users to upload them. Clearly, as some suggested, there could be better safeguards to prevent those from being uploaded in the first place. Having said all that, again, it's good that many identifiers and observers are helping to maintain an RG accuracy that can be considered relatively high for certain taxonomic groups. And in some cases, surprisingly high RG progress has recently been made, for example as I recently published for bees and wasps here.

Posted by bdagley 3 months ago

Many datasets derived from specimen collections in museums and herbaria have low ID accuracy. One recent study found that 47% of 1000+ Hedera physical vouchers from a range of European herbaria were not accurately identified.

You can visit virtually any museum or herbarium in existence, and there will be multiple taxa that are poorly curated and have low identification accuracy simply because they lack the resources, expertise, etc. The same problems that exist in iNat also exist for natural history collections

If you were to make the statement "I think [95% is] probably higher than most sources to GBIF" to many museum collections managers, I don't think they would disagree at all (from my experience of working with and talking to many museum and herbarium curators across Australia)

Posted by thebeachcomber 3 months ago

It seems from the methodology that the 1000 sampled observations were taken purely at random from all iNaturalist observations. This means that the sample probably mostly contains species that are frequently observed.

I think that frequently-observed species may tend to be easier to identify based on pictures than infrequently-observed species. The reasons could include: being bigger and therefore their structures more easy to verify on the picture, being more colorful (and perhaps easier to identify visually from pictures without dissection or measurement), being better known (charismatic or better studied for any reason), etc.

If this hypothesis was true, we would expect the frequency of observation of a taxon to be positively correlated with its identification accuracy. It would be interesting if samples could be taken from taxonomic groups of different sizes and observation frequencies, to be able to check whether the 95% accuracy reported here can really be extrapolated to the whole database, or whether it applies only to a subset of observations (e.g., charismatic, frequently observed organisms).

Posted by elbourret 3 months ago

A 5% error rate is pretty good, but does mean that there are a mere 5.4 million mis-IDs of the 108 million research-grade observations.

Posted by petezani 3 months ago

@thebeachcomber I'm mostly familiar with museum insect collections as sources sent to GBIF, and estimate that their overall accuracy is higher than the current overall iNaturalist RG % accuracy, which I estimate is lower than 95%. I haven't checked whether there are more animal or plant museum/herbaria collection datasets sent to GBIF, but considering only animal collections (e.g., birds), I also assume most of them are highly accurate. A full answer would require checking how many plant and animal GBIF sources there are, estimating their accuracy, etc. I don't doubt that some herbaria have lower accuracy.

Also, I should have added in my first comment, I currently estimate that the RG % accuracy is very high, possibly around 95% or higher, for particular insect groups including eastern US and Canada Bombus and North America or global Vespidae, which have been intensively identified in the past 3 years. However, I expect that the average website % RG accuracy for all wildlife groups combined is lower than those insect examples taken alone, since those groups are the most intensively identified, although exact estimates would be difficult to make. The current high Hymenoptera % accuracy shows that achieving a high overall RG % accuracy is possible, and certain other intensively identified plant or animal wildlife groups may also have a current high accuracy. However, that level of accuracy requires many IDs, so it is difficult for identifiers to maintain, and it would decrease if active identifiers stopped identifying, so adding additional accuracy-promoting modifications or features to the website in the future would help improve and maintain a high overall accuracy.

Posted by bdagley 3 months ago

@elbourret, we did that and you can see a trend that might support your theory here. These are sample observations grouped by their taxon's observation count (common species have lots of observations, rare species have few).

@petezani - if you believe the estimates, then up to ~5M iNat obs in GBIF would be misID'd (~2M incorrect and ~3M uncertain). But many (most?) of the uncertain observations may be correct; we just didn't have the capacity to validate them in this experiment. But I agree 2M is a lot, and 5M is even more, and there is surely additional uncertainty in these estimates...

Posted by loarie 3 months ago

@loarie this is really interesting. Now I wonder whether these differences in accuracy represent something intrinsic to the different taxa, which is also correlated with the frequency of observation, or whether it is the low frequency itself that is affecting accuracy (e.g., because there are fewer accurately identified records to compare the sample with). I also wonder whether taxa that have lower accuracy on iNaturalist also have lower accuracy in museum specimens. I can imagine that for taxa that require dissection or microscopic examination, museum specimens might be better identified in general, but for other groups the iNaturalist records might be as accurate as, or perhaps more accurate than, the museum specimens.

Posted by elbourret 3 months ago

@loarie @oleg_kosterin @thebeachcomber - [Great study and results!] Regarding Scenario C (organism not in every photo), I have wondered about the logic of changing ID in cases where photo 1 is clearly properly identified. It is very easy to accidentally upload the wrong secondary photos. I have probably done it a dozen times. I will always edit my observations if the problem is pointed out to me, but many users will not and applying that rule makes valid data unusable forever. Is there a background reason for downgrading records with irrelevant secondary photos, such as it creating problems with training the AI?

Posted by seanblaney 3 months ago

I might have missed it buried in the above, but I'd be interested in seeing a breakdown (or future study) by taxonomic group. I imagine certain groups have substantially better expert coverage and are more likely to reach high accuracy than others.

Posted by msr 3 months ago

@msr if you go here https://www.inaturalist.org/observation_accuracy_experiments/2?tab=verifiable_results
you'll see the "Iconic Taxon Name" graph. If you click on, e.g. incorrect verifiable Insects leading here
https://www.inaturalist.org/observations?id=171296826,63990586,99164785,185362402,172666415,80153436,128219201,183261823,51062990&place_id=any&verifiable=any
you can further add a taxonomic filter, e.g. incorrect verifiable Lepidoptera
https://www.inaturalist.org/observations?id=171296826,63990586,99164785,185362402,172666415,80153436,128219201,183261823,51062990&place_id=any&taxon_id=47157&verifiable=any
I think you'll find that with this small sample size of 1k, and given the relatively rare incidence of incorrect obs, you run into small sample size issues for these fine-grained taxonomic subsets.
It sounds like people are ok to try increasing the sample size from 1k to 10k for v0.2?

Posted by loarie 3 months ago

I agree with that. Bigger should be better.

One thing to consider: There seems to be some agreement that the most commonly observed taxa are for the most part correctly IDed. So would it make sense to exclude taxa with say >100K observations from the sample? Or maybe run 10 experiments of 1K each focused more narrowly on problematic taxa (e.g. fungi, taxa with <500 observations (could be any taxa), the African continent excluding mammals)? Other folks might have better ideas. It might depend to some extent on what the goals are. Is it to get an overall sense of the accuracy of the iNat dataset, or to "stress-test" it by looking for problems with difficult taxa?

Would it be possible to poll the iNat community in some way or some subset of highly engaged users?

Posted by matthias55 3 months ago

@seanblaney the thing is, if a user has a picture of a rabbit and a toad in the same observation and you make it a Research Grade rabbit on the basis of the first picture, iNaturalist is still going to end up sending a picture of a toad identified as a rabbit to places like GBIF. Which is not a good look!

In addition, how do we know that the date, time and location are appropriate to both animals?

Posted by matthewvosper 3 months ago

@igor117 I take your point that while 95% sounds high it does indeed mean 5% is potentially incorrect. But 95% confidence is pretty much the standard in probabilistic tests across the board in much of science. Results are published in respected journals if there is at least a 95% chance that they represent genuine findings rather than random chance. That also means as many as 5% of scientific papers could be drawing false conclusions; perfection is really hard to achieve! So if 95% is the standard in professional science, it's indeed impressive that iNat is achieving that level of accuracy in citizen science.

Posted by danielaustin 3 months ago

@danielaustin it's not applicable to species ID; it involves 5% wrong data even before the analysis is started, for most of the studies. Let's say we have 10,000 specimens of 100 species as a basis for some study. If for each ID we have 95% accuracy, then we for sure have a bunch of wrong IDs, about 500 wrong IDs in one study. How many additional species can that involve? We know for sure that the study is based on wrong data with such numbers; there is a near 0% chance that it represents correct findings. It's trashy science. If I want to make a map of some species' distribution with 1,000 findings and I know that the accuracy of their ID is 95%, then there could be random dots anywhere outside the actual distribution, and it makes no sense with such a dataset. To make an analogy, let's say we're analyzing how effective a cancer treatment is in 1,000 patients. But we know that only 95% of the patients were human, and the other random 5% (50 patients) are mice and rabbits or maybe some other species (as a result of partial data loss). How accurate will the results of this study be? So, answering the question of @loarie above, accuracy should be about at the same level as in the study with patients identified as Homo sapiens. Some singular mistakes are unavoidable in most of the groups, but to be used in science, in most cases, the error rate should be much lower than 1%.

Posted by igor117 3 months ago

Regarding the plans to do these experiments on a monthly basis: I think some care may be required here to make sure that identifiers continue to be willing to participate, particularly if you intend to increase the sample size. For example: is the plan to ask different people each time, and if so, what happens with taxa where there are few identifiers?

I was happy to do this once. However, the observations I was assigned either weren't taxa I found particularly exciting (honeybees) or they were out of my usual identification region, so I had to spend a lot more time researching in order to provide a meaningful ID. This is a quiet season for many of us (northern hemisphere winter), but if I were asked to look at 50 such observations instead of just 5, on a monthly basis, in the middle of the summer when new observations come flooding in and it's difficult to keep up, I would probably be inclined to decide that adding a fourth or fifth ID to observations that have already been reviewed (before the experiment or by other participants before me) is not a good use of my time when there are so many observations that have yet to be looked at by anyone.

Posted by spiphany 3 months ago

@igor an accuracy value of 95% is not even remotely close to 'trashy science', you grossly overestimate the general data accuracy rates of professional studies across the sciences (all fields)

Posted by thebeachcomber 3 months ago

It would also be relevant to interpret the results from these experiments, including determining whether the 95% accuracy result is an overestimation or an estimation that lacks sufficient sample size to know, by also estimating and comparing it with the current accuracy of the Computer Vision ID suggestions. The fact that the CV accuracy remains currently somewhat low could be another indication that the RG accuracy is actually less than 95%.

Re: the discussion over whether even a true 95% accuracy would be considered good enough for GBIF, that would be somewhat high, but it's also for a specific form of data where misidentifications aren't supposed to occur, where a museum specimen is typically supposed to be correctly determined. And again, the current accuracy seems to be below 95%, which would mean a greater than 5% error rate. Whatever the actual current accuracy is, 95% would be somewhat high, but a better goal or standard would be 97-99%. At the same time, it's possible that particular wildlife groups currently have a 95-99% accuracy. This is definitely true of the honeybee genus Apis (which has multiple species), where about 100% of the observations are already RG.

Posted by bdagley 3 months ago

For "a museum specimen is typically supposed to be correctly determined", the word supposed is doing some heavy lifting here. The unfortunate reality is that many museum specimen collections have an ID accuracy of lower than 95%, for many taxa

Posted by thebeachcomber 3 months ago

as a good starting point exploring ID accuracy in museum specimens, here is a recent paper looking at "all Texas land snail collections from the two major repositories in the state". One of their main findings was that:
"Species misidentification rate was approximately 20%, while 4% of lots represented more than one species. Errors were spread across the entire shell size spectrum and were present in 75% of taxonomic families."

https://doi.org/10.1111/geb.12995

Posted by thebeachcomber 3 months ago

As mentioned, I'm mostly referring to comparisons to bee and wasp collections.

Posted by bdagley 3 months ago

I really do think there is some popularity bias that is naturally going to happen because of the way computer vision works - especially with cryptic taxa, taxa that really require things like dissection or microscopy to identify.

Here's one example (I may have brought this up in the other thread)

Amanita bisporigera has 1,748 RG observations in North America (11,044 total in NA). Amanita suballiacea, a very close lookalike, has 59 RG (81 total) in NA.

But if you look at sequenced specimens, there are 51 sequences of A. bisporigera and 28 of A. suballiacea. It's a much smaller sample size, of course, but that ratio strikes me as massively, suspiciously different.

The accuracy for some taxa is probably much, much lower than 95% in actuality. Though for anything that isn't cryptic, I believe the 95%.

Posted by lothlin 3 months ago

lothlin's comment makes me also think that for groups with lots of taxonomic uncertainty like Fungi - meaning there isn't a clearly defined and finite set of leaf taxa - there's only so far we can reduce identification uncertainty before we are limited by taxonomic uncertainty.

Posted by loarie 3 months ago

I cannot find myself on the list of validators even though I participated.

Posted by arboretum_amy 3 months ago

This seems like something that really depends on the taxa. I would expect that the accuracy of research grade identifications is really high for mammals and birds, but much lower for (non-butterfly) insects and spiders. Future studies should explore this in more detail.

Posted by naturalist_jared 3 months ago

@spiphany if the sample we are offered jumps, I will ID as I do across my bookmarked URLs.
Engage with what interests me, and leave the rest in Mark as Reviewed. (Which would reflect the ID 'skills' for which I was chosen) I prefer to do the rough sorting, to make a batch available for taxon specialists. Californian poppy obs is an interesting discussion - but I left comments, and no ID.

Perhaps @loarie needs to offer a taxon set targeted for relevant specialists.
And a more general one for second tier identifiers like me - don't care what or where, if I can, I will nudge it along to those who can ID it.

On the other hand - if people want blind testing - then out of your comfort zone is indeed blind.
For the 2 obs I was offered - out of this experiment - they would have been - Mark as Reviewed - Next.

Posted by dianastuder 3 months ago

@arboretum_amy Maybe the list of validators itself is only 95% accurate ;)

Posted by danielaustin 3 months ago

@thebeachcomber, you should read the conclusion of this paper. They never said that it's a normal misidentification rate; quite the opposite, they point out that "Researchers should limit their use of museum record data to situations where their inherent biases and errors are irrelevant, rectifiable or explicitly considered. At the same time museums should begin incorporating expert specimen verification into their digitization programs." I don't think there were any well-known experts on land snails in Texas, so those collections were probably identified by non-experts. If the ID rates suck for some group in some museum in Texas, it doesn't mean that it should be that way everywhere. The point of this article is rather that if the situation is that bad, it's by itself worth a publication in a top journal.

Posted by igor117 3 months ago

@igor117 I am well aware of the paper’s contents and conclusions :)
Have you looked at many natural history collections for yourself? Misidentification rates as bad as, or even worse than, those found in that paper are certainly not rare whatsoever for many taxa and many parts of the world.

If “ The point of this article is rather that if the situation is that bad, it's by itself worth a publication in a top journal.” were true, then there are thousands and thousands of papers just begging to be written

Posted by thebeachcomber 3 months ago

@thebeachcomber as a matter of fact I have; I'm even a curator of one. Misidentifications are very rare (apart from outdated taxonomy) if those are major collections that are or were curated by experts, and almost every such mistake is at the level of begging a publication. Of course, some collections that were just a side thing for someone, usually smaller ones and/or in smaller institutions, are another case, but those rather require revision by an expert, not uploading to GBIF.

Posted by igor117 3 months ago

I guess we'll have to agree to disagree. There are plenty of collections I know of with many misidentifications. It's great your collection is well-identified, but it is an exception, not the rule. Many cases like these are, as you say, due to an expert having not yet curated the collection, which is actually quite prevalent; not just in small collections as you claim, but also in large collections. There are numerous major natural history collections with literally millions of lots; it is absurd to think that all of these specimens have been reviewed by an expert, as there simply aren't enough experts or enough time. So the fact that a collection is a 'major' one is often irrelevant, and indeed in some cases a major collection is far more likely to have multiple poorly curated taxa by virtue of its sheer size and diversity because, as I said, it is not reasonable to expect all of the taxa and specimens included in the collection to have been reviewed when you're dealing with those kinds of magnitudes.

Also having said that, experts are not infallible, they also make mistakes, and this is especially the case where an expert may have last reviewed a taxon many years ago before a revision of the group occurred. So just because a particular collection has been reviewed by an expert does not guarantee that misidentifications in it are rare

Posted by thebeachcomber 3 months ago

@thebeachcomber well, with "the case where an expert may have last reviewed a taxon many years ago before a revision of the group occurred" it's not a case of misidentification, but a case of outdated taxonomy. Of course some name that was used in an ID 100 years ago may refer to 10 different species in current taxonomy, but that does not mean it was a mistake at the time of the ID. Of course it's something that should be considered when working with old collections, and it is difficult to understand such collections without good knowledge of the history of taxonomy in a given group; it's also something that should not be uploaded to GBIF without a modern revision.

Posted by igor117 3 months ago

Overall, as multiple people have said, the accuracy of a museum collection, for example an insect collection primarily identified by experts, would be higher than the global accuracy of that wildlife group on this website, except in cases of groups where identifiers have made unusually extensive reviews, like some bee and wasp groups.

Consider that all that RG status requires in some cases is for two people to use incorrect CV/AI ID suggestions or to guess incorrect IDs on an observation. And, when an observation becomes RG it then becomes hidden from many or most identifiers because it's no longer shown in the default Identify Needs ID observation filter. So, except for certain groups where identifiers recheck the IDs of already-RG observations, we know that there is a percentage of purely guessed or wrong CV/AI IDs in the RG collection of observations for many wildlife groups. We also know that the CV/AI accuracy on its own is somewhat low, at least for cryptic groups like bees and wasps. So, in at least some ways, neither the overall/average percent accuracy of all wildlife groups nor the means by which the records were identified for some of the groups are at all comparable to an animal museum collection mostly identified by experts. If the RG accuracy were really so high, the CV/AI accuracy would also be expected to be at least somewhat higher.

Posted by bdagley 3 months ago

"If the RG accuracy were really so high, the CV/AI accuracy would also be expected to be at least somewhat higher."
This is not true. The CV does not actually learn the traits of the organism in question; it learns what features are typically present in photos of a particular organism -- this may be an association with a particular pollen plant that also happens to be visited by other species, or it may confuse a yellow, pollen-filled scopa with the color of the scopa itself. Because it isn't trained on observations with a higher-level community ID, the training set is skewed and there are many groups it doesn't know at all. I agree that wrong CV suggestions are a problem, but they are not typically the result of it being trained on observations that were ID'd incorrectly; they are the result of other limitations of the training.

Posted by spiphany 3 months ago

@eric-schmitty: "If/when another round is run for this experiment, it would make sense to take locality into account when choosing identifiers. Maybe we could do it as we are currently doing it, but add on the caveat that observations will only be given to people if they are within a 100km radius of the observation."

I'm based in the UK, but much of my taxonomic work was undertaken in the tropics and was based on particular groups of fungi. I can be more puzzled by a fungus from my garden that's outside my area of knowledge than a fungus from the other side of the globe that's within my area of knowledge - so I would not limit identifiers by location.

Posted by mycoscope 3 months ago

@ spiphany "If the RG accuracy were really so high, the CV/AI accuracy would also be expected to be at least somewhat higher."
This is not true.

Actually, there should be some positive relationship between the two, despite your point (which I hadn't claimed) that "The CV does not actually learn the traits of the organism in question; it learns what features are typically present in photos of a particular organism." Suppose the RG accuracy were 97% or 99%: would there still be no change in CV accuracy? There would be, because RG accuracy is in some ways related to the accuracy of Needs ID observations, at least ones that have received experienced IDs. The CV is supposed to improve at least gradually over time, partly because identifiers make accurate IDs, given that it is trained on those images. As a partial side note, it would seem strange if current RG accuracy really were so high (being mostly from manual, non-CV IDs) while CV accuracy remained so low. That would support the currently common view that, at least for now, the CV causes more misidentifications than it prevents.

A separate point I made is that many RG obs. were made RG by only two users using incorrect CV IDs and/or guessing incorrect IDs. But the results of this experiment are phrased as applying to all RG obs., which is misleading. The title should indicate that the estimate applies only to the small sample that was intentionally validated by identifiers with prior improving IDs. Given that many other obs. are currently RG based on guessed IDs or incorrect CV IDs, the overall accuracy of RG must be below 95%. I, and others here and on the previous thread, have also given many additional reasons why the experiment design and the interpretation of results seem inaccurate (which also makes comparisons to museum collection accuracy less relevant, until we know the actual accuracy). As mentioned, groups like global Apis species, eastern US and Canada Bombus or Vespidae species, Americas Sceliphron, and North America Sphecius were recently intensively reviewed and identified, with very high percentages of observations at RG (e.g., 95%, 99%). I estimate that some of those RG percentages may never again be as high as they are during this winter season. So, if people really wanted to estimate the accuracy of RG obs., it would be relevant to design somewhat modified further studies that at least partly focus on groups known to be intensively identified by multiple validators at very high sample sizes.

Overall, there seems little further reason to over-interpret or debate the reported 95% RG accuracy, since from the first journal post on this many commenters immediately pointed out issues with the experiment design that mean such an estimate (which is phrased, at least in the title, as if it applies to all RG obs.) can't accurately be made. This has also been true of past similar experiments, and part of the reason is the same: identifiers typically weren't consulted for input on the design before the estimates were made or the experiments conducted. I'm not against people conducting these experiments; I'm only assessing what they actually do and don't show. For example, it would be very misleading or inaccurate if the 95% RG accuracy result were published in additional public sources. At the same time, a different, properly designed experiment could probably accurately report that RG accuracy is very high (possibly near or above 95%) for particular groups in locations that have been intensively identified by validators. Finally, note that this would mean it's the validators doing most of the work to achieve the high accuracy, not the CV, and even despite the CV acting in some ways as a counteracting force. But the title also doesn't make clear that this is a non-CV experiment, so some could misinterpret it to mean that CV accuracy has also become high.

Posted by bdagley 3 months ago

Maybe it's best to limit by the combination taxon + location - but the location of identifications, not the location of observations. Someone can know a broad taxon in one place but something else, usually a subset, in another. I can say something useful about hoverflies anywhere, but I can only do greenbottles in Europe. And I can basically only do myriapods in the UK.

Posted by matthewvosper 3 months ago

If the next experimental set were in a project - and identifiers could pick out taxon and location to suit themselves - I wonder what result that would give.

Posted by dianastuder 3 months ago

I am not sure how the suggested identification tool works, but it seems to create problems, at least for fungi. Take the mushroom Agaricus hondensis as just one example. Not a species I have previously heard of, but there are 57 records from the UK and the overall map indicates a more or less worldwide distribution. A bit of investigation shows it is a species originally described with redwoods in California, with a range extending along the North American west coast. Restricting the map to research grade gets rid of most observations outside the North American west coast, but still leaves a handful of British records.

Clicking on some of the UK records, I find that Agaricus hondensis comes up amongst the species suggestions with the reassuring note "expected nearby". At least a couple of other American Agaricus species come up as "nearby" suggestions too. It would seem that a few misidentifications can encourage the identification tool to suggest yet more misidentifications. It would perhaps be helpful if it didn't exist.

Posted by mycoscope 3 months ago

@mycoscope it takes one ID, on one obs, to trigger Seen Nearby. Motivation to tidy up distribution maps when the first few happen, to pre-empt more. 'Seen Nearby' needs to have more basis than just the one.

Posted by dianastuder 3 months ago

@dianastuder since the introduction of the Geomodel that should no longer be true, as I understand it. 'Seen Nearby' no longer exists, and its replacement 'Expected Nearby' should be more conservative. However, any system is only as good as the input data, and since the geomodel is only a couple of months old it will still be thrown off the scent where there is a significant body of legacy misidentifications - as in this case (I presume).

Posted by matthewvosper 3 months ago

@mycoscope I agree with you on that point then. I still think having locality play a part is extremely important, but maybe based on the observations you've IDed. So, an input of taxon and locality, as @matthewvosper suggested.

For example, we can use my IDs of Woolly Burdock: https://www.inaturalist.org/observations?ident_user_id=824054&not_user_id=824054&place_id=any&subview=map&taxon_id=124815
Whatever API they use to find identifiers could basically draw a 100km radius around each of the observations you have identified for the taxonomic group, and if the observation to be identified falls within that range, you would be selected to review IDs for that area (see the sketch below). Of course, you could probably vary the range depending on the group (birds vs. plants), but I feel that taking into account SOME form of locality when choosing identifiers is very important.
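Here is a rough sketch of that check (my own toy code, not the actual iNat API; the coordinates below are made-up illustrative values): given the coordinates of the observations a user has identified for the taxon, test whether a candidate observation falls within 100 km of any of them.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def is_local_identifier(identified_coords, obs_lat, obs_lon, radius_km=100):
    """True if the candidate observation lies within radius_km of any
    observation this user has previously identified for the taxon."""
    return any(haversine_km(lat, lon, obs_lat, obs_lon) <= radius_km
               for lat, lon in identified_coords)

# Hypothetical coordinates of my previous Woolly Burdock IDs (Wisconsin-ish):
my_id_coords = [(44.5, -89.5), (43.1, -89.4)]
print(is_local_identifier(my_id_coords, 43.0, -89.0))  # True: ~35 km from the second point
print(is_local_identifier(my_id_coords, 51.5, -0.1))   # False: London is far outside 100 km
```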

The main reason I am so big on this is that this last round asked me to ID Common Ragweed. Now, looking at the observations I have IDed before, they have ALL been in Wisconsin: https://www.inaturalist.org/observations?ident_user_id=824054&not_user_id=824054&place_id=any&subview=map&taxon_id=124815

The only two that aren't, https://www.inaturalist.org/observations/26614604 and https://www.inaturalist.org/observations/144424875, are from this experiment. I do not feel that my IDs there are reasonable at all, because while the observations match the morphological traits of Common Ragweed, I have no idea whether there are other cryptic species in that region.

Posted by eric-schmitty 3 months ago

Ok, so ignore the jankiness, because I was reusing code, but maybe a map of the localities would look something like this for the Woolly Burdock observations from earlier: https://drive.google.com/file/d/1veRAaye6APlN7b01ka6LW6C-HVroT13Q/view?usp=drive_link

Now, it would probably make more sense to do this with raster data, because I'm pretty sure that's how most of the iNat data works, but the same idea applies (a toy version is sketched below).
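A minimal toy version of that grid/raster idea (nothing to do with how iNat actually stores its data; the cell size is an arbitrary choice): snap each identified observation to a coarse lat/lon cell and check whether the candidate observation falls in an occupied cell.

```python
def cell(lat, lon, size_deg=1.0):
    """Snap a coordinate to a coarse lat/lon grid cell."""
    return (int(lat // size_deg), int(lon // size_deg))

def build_id_raster(identified_coords, size_deg=1.0):
    """Set of grid cells in which the user has previously identified the taxon."""
    return {cell(lat, lon, size_deg) for lat, lon in identified_coords}

# Reusing the hypothetical Wisconsin points from the sketch above:
raster = build_id_raster([(44.5, -89.5), (43.1, -89.4)])
print(cell(43.4, -89.2) in raster)  # True: same 1-degree cell as (43.1, -89.4)
print(cell(51.5, -0.1) in raster)   # False: no previous IDs in that cell
```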

Posted by eric-schmitty 3 months ago

@mycoscope The CV really does fungi dirty, especially in macroscopically cryptic genera like Agaricus (see the note I made above about Amanita sect. Phalloideae observations).

I honestly think the fix would be to massively downplay how readily the CV suggests species-level identifications for fungi - genus is usually not so bad, but it constantly makes wrong suggestions at species level.

Posted by lothlin 3 months ago

I also noticed that.

Posted by bdagley 3 months ago

Wow

Posted by ck2az 3 months ago

95% is quite encouraging.
But I think there are ways that could make it much higher.
The biggest problem on iNat is that a novice posts something, someone else adds an ID, and the novice, instead of simply withdrawing their own ID, duplicates the other poster's. Instant RG for what is effectively 1 ID, not 2.
Without forcing anything upon anyone, I think there are ways to improve this situation.

One idea: a periodic, impersonal reminder to everyone (or to those it may be relevant to, based on some calculation, perhaps shown at the moment of adopting a new ID) that wrong IDs should simply be withdrawn unless they have genuinely been assessed to support the new one.
Another: an option for people to submit an initial ID as "this is a guess" vs "I'm sure". Guesses would count as 0, not 1.
I'm sure other ideas are available.
Cheers!

Posted by meteorquake 3 months ago

Better onboarding could help with that.

Posted by dianastuder 3 months ago

Perhaps identifiers were more cautious than usual, or otherwise changed their behavior, because they were part of the experiment?

On the other hand, being forced (by the terms of the experiment) to ID all images may introduce more error than usual (some IDers routinely only ID images well within their comfort zone).

Posted by johnascher 2 months ago

@meteorquake Other possibilities are to count as zero toward RG any identifications by people who are "unreliable" by various objective metrics (just joined the site, few observations or IDs, too many maverick IDs, etc.).

Posted by johnascher 2 months ago

I'm certainly up for some automated system, and I think that if it's done well people will understand it as being for the greater good.
I'm not sure that being a new joiner etc. is a helpful criterion, because you can get experienced joiners, but I would certainly update any grades retrospectively: if someone is recalculated as unreliable or reliable in a certain area, their older gradings (whether for posts or IDs) could be reduced or increased accordingly, moving observations into or out of RG as needed. But such an algorithm must bear in mind that someone can be good at, say, plants in one country but a learner of plants in another, or good at mosses and poor at plants in the same country, so it would have to be an intelligently designed calculation. It would also have to account for the fact that the AI algorithm may make people look experienced when they're just selecting the suggested offering, and be aware of easy vs. hard IDs.
What I think would be a much better alternative is for people to self-declare their experience as 0, 0.5 or 1. So you might say you are good at fungi in Britain and your IDs there should have a weight of 1, assign your plant IDs in Britain a 0.5 because you're an experienced learner, and anything not self-declared would be rated 0. I think for the most part people would only self-declare for areas they do have experience with, and would do so reasonably correctly; those who didn't would be relatively few and would get their IDs corrected by community voting, and anyone with an obvious discrepancy in their self-declared ratings would get algorithmically bumped.
However, at the end of the day I think the most flexible system is that, on offering an ID, you have more than one button, ideally three: Guess, Probably, Certain (a toy sketch of how such weights could score is below).
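A toy sketch of how such weighting could score (purely illustrative; the weights, threshold and function below are my own invention, not a proposal for iNat's actual algorithm): each ID carries a confidence weight, and the leading taxon only "passes" once its weighted votes reach the equivalent of two confident IDs.

```python
WEIGHTS = {"guess": 0.0, "probably": 0.5, "certain": 1.0}

def weighted_leader(ids, threshold=2.0):
    """ids: list of (taxon, confidence) pairs. Returns the leading taxon
    if its summed weights reach the equivalent of two confident IDs,
    otherwise None. Purely illustrative scoring, not iNaturalist's rule."""
    totals = {}
    for taxon, confidence in ids:
        totals[taxon] = totals.get(taxon, 0.0) + WEIGHTS[confidence]
    if not totals:
        return None
    taxon, score = max(totals.items(), key=lambda kv: kv[1])
    return taxon if score >= threshold else None

# A guess duplicated after seeing someone else's ID no longer rubber-stamps RG:
print(weighted_leader([("Species A", "certain"), ("Species A", "guess")]))    # None
# Two confident IDs would pass:
print(weighted_leader([("Species A", "certain"), ("Species A", "certain")]))  # Species A
```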

Posted by meteorquake 2 months ago
