Paper 037: A Validation Study of the Accuracy of Lexical Diversity Tools
HU XIAOLIN, Tokyo University of Foreign Studies
Keywords: lexical diversity, tools evaluation, accuracy
Abstract
In recent years, a variety of lexical diversity tools have become available, making it easy to analyze large amounts of text or whole corpora (Webb, 2019). However, few studies have assessed the accuracy of these tools.
This research aims to evaluate the accuracy of lexical diversity tools and to explore factors that may affect their output. To this end, six widely used tools were tested on 24 transcriptions from the NICT JLE Corpus (Izumi et al., 2004). The tools were divided into two groups according to their lemmatization principles. Group 1 (no lemmatization; every distinct word form counted as a separate type) contains Text Inspector (Bax, 2012), VocabProfile Program (Cobb, 2002), Coh-Metrix (Graesser et al., 2004) and CLAN in unlemmatized mode (MacWhinney, 2000); Group 2 (with lemmatization; all inflected forms of a word counted as one type) contains CLAN in lemmatized mode (MacWhinney, 2000), Lexical Complexity Analyzer (Lu, 2012) and TAALED (Kyle & Crossley, 2015). The type/token ratio (TTR) and Guiraud’s index, computed from manually counted types and tokens, served as the gold standard for evaluating accuracy. This research focuses only on TTR and Guiraud’s index, but the accuracy with which types and tokens are recognized and counted affects other metrics as well.
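As a minimal illustration of the two measures, the sketch below uses the standard definitions: TTR is the number of types divided by the number of tokens, and Guiraud’s index is the number of types divided by the square root of the number of tokens. The function names and the example counts are hypothetical and are not taken from the study’s data.

```python
import math


def ttr(n_types: int, n_tokens: int) -> float:
    """Type/token ratio: types divided by tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / n_tokens


def guiraud(n_types: int, n_tokens: int) -> float:
    """Guiraud's index: types divided by the square root of tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return n_types / math.sqrt(n_tokens)


# Hypothetical counts for one transcription (not from the study).
example_types, example_tokens = 50, 100
print(ttr(example_types, example_tokens))      # 0.5
print(guiraud(example_types, example_tokens))  # 5.0
```

Because both measures depend only on the type and token counts, any disagreement between tools about what counts as a type or a token carries directly into the scores.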
The results show that in Group 1, Text Inspector produced the mean TTR closest to the gold standard (error percentages: Text Inspector 0.33%, Coh-Metrix 1.00%, VocabProfile 2.22%, CLAN 4.20%). For CLAN (unlemmatized mode), a strong positive correlation was observed between the proportion of contracted forms in a text and the TTR error rate (r = 0.95). In Group 2, CLAN (lemmatized mode) performed best in TTR (CLAN 0.70%, Lexical Complexity Analyzer 2.43%, TAALED 13.74%). Although the differences were smaller for Guiraud’s index, Text Inspector (Group 1) and CLAN (Group 2) again outperformed the other tools. Tools for lexical diversity should therefore be selected carefully according to the data type and research aims.
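The abstract does not state how the error percentage was computed. The sketch below assumes one common choice, the absolute difference between the tool’s score and the gold standard expressed as a percentage of the gold standard, and pairs it with a plain Pearson correlation of the kind reported for contracted forms. Both function names are hypothetical, and the example values are invented for illustration.

```python
import math


def error_percentage(tool_value: float, gold_value: float) -> float:
    """Assumed definition: |tool - gold| / gold * 100."""
    if gold_value == 0:
        raise ValueError("gold_value must be non-zero")
    return abs(tool_value - gold_value) / gold_value * 100


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented example: proportion of contracted forms per text and the
# corresponding TTR error percentage for one tool.
contraction_share = [0.01, 0.02, 0.04, 0.05]
ttr_error = [1.0, 2.1, 3.9, 5.2]
print(round(pearson_r(contraction_share, ttr_error), 3))
```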
References
Bax, S. (2012). Text Inspector. Online text analysis tool. Available from: https://textinspector.com/.
Cobb, T. (2002). The Web Vocabulary Profile. http://www.er.uqam.ca/nobel/r21270/textools/web_vp.html
Graesser, A., McNamara, D., & Louwerse, M. (2004). Coh-Metrix. Available from http://www.cohmetrix.com
Izumi, E., Uchimoto, K., & Isahara, H. (2004). The NICT JLE corpus: Exploiting the language learners’ speech database for research and education. International Journal of the Computer, the Internet and Management, 12(2), 119–125.
Kyle, K., & Crossley, S. (2015). TAALED. Available from http://www.linguisticanalysistools.org
Lu, Xiaofei (2012). The Relationship of Lexical Richness to the Quality of ESL Learners’ Oral Narratives. The Modern Language Journal, 96(2), 190-208.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk: Volume 1: Transcription Format and Programs. Volume 2: The Database. Lawrence Erlbaum Associates.
Webb, S. (2019). The Routledge Handbook of Vocabulary Studies. London: Routledge.
Supplementary Information
None
Q&A live (Zoom) session
No longer available.
This looks to be a very interesting paper. I hope you enjoy the conference! – Organizing committee
@Hu, thank you for the very interesting talk. I think this is an extremely important area of work. Too few researchers critique the tools they use, and you make a good effort to address this issue. I liked your focus on just two measures that underlie many others. This was a great design choice. However, it would have been good to explain clearly what your Gold Standard definition of a token was (e.g. including apostrophes, etc.) and to change the settings of the tools to match this before starting the analysis. You seem to be comparing your ‘gold-standard’ settings against ‘default’ tool settings, and then describing the tools as ‘inaccurate’ if they did not match your results. Here is a question: Did you try matching the settings? If you did, how did the results change?
I very much enjoyed the presentation. Thank you!
Thank you so much for your reply. Great questions!
Sorry for not defining the token clearly. The slides (page 9) show examples of how tokens were counted in the Gold Standard (manual counting). I used the most common approach so that it would fit most tools: punctuation such as apostrophes and hyphens is not included in tokens, but numbers are kept within a single token, since some private information is replaced by strings like XXX01.
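One possible reading of this counting rule can be sketched in Python. This is an illustration under stated assumptions, not the procedure actually used for the manual counts: it assumes that apostrophes and hyphens act as token boundaries (so "don't" yields two tokens), that letters and digits stay together (so XXX01 is one token), and that types are compared case-insensitively. The names `tokenize` and `count_types_tokens` are hypothetical.

```python
import re

# Assumption: a token is a maximal run of ASCII letters and digits, so
# apostrophes, hyphens and other punctuation never appear inside a token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Split text into tokens made of letters and digits only."""
    return TOKEN_PATTERN.findall(text)


def count_types_tokens(text: str) -> tuple[int, int]:
    """Return (number of types, number of tokens) without lemmatization.

    Assumption: types are distinct word forms compared case-insensitively.
    """
    tokens = tokenize(text)
    types = {token.lower() for token in tokens}
    return len(types), len(tokens)


print(tokenize("I don't know XXX01, it's well-known."))
# ['I', 'don', 't', 'know', 'XXX01', 'it', 's', 'well', 'known']
```

A tool that instead keeps "don't" as a single token would report different type and token counts for the same text, which is the kind of disagreement discussed here.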
As for the default settings: I know that the token definition can be changed in AntConc, but very few tools offer this option. Moreover, some tools are not transparent about how tokens and types are counted.
Of course, my manual counting is not the only “accurate answer”; some tools simply count by their own rules. What I would like to remind researchers of is that such disagreement in token/type counting can cause a big difference in the results. So it may be a good idea to check reliability with a pilot test, in order to pick the tool that actually performs the analysis the way we want it to.
I look forward to talking with you in the Q&A session if you have further questions.
An excellent talk. As you suggested, tokenization is really a nuisance for vocab analysis. I totally agree with your comment at the end of your talk. All corpus users need to be careful about how they use software (and how they use data). Shin Ishikawa, Kobe U.
Thank you for the comments. Always a big fan of your work!