Paper 017: Issues in Compiling and Exploiting Textbook Corpora

anthony September 29, 2020 Uncategorized 16 Comments

LE FOLL, Elen (Osnabrück University, Germany)

Keywords: Corpus Design, Coursebooks, Pedagogical Materials, English as a Foreign Language, Classroom English

Abstract

Textbooks are known to be “one of the most important educational inputs” (Pingel 2010: 7) and, as such, have long been cherished objects of research, especially in the social and political sciences. In applied linguistics, page-by-page analysis of textbook language was once a difficult, time-consuming process. However, the development of digital data storage and retrieval enabled Mindt (1987; 1992) to pioneer a new approach to language textbook analysis using computer-readable textbook corpora. Since then, multiple studies have highlighted the potential of corpus-based textbook analysis both for materials development and evaluation, and as a means of capturing learner language input and better understanding learner language production (cf. Römer 2004; Meunier & Gouverneur 2009).

However, in practice, compiling textbook corpora can be an arduous task, fraught with potential pitfalls. In this paper, I highlight some of the issues specific to compiling a corpus of textbooks and propose possible solutions. The compilation process begins with the selection of the materials to be included in the corpus, a process driven by the research questions. I discuss questions pertaining to the design of the sampling frame, corpus size, representativeness, and balance (cf. Biber 1993). Due to the many different font types, colours, and often complex page layouts, automatic OCR (optical character recognition) is usually much more complex for textbooks than for most other text types.
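As a rough illustration of the balance question, the following minimal sketch tallies the share of tokens contributed by each series and each proficiency level in a hypothetical sampling frame. The `Volume` record, the series names, the levels, and all token counts are invented for illustration; none of them are taken from the corpus described here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """One textbook volume in a hypothetical sampling frame."""
    series: str   # invented series name
    level: int    # invented proficiency level or school year
    tokens: int   # invented token count


def share_by(volumes, key):
    """Return each group's share of the total token count (0.0 to 1.0)."""
    totals = Counter()
    for vol in volumes:
        totals[key(vol)] += vol.tokens
    grand_total = sum(totals.values())
    return {group: count / grand_total for group, count in totals.items()}


# Invented figures, for illustration only.
frame = [
    Volume("Series A", 1, 20_000),
    Volume("Series A", 2, 30_000),
    Volume("Series B", 1, 25_000),
    Volume("Series B", 2, 25_000),
]

by_series = share_by(frame, key=lambda v: v.series)
by_level = share_by(frame, key=lambda v: v.level)

for name, share in by_series.items():
    print(f"{name}: {share:.0%} of tokens")
```

In this invented frame the two series are evenly balanced, while the levels are not: a frame can be balanced along one dimension and skewed along another, which is why the dimensions that matter have to follow from the research questions.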

For many corpus linguistic measures, the basic unit is the text. However, textbooks are essentially collections of many different texts. Further problems arise because some texts are extremely short (e.g. instructions), whilst others span several pages (e.g. short stories). In addition, what constitutes a textbook must also be defined since educational publishers now offer multimedia packages that include coursebooks, workbooks, teacher’s guides, audio and video materials, etc.
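To make the text-length problem concrete, here is a minimal sketch of per-text normalised frequencies. The whitespace tokeniser, the choice of feature (the modal "will"), the two sample texts, and the 100-token threshold are all assumptions made for illustration; they are not the method used in this study.

```python
def per_thousand(text, feature):
    """Frequency of `feature` per 1,000 whitespace-separated tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return 1000 * tokens.count(feature) / len(tokens)


def split_by_length(texts, min_tokens=100):
    """Separate texts at or above a (hypothetical) length threshold from shorter ones."""
    long_enough, too_short = [], []
    for text in texts:
        bucket = long_enough if len(text.split()) >= min_tokens else too_short
        bucket.append(text)
    return long_enough, too_short


# Invented examples: a one-line instruction and a longer narrative passage.
instruction = "Say what you will do tomorrow"
story = " ".join(["She said that she will call"] + ["and then they walked home"] * 39)

rate_instruction = per_thousand(instruction, "will")
rate_story = per_thousand(story, "will")
long_enough, too_short = split_by_length([instruction, story])

print(f"instruction: {rate_instruction:.1f} per 1,000 tokens")
print(f"story:       {rate_story:.1f} per 1,000 tokens")
```

Each invented text contains "will" exactly once, yet the six-token instruction yields a rate roughly thirty times higher than the 201-token passage. Normalised counts from very short texts are therefore unstable, and one possible response, sketched in `split_by_length`, is to analyse such texts separately or to pool them.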

These issues in compiling and exploiting textbook corpora are illustrated with examples from the compilation and analysis of a manually annotated textbook corpus of nine series (43 volumes) of secondary school EFL textbooks.

References

Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Meunier, F., & Gouverneur, C. (2009). New types of corpora for new educational challenges: Collecting, annotating and exploiting a corpus of textbook material. In K. Aijmer (Ed.), Studies in Corpus Linguistics (Vol. 33, pp. 179–201). John Benjamins.
Mindt, D. (1987). Sprache, Grammatik, Unterrichtsgrammatik: Futurischer Zeitbezug im Englischen I (1st ed.). Diesterweg.
Mindt, D. (1992). Zeitbezug im Englischen: Eine didaktische Grammatik des englischen Futurs. Gunter Narr Verlag.
Römer, U. (2004). Comparing real and ideal language learner input: The use of an EFL textbook corpus in corpus linguistics and language teaching. In G. Aston, S. Bernardini, & D. Stewart (Eds.), Studies in Corpus Linguistics (Vol. 17, pp. 151–168). John Benjamins.

Presentation video

Supplementary Information

JAECS2020_TextbookCorpora_LeFoll Download

Q&A live (Zoom) session

No longer available.

16 Comments

Yukie Kondo
September 30, 2020 at 1:30 pm

This looks to be a very interesting paper. I hope you enjoy the conference! – Organizing committee
Elen LE FOLL
October 2, 2020 at 10:25 am

Please note that since the topics have much in common, the live Q&A session for this paper will be shared with the Q&A session for “Investigating Publisher Application of Corpus Research on Recent Language Change To ELT Coursebook Development” by Niall CURRY (Coventry University, UK), Robbie LOVE (Aston University, UK) and Olivia GOODMAN (Cambridge University Press, UK). Watch their talk here: https://jaecs2020.laurenceanthony.net/paper-016/

We look forward to your questions and comments on all things textbook-related!
joshtjordan
October 3, 2020 at 2:19 am

I see the benefit in how you annotated text register. This would be a useful metric for differentiating various textbooks even if you weren’t going the extra step and comparing them to target language corpora. Teachers who are offered a choice between textbooks may want to know which ones contain more conversation, fiction, or informative text, for example.
iskwshin
October 3, 2020 at 2:59 am

Thank you for an interesting talk. I do understand that to represent varied English textbooks as a single genre could be very difficult. This is because the content of the textbooks is easily influenced by the country, the age, and the proficiency of the target learners. This time you collected the textbooks from three countries (Germany, France, Spain). Did you see some substantial difference among the textbooks published in different countries? Or were they similar enough? (Shin Ishikawa, Kobe U)
- Elen LE FOLL
  October 3, 2020 at 6:42 am
  
  Thanks for your interest and question. In all the comparative analyses I have done so far, there have been no significant differences between the three countries of use. However, I have found some noteworthy differences between textbook series.
  - iskwshin
    October 4, 2020 at 2:13 pm
    
    I see. Thank you for your reply!
drprc80
October 3, 2020 at 3:20 am

A great presentation, Elen. I have a PhD student also working on this for MWUs. In our research we were only interested in the reading and listening passages, having these as separate ‘texts’ in the corpus. Given Jesse Egbert’s earlier comments on ‘texts’ vs ‘corpus’, and unless I missed this in your talk, how did you deal with the ‘text’ issue, as presumably having a whole textbook as a text isn’t the way to go? Perhaps this is addressed in your manual register annotation section, but from what I can gather these are ‘subcorpora’ rather than texts?
Best
Peter
- Elen LE FOLL
  October 3, 2020 at 6:44 am
  
  Thanks for your question Peter. We had a chat about this in the live Q&A but for anyone else who might have the same question: yes, I did annotate individual texts as well as their register and I can therefore obtain the frequency of linguistic features per text which, for some analyses, especially the MDAs I have been doing, is crucial of course.
ishikawa.yuka
October 3, 2020 at 3:52 am

Thank you for your very interesting and very informative presentation. I’m interested in your textbook corpus and textbook analysis. Textbooks contain a lot of “exercises”, where students see fragmentary sentences, parts of phrases, or ungrammatical or unnatural phrases. How did you deal with them? Thank you for a wonderful presentation! Yuka Ishikawa, Nagoya Institute of Tech, Japan
- Elen LE FOLL
  October 3, 2020 at 6:47 am
  
  Thank you very much for your interest and question! The exercises were annotated either as “words & phrases” or “contextless sentences” unless they formed a coherent text that was clearly designed to represent a particular register, e.g. a conversation or an informative text. I did not, however, do any more complex exercise annotation like Gouverneur & Meunier, whose corpus design I also briefly presented in my talk. For instance, I did not apply a separate annotation to “ungrammatical sentences” that learners are instructed to correct (though I must say that such exercises were very rare in the textbooks that I included in my corpus).
MartinSchweinberger
October 3, 2020 at 4:17 am

Hi Elen,
Really great talk and very well presented! I was wondering when and how your corpus will be published and how anyone interested will be able to access or download the data?
- Elen LE FOLL
  October 3, 2020 at 6:52 am
  
  Hi Martin,
  Thanks for your kind comment! I was lucky to obtain some of the textbooks directly as text files from the publishers but, unfortunately, only under the condition that I would not share the corpus afterwards. We discussed this issue in the live Q&A session because we realised that, so often, there are many of us working on similar projects and investing lots of time and resources in partially duplicating such efforts which seems like a real waste. As a PhD student, when I embarked on this project, I must say this was the most distressing aspect: having to deal with nontransparent copyright law and not knowing who to turn to for advice. Are you aware of some good guidelines on these issues?
  - MartinSchweinberger
    October 4, 2020 at 5:30 am
    
    Hi Elen,
    
    Again: great talk. Unfortunately, I cannot give any sound advice on copyright-related issues (also because copyright law differs dramatically(!) from country to country). It was really just an honest question. One reason is that data will likely become equivalent to other forms of publication, and if you then have a corpus that is re-used, this will count as a citation (which is good for promotions in the anglophone world)…
mrkm-a
October 3, 2020 at 4:32 am

Thank you for your interesting talk! My comment is shameless self-promotion, but I thought you might find my MDA-based work on textbooks useful to partially contextualise your study (https://tinyurl.com/y3hc2mjt), although I only compared among textbooks in different countries and not between textbook language and authentic language like you did. Also, how are you planning to use dependency parsing in your research? I am not aware of any study on textbooks that exploited it, and think you may be able to find something interesting if you use the parsing well.
- Elen LE FOLL
  October 3, 2020 at 6:55 am
  
  Thank you so much for pointing me towards your study! I must admit I was not aware of it, which is such a shame because it ties in very well with what I have been working on! I very much look forward to reading it.
  
  As for dependency parsing, yes, I have parsed my textbook corpus and my reference corpora with spaCy and have been using the parsing for some of my comparative analyses, though not for any of my MDA attempts so far.
Natalie
October 3, 2020 at 9:57 pm

Hi Elen,
My comment is a thank you, rather than a question – great talk, very clear and easy to follow! I was delighted to see your presentation as I originally had a study as part of my PhD which looked at the frequency of use of topic vocabulary in German textbooks for learners with English as an L1, which I ended up cutting because I couldn’t find convincing solutions to several of the issues you address here! I may well come back to it though and if I do I will certainly draw on some of your techniques – would be great to talk more!
