From the sidelines into the spotlight – addressing problems involving the use of linguistic corpora

Mark Kaunisto reminisces about his experiences as a user of linguistic corpora and talks about a new book that he has recently co-edited.

I was first introduced to the concept of linguistic corpora as tools for studying language use in the 1990s, during the third year of my undergraduate studies. The corpora that were commonly used in those days were the Brown Corpus and the Lancaster-Oslo/Bergen (LOB) Corpus, which included samples of different forms of written American and British English, respectively. Both corpora contained about one million words of text, and it was acknowledged that while they were good for studying high-frequency phenomena in language, larger sets of data would be desirable.

The software that we used to study the corpora was a DOS-based (!) concordancing programme called WordCruncher. The programme took some time to get used to, but overall its commands enabled one to perform rather complex queries on the materials.

It did not take long for personal computers to become more powerful, with dramatic increases in processing speed and in the amount of data they could handle. Before too long we got the 100-million-word British National Corpus, which seemed absolutely massive. And other large corpora followed soon after.

Having used different kinds of corpora for some years already, I remember starting to pay more and more attention to items among the search hits which seemed out of place, or which in general seemed to require closer inspection. Not everything that the search engines produced was neat and tidy; a degree of post-processing of the search hits was (almost always) needed.
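To make the point concrete, here is a minimal sketch (in Python) of the kind of post-processing I mean, with entirely invented material: the sample hits, the tag labels and the cleaning criteria are all hypothetical, but mis-tagged items, verbatim duplicates and metadata leaking into the search hits are familiar nuisances.

    import re

    # Hypothetical raw hits for the noun "record" from a POS-tagged corpus,
    # as (token, tag, context) triples. The data and tags are invented.
    raw_hits = [
        ("record", "NN1", "set a new world record in the"),
        ("record", "VVI", "asked him to record the meeting"),    # verbal use, or a tagging error
        ("record", "NN1", "set a new world record in the"),      # verbatim duplicate
        ("record", "NN1", "<header> sample information record"), # metadata leaking into the hits
    ]

    def clean(hits):
        seen = set()
        kept = []
        for token, tag, context in hits:
            if not tag.startswith("NN"):      # keep only nominal uses
                continue
            if re.search(r"<\w+>", context):  # drop hits contaminated by markup
                continue
            if (token, context) in seen:      # drop verbatim duplicates
                continue
            seen.add((token, context))
            kept.append((token, tag, context))
        return kept

    print(clean(raw_hits))
    # -> [('record', 'NN1', 'set a new world record in the')]

In practice, of course, such automatic filtering only goes so far, and manual inspection of the remaining hits is usually still called for.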

Later, as I was teaching the basics of corpus linguistics to students, I noticed that I was frequently reminding them to watch out for various small things that could in some instances skew the search results. Remember this thing. And that. And that. Oh, and whatever you do, please don’t forget that. Often these things related to corpus annotation, sometimes to human errors in the corpus metadata, sometimes to just bad choices made during the compilation of the corpus. And after a while, these constant reminders made me ask myself whether I was actually discouraging the students from using corpora rather than encouraging them.

Such potentially troubling aspects of using corpora have not escaped corpus linguists, or the compilers of corpora. Quite the contrary. In 1989, Matti Rissanen, professor of English at the University of Helsinki and one of the compilers of the Helsinki Corpus of English Texts, wrote a short but very insightful article, “Three problems connected with the use of diachronic corpora”, published in ICAME Journal, about how linguistic corpora may, in fact, distort naive users’ perceptions of language study. We may start to think that corpora represent language 100% fully and reliably – which they don’t – and we may feel overconfident about commenting on aspects of language use in genres that we aren’t necessarily very familiar with.

I’ve always liked Rissanen’s article in all its conciseness, and as an article written by a scholar who might be expected primarily to promote the use of corpora, it shows great responsibility in advising users to approach corpora with due caution. In addition to the problems that Rissanen envisaged, there might be other problems and challenges worth noting. The speed with which we can now find relevant hits may lead us to prize quick results over careful checking, and the technological advances reflected in ever more elaborate search interfaces may even create an illusion, or an unwarranted expectation, of total reliability and flawlessness. “The Fallacy of Sophisticated Technology” might be one way to describe the issue, to adopt the style of Rissanen’s article.

Corpus linguists are usually well aware of the typical pitfalls. However, in research articles, comments on infelicitous items among the search hits tend to be relegated to footnotes, which is quite understandable. At the ICAME conference in Neuchâtel in 2019, I suggested to some colleagues that we might organise a workshop at a future conference, with the idea of giving focus to the problematic aspects of corpus use. Any small, niggling, persistently troubling items could be examined with the aim of assessing how serious they actually are, and of figuring out how we might solve the problems they pose to corpus users.

Partially based on the presentations at the workshop organised at the 42nd ICAME conference at TU Dortmund University in 2021, a book titled Challenges in Corpus Linguistics: Rethinking Corpus Compilation and Analysis, edited by me and Marco Schilk, has now been published in the Studies in Corpus Linguistics series by John Benjamins Publishing. It is a funny little book, and we hope that readers will find it interesting. Because of its perspective on the use of linguistic corpora, and its overall goal of improving the state of corpus linguistics, the editors and authors of the book are in the rather unusual position of thinking that the sooner its contents become out of date, the better!

Maybe there is room for another book project on how to teach corpus linguistics, with reflective views from experienced teachers. As the field has expanded so much over the years, the sheer range of issues to cover has also broadened. And with the new technical possibilities that can be applied in teaching, the available pedagogical options could perhaps be studied from some different angles. I’m sure there are one or two angles.