Computerlinguistische Methoden für die Digital Humanities. Eine Einführung für Geisteswissenschaftler:innen (Computational Linguistics Methods for the Digital Humanities: An Introduction for Humanities Scholars)
by Melanie Andresen
While I was studying Applied Linguistics at university in Hamburg back in the 1990s, the academic subject of computational linguistics was still developing. I touched on the interdisciplinary field of computing and foreign-language learning and teaching in some of my courses and later went on to write my Master's thesis on acquiring foreign-language vocabulary using software packages then on the German market.
In other courses, we analysed language textbooks using a digital corpus we had created ourselves and some rudimentary commands written in Turbo Pascal, a programming language popular at the time. This simple computer analysis of the contents and structure of language textbooks was essentially a practical, hands-on introduction to corpus linguistics in the language-education sector.
A lot has happened in educational computing since then, and a host of studies have been carried out based on digital corpora of written language from sources like novels or news magazines. Lexicographers (dictionary compilers) built up huge databases of pre-existing linguistic data, most of it written rather than spoken, and analysed these by computer to obtain information about the various senses of a word, its frequency and the contexts in which it was used in the corpus, for example. The Collins COBUILD English Language Dictionary project was an early case in point from the 1980s and led to some very useful material being produced for language learners and teachers alike.
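To give you a feel for this kind of corpus work (my own illustration, not from the book or the COBUILD project), here is a minimal Python sketch that counts word frequencies and produces keyword-in-context (KWIC) lines, the classic way lexicographers inspect the contexts a word appears in. The sample sentences and window size are made up for the example.

```python
# Minimal corpus analysis: word frequencies and keyword-in-context (KWIC).
from collections import Counter
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{keyword}] {right}")
    return hits

corpus = ("The bank raised interest rates. She sat on the bank "
          "of the river. The bank approved the loan.")
tokens = tokenize(corpus)

print(Counter(tokens).most_common(3))
for line in kwic(tokens, "bank"):
    print(line)
```

The KWIC lines immediately show the different senses of "bank" (financial institution vs. riverside) side by side, which is exactly the kind of evidence dictionary compilers draw on.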
Some of the methods employed in corpus linguistics are also used in the field of computational linguistics, so there is an overlap to an extent. Melanie Andresen points this out in the introduction to her recent book on methods used in the digital humanities today, explaining how the two fields are linked and how they differ (see pp. 14-16). In a nutshell, while corpus linguistics describes the use of language by analysing corpora and looking for linguistic patterns, computational linguistics tries to model language on computers in order to find technical solutions to practical problems (p. 16, quoting T. McEnery & A. Hardie, Corpus Linguistics, Cambridge University Press, 2012, p. 228).
Computational linguists analyse language corpora for a number of reasons, one of which is to learn how to simulate language on a computer and get it to produce helpful responses to what users say. So one aim is to create chatbots and other 'interactive assistants' that can converse with customers in a natural-sounding way and help them achieve what they want to do (finding out how to get somewhere while they are driving, say).
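As a toy illustration of the very simplest kind of 'interactive assistant' (again my own example, not the book's), here is a rule-based chatbot that matches patterns in the user's input. Real assistants today use statistical and neural models trained on large corpora; these hand-written rules are purely illustrative.

```python
# A toy rule-based chatbot: match a pattern, return a canned response.
import re

RULES = [
    (r"\b(hi|hello)\b", "Hello! How can I help you?"),
    (r"\bhow do i get to (?P<place>\w+)", "Searching for a route to {place}..."),
    (r"\bthank(s| you)\b", "You're welcome!"),
]

def respond(user_input):
    """Return the first matching canned response, or a fallback."""
    text = user_input.lower()
    for pattern, reply in RULES:
        match = re.search(pattern, text)
        if match:
            # Fill any named groups (e.g. {place}) into the reply template.
            return reply.format(**match.groupdict())
    return "Sorry, I didn't understand that."

print(respond("Hello there"))
print(respond("How do I get to Hamburg?"))
```

The gap between this and a modern assistant is precisely what the book's chapters on machine learning and deep learning are about.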
Melanie Andresen's book is aimed at newcomers to the subject who have a background in the humanities and who presumably want to know how computational methods from another field can help them analyse humanities data in new ways, gaining insights they would not have discovered by taking a traditional approach. She begins by outlining the basic concepts of linguistics of relevance to computational linguists, such as lexical aspects of language, syntax, semantics and pragmatics (chapters 2 to 8). The second part of the book concerns specific methods, such as corpus searches, manual annotation, machine learning and deep learning (chapters 9 to 12). These last two chapters will be of particular interest to translators who use CAT tools or online tools with an MT (machine-translation) component like Linguee/DeepL.
The third and final section of the book is called 'Gesellschaft' (society) and explores the issue of ethics in computational linguistics, or rather, artificial intelligence. Spanning 12 pages (in the first edition of the book), this part discusses issues such as the dual use of technology (applications that can serve both beneficial and harmful ends), unclear authorship and 'author profiling' using metadata about the people who provided some of the linguistic data in the corpus (section 13.2). Other topics she goes into are bias, discrimination and (lack of) representation in the data, all of which can have a negative effect on the computer's linguistic output.
I'm still reading this interesting book at the moment, but one thing that's struck me already is that the author focuses on explaining what computational linguistics is rather than giving her readers an idea of how it could be employed in their own area of the humanities. (The digital dictionary examples above are my own, not hers.) How might it be employed by historians, for example? Some practical suggestions and pointers would have been helpful, I'm sure.
Still, Andresen's introduction to the subject is well written and certainly gives the reader some insights into a new interdisciplinary field that is expanding rapidly. Here's a link to the book on the publisher's website. You can order a printed copy of it online or from any bookshop (Narr Francke Attempto Verlag, 2024, €26.99 for the softcover version, €21.99 for the e-book).
Enjoy the read.
Carl
Friday, 8 August 2025