Natural Language Processing for Teaching Ancient Languages

In recent years, Artificial Intelligence, esp. Natural Language Processing, has come into the focus of public attention, e.g. by generating extremely realistic, human-like texts. In this paper, we will show various applications of language processing to the field of Classics, especially to Latin texts. Different levels of linguistic analysis will be addressed while also highlighting educational benefits and important theoretical pitfalls. In particular, we will show how vocabulary can be assessed and improved through interactive exercises, how treebanks can serve as a basis for Keyword in Context views and how various cultural concepts of antiquity can be conveyed with the aid of distributional semantic models.


Natural Languages
The notion of natural language refers to the challenge of analyzing human communication. We hope to gain insight into our everyday interactions (Crocker 2013, p. 482) by looking at our communicative behavior. For this task, languages like English, Farsi or Ancient Greek are presumably more informative than programming languages like C or Python because the latter rarely serve a strictly communicative purpose, but rather process data for a specific workflow. To achieve that goal, machines usually rely on their own computational capacity, consulting external resources only if explicitly asked to do so. Humans, on the other hand, tend to use joint reasoning to solve advanced problems (Textor 2011, p. 44).
When we decide to use human language as a research object, we may encounter a few problems that are not present in constructed languages: Natural languages evolve continuously (Ljunglöf et al. 2010, p. 60) and contain a great deal of ambiguity or uncertainty (Palmer 2010, p. 15). For the speakers, such evolution and vagueness are desirable because they retain a sufficient amount of flexibility, which is needed in dynamic environments where social interaction is neither rigid nor perfectly consistent.
For machines, however, this constant interpretative performance of humans has to be emulated artificially (Ljunglöf et al. 2010, p. 59; Palmer 2010, p. 9). Depending on our end goal, we often need several steps to decode the meaning of, e.g., an ancient text (see Fig. 1). Each of those steps plays an important role in providing the necessary information for a machine to decode a given linguistic input. Current systems for Natural Language Processing (NLP) are quite proficient at analyzing various lexical aspects of texts in most languages. Syntax, on the other hand, is much harder to analyze, especially for languages where the availability of high-quality research data is quite limited (McGillivray 2013, p. 3; Ragni et al. 2014; Karakanta et al. 2018, p. 168). Unfortunately, this is true for most ancient cultures. Even for Latin, where a rich tradition survived, the majority of written evidence is lost irrecoverably. Therefore, NLP applications for such problematic languages currently cover only a limited amount of syntactic or semantic analysis, let alone pragmatics. In the following, we shall look at several examples of what works and what does not.

Finding the Right Text
Vocabulary is a crucial aspect of teaching ancient languages, which is why there have been ongoing efforts to determine a certain number of words for a core vocabulary (Utz 2000, p. 146; Jones et al. 2006; Robillard et al. 2014, p. 2). Traditionally, it was assumed that such a basic vocabulary should be acquired in an initial learning phase that is then followed by a phase of extensive reading of ancient literature (Freie und Hansestadt Hamburg, Behörde für Bildung und Sport 2004, p. 10). Nowadays, however, longer learning processes are emphasized more strongly, even for historical languages that we only learn at schools or universities (Foley et al. 2017). What we actually want is to measure the lexical knowledge of a given learner at any point in time, which has been notoriously difficult for humans and machines alike (Chen 2011, 292f.; Dorça et al. 2013, p. 2094; Munser-Kiefer et al. 2018, p. 115; Beyer 2018, p. 13).
Once we know more about a learner's current lexical proficiency, yet another challenge awaits us: What is a suitable text passage or exercise for that person in order to advance further along the path of language learning? Basic operationalizations for this task include comparing a list of supposedly known words to a list of lemmata that occur in a text. Similar to computational models (Parada et al. 2010, p. 57), learners will often struggle to deal with lemmata that go beyond their available vocabulary. Teachers therefore usually want to provide additional help, e.g. in the form of explanatory contexts, glosses, dictionaries or simple translations. Unfortunately, such countermeasures may not work consistently because we often do not know the actual degree of (un-)familiarity for a given lexeme and learner. Since vocabulary knowledge is multidimensional (González-Fernández et al. 2019, p. 3), computational operationalizations of lexical progression need to incorporate more than just the binary decision of "known/unknown". This also applies to diagnosis and feedback, where the dichotomy "correct/incorrect" is often too simple. What NLP (and research on vocabulary acquisition in general) needs is a consistent model for providing information on various error types, forms of knowledge and understanding of tasks or instructions (Narciss 2008, p. 135). A basic starting point towards this goal is to apply an extensive metadata schema for exercises that separately encodes the types of interaction, linguistic phenomena and possible embedding in a larger progression.
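The basic operationalization mentioned above, i.e. comparing a list of supposedly known words to the lemmata of a text, can be sketched in a few lines of Python. This is only an illustration of the binary "known/unknown" baseline that the paragraph criticizes, not a model of graded vocabulary knowledge. The function name, the hand-lemmatized passage and the learner's word list are assumptions made for this example.

```python
def lexical_coverage(text_lemmata, known_lemmata):
    """Return the share of lemma tokens the learner is assumed to know
    and the sorted list of lemmata that are missing from the word list."""
    known = set(known_lemmata)
    if not text_lemmata:
        return 0.0, []
    hits = sum(1 for lemma in text_lemmata if lemma in known)
    unknown = sorted(set(text_lemmata) - known)
    return hits / len(text_lemmata), unknown


# Hand-lemmatized opening of Caesar's Bellum Gallicum (lemmatization assumed):
# "Gallia est omnis divisa in partes tres"
passage = ["Gallia", "sum", "omnis", "divido", "in", "pars", "tres"]

# Hypothetical word list of a single learner
learner_vocab = {"sum", "omnis", "in", "pars", "tres"}

coverage, unknown = lexical_coverage(passage, learner_vocab)
print(f"coverage: {coverage:.2f}, unknown lemmata: {unknown}")
```

Ranking candidate passages by such a coverage value is one simple way to pre-select texts, but it says nothing about the degree of familiarity with each lexeme.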
Assuming we successfully identified certain words to be learned and found text passages in which these occur, the next step would be to create appropriate exercises for this material. Traditionally, vocabulary has been acquired by memorizing lists of word equations in the form "Latin word = L1 word" (Carter et al. 1997, p. 2). More recent approaches, on the other hand, have emphasized vocabulary acquisition in contexts rather than isolated word forms (Waiblinger 2001, p. 160; Webb 2008, p. 238; Nation 2012, p. 353). Ideally, such contexts should contain authentic rather than artificial utterances (Römer 2009, p. 93; Tok 2010, p. 509) to avoid an oversimplification of language that would lead to a shock for learners later on when they are suddenly confronted with real-world texts (Schibel 2013, p. 115). In this regard, NLP can be employed to make use of authentic text corpora to create contextualized vocabulary exercises. At the very least, this means to have pairs of words, e.g. nouns and their adjectival modifiers (see Fig. 2). Other setups may turn out to be even more effective, e.g. Keyword in Context views (Helm 2009, p. 97) or clozes as a means of differentiating between similar conjunctions (see Fig. 3). Using authentic language as a basis for the exercises has the added benefit of implicitly confronting learners with many linguistic patterns and structures, e.g. in syntax or lexis. This way, even if the focus of the exercise is on a very specific phenomenon, they will also internalize many other properties of the target language that are not emphasized separately (Röhr-Sendlmeier et al. 2012, p. 45). Besides, using an interactive digital system to communicate such materials can be more inclusive, more motivating and conducive to a deeper understanding of word meaning (Crossley et al. 2010, p. 71; E. C. Schmid 2010, pp. 165-169; Harecker et al. 2011, pp. 1-5).
Furthermore, automatic evaluation of exercises enables us to provide ongoing formative, visualized feedback and to construct individual learning paths for each person (Chen 2011, 292f.; Ferguson 2012; Univio et al. 2019, p. 158). This may well be one of the most important benefits of NLP for language teaching.
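Two of the contextualized exercise formats discussed in this section, Keyword in Context views and clozes, can be derived from a tokenized authentic passage with very little code. The following is a minimal sketch under simplifying assumptions: tokenization is plain whitespace splitting, matching is done on surface forms rather than lemmata, and the function names `kwic` and `cloze` are chosen for this example only.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def cloze(tokens, targets, blank="____"):
    """Blank out every token whose form is in targets.
    Returns the exercise text and the list of expected answers in order."""
    wanted = {t.lower() for t in targets}
    text, solutions = [], []
    for tok in tokens:
        if tok.lower() in wanted:
            text.append(blank)
            solutions.append(tok)
        else:
            text.append(tok)
    return " ".join(text), solutions


# Authentic passage (Caesar, Bellum Gallicum 1.1), punctuation removed
tokens = "Gallia est omnis divisa in partes tres quarum unam incolunt Belgae".split()

for left, key, right in kwic(tokens, "partes", window=2):
    print(f"{left:>12} [{key}] {right}")

exercise, answers = cloze(tokens, {"in"})
print(exercise)
print(answers)
```

Because the expected answers are returned together with the exercise text, a learner's input can be compared against them automatically, which is the precondition for the formative feedback described above.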

Treebanks and Learners' Expectations
When learners acquire vocabulary, they do not just learn about the lexis, i.e. when to use which word. In fact, they also need to grasp the word's meaning and position in various contexts. In this view, the theoretical distinction between lexicon, syntax and semantics becomes blurry, or at least highly interwoven (Rich et al. 1991, p. 410; Aijmer 2009, p. 3; Rei et al. 2014, p. 75; Lehecka 2015, p. 6; Lebani et al. 2018, p. 133). Considering that there cannot be a full understanding of any word in a specific context without knowledge of its syntactic function (Gries et al. 2013, p. 348; H.-J. Schmid et al. 2013, p. 551), we need to make sure that advanced learners' expectations, when reading the beginning of a sentence, have been shaped and trained by as many similar contexts as possible. For beginners, on the other hand, the text passages that they are confronted with have to be chosen in a way that they do not match their expectations perfectly. In this manner, they will be forced to adapt their mental representation of syntagmatic structures in the target language, thus extending their knowledge (Ellis 2008, p. 374; Farmer et al. 2011, p. 2059; Hahn et al. 2019, p. 14).
Historically, such experiential modifications of linguistic knowledge have been exemplified in grammar books, which often impose rather prescriptive standards and use several authentic instances of language use to support their claims, followed by a few exceptions where the general rule does not apply (Menge 1914, p. 334). With the advent of curated text corpora of decent size even for historical languages, however, we may now replace the textual basis from which we deduce linguistic assumptions with suitable ad hoc corpora. Treebanks, i.e. syntactically annotated text collections (often including multiple authors), should be used both as a standard reference for the target language in general and as a pool for extracting information about specific subcorpora, e.g. all works of a certain author. If that author's works are to be read in school, teachers can access the relevant treebank (Bamman et al. 2011) through dedicated corpus search tools (Krause et al. 2016) and see which constructions are particularly important to understand the chosen texts. Furthermore, educational publishing companies may choose to base their next textbook's vocabulary only on those texts that are part of the curriculum at later stages.
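To make the idea of extracting constructions from a treebank more concrete, the following sketch collects nouns and their adjectival modifiers from a dependency-annotated sentence. It assumes the tab-separated CoNLL-U layout of the Universal Dependencies project (ten columns, with lemma, part of speech, head and relation in columns 3, 4, 7 and 8), which is not necessarily the format of the treebank cited above. The sample sentence is an invented and hand-annotated toy example, and all function names are hypothetical.

```python
# Toy sentence "Agricola bonus terram fertilem colit." (invented, hand-annotated)
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
ROWS = [
    ("1", "Agricola", "agricola", "NOUN", "_", "_", "5", "nsubj", "_", "_"),
    ("2", "bonus", "bonus", "ADJ", "_", "_", "1", "amod", "_", "_"),
    ("3", "terram", "terra", "NOUN", "_", "_", "5", "obj", "_", "_"),
    ("4", "fertilem", "fertilis", "ADJ", "_", "_", "3", "amod", "_", "_"),
    ("5", "colit", "colo", "VERB", "_", "_", "0", "root", "_", "_"),
    ("6", ".", ".", "PUNCT", "_", "_", "5", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def parse_conllu(block):
    """Parse one CoNLL-U sentence into a list of token dictionaries."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens


def noun_adjective_pairs(tokens):
    """Return (noun lemma, adjective lemma) for every adjectival modifier."""
    by_id = {tok["id"]: tok for tok in tokens}
    pairs = []
    for tok in tokens:
        head = by_id.get(tok["head"])
        if (tok["upos"] == "ADJ" and tok["deprel"] == "amod"
                and head is not None and head["upos"] == "NOUN"):
            pairs.append((head["lemma"], tok["lemma"]))
    return pairs


pairs = noun_adjective_pairs(parse_conllu(SAMPLE))
print(pairs)
```

Run over all sentences of one author's subcorpus and counted, such pairs would give a rough picture of which noun-adjective combinations learners are likely to meet in the chosen texts.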

Modeling the Meaning of Words
While NLP practitioners have access to more and more lexical and syntactic resources to provide teachers with useful materials, the same cannot be readily said about semantics. There have been efforts to create expert databases (Fellbaum et al. 2012, p. 315) that are supposed to represent human semantic knowledge. Unfortunately, these are often built from personal intuition rather than empirical evidence. One of the most promising approaches for overcoming this problem is distributional semantics, which defines a word's meaning by looking at its surrounding context (Harris 1954, p. 162; Firth 1957, p. 30). It has been on the rise in recent years, especially because of the Deep Learning hype (Lin 2019). Furthermore, good distributional semantic models (DSMs) do not just represent semantic relations, but also morphological or even pragmatic information (Gries et al. 2009, p. 59; Gladkova et al. 2016, p. 8), which is not surprising given the interwovenness of the various linguistic levels that was mentioned above.
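The distributional hypothesis can be illustrated with a deliberately simple count-based model: each word is represented by the words that occur within a small window around it, and two words are compared by the cosine of their context vectors. This sketch is not one of the neural models referred to in this paper. The window size, the function names and the three lemmatized toy sentences are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, the words occurring within the given window."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counter objects)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Invented, lemmatized toy sentences (not taken from any corpus)
corpus = [
    ["verus", "amicus", "fides", "servo"],
    ["falsus", "amicus", "fides", "fallo"],
    ["miles", "gladius", "pugno"],
]

vectors = cooccurrence_vectors(corpus, window=2)
print(cosine(vectors["verus"], vectors["falsus"]))  # shared contexts
print(cosine(vectors["verus"], vectors["miles"]))   # no shared contexts
```

In this toy corpus, verus and falsus come out as highly similar because they share the same contexts, although they are antonyms. This is a small-scale instance of the difficulty of separating different kinds of relatedness in such models.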
Obviously, such approaches suffer from many problems:
- They tend to ignore common knowledge that is available to every human, but is never mentioned in the given texts (Bruni et al. 2014, p. 3).
- They struggle to adequately represent rare or metaphoric word usage (Grigonytė et al. 2010, p. 404).
- Their inferential power (e.g. by analogy) is very case-specific and cannot be easily generalized (Rogers et al. 2017, 142f.).
- They often do not model polysemy at all and do not differentiate between, e.g., synonymy and syntagmatic relatedness (Karan et al. 2012, 114f.; Faruqui et al. 2016).
Fortunately, these points have attracted attention and led to serious improvements, especially concerning the modeling of polysemy (Hamilton et al. 2016, p. 8). A lot of research has been done on using textual context as a source of information on a word's meaning, e.g. by hiding words in a text and making a machine fill the blanks correctly (Devlin et al. 2018) or by systematically comparing various computational operationalizations of linguistic knowledge (Dobó 2019, p. 85). Thereby, standard procedures in philology, such as finding semantically related words for a given topic in a given text corpus (Cordes 2020, p. 43), can be facilitated through machine learning output that is interactively visualized as a network (see Fig. 4). In such networks, users can start from a single word (veritas) and quickly expand on that to find other related terms like simulatio (pretense), crederet (to trust) or suggerendis (to suggest). In this sense, the procedure is comparable to snowball sampling (Handcock et al. 2011, p. 368) because once a user has found these additional terms, each one of them can be used as the basis for another search. Such search methods have been used with traditional linguistic resources as well (e.g. dictionaries, catalogues of synonyms etc.), but they have rarely been adapted to a specific researcher's target data. Using a dynamic machine learning approach enables NLP software to apply the general method (i.e. extracting word fields from a text) to almost any given corpus. The most important challenge, then, will be to make the base architecture useful for as many cases as possible. This way, we can optimize the method for many different usage scenarios at the same time, instead of starting from scratch for every new text corpus.
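The snowball-like expansion of a word field can be sketched as a breadth-first traversal over nearest-neighbour lists. In the sketch below, the neighbours of veritas are the terms named above, while the second-level neighbours are invented placeholders. In a real system, the lists would be produced by a distributional semantic model. The function name and the parameters `max_depth` and `top_k` are assumptions for this example.

```python
from collections import deque


def snowball(seed, neighbours, max_depth=2, top_k=3):
    """Expand a seed word breadth-first through its nearest neighbours.
    Returns the network edges as (source, target, depth) and all terms found."""
    seen = {seed}
    edges = []
    queue = deque([(seed, 0)])
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for candidate in neighbours.get(word, [])[:top_k]:
            edges.append((word, candidate, depth + 1))
            if candidate not in seen:
                seen.add(candidate)
                queue.append((candidate, depth + 1))
    return edges, sorted(seen)


# First level: terms mentioned in the text; second level: invented placeholders
neighbours = {
    "veritas": ["simulatio", "crederet", "suggerendis"],
    "simulatio": ["veritas", "fraus"],
    "crederet": ["fides"],
}

edges, terms = snowball("veritas", neighbours, max_depth=2)
for source, target, depth in edges:
    print(f"{source} -> {target} (step {depth})")
print(terms)
```

The resulting edge list can be handed to any graph visualization tool. Limiting `max_depth` and `top_k` keeps the network small enough to remain readable for learners.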
Methodologically speaking, snowball sampling is applicable to entire sentences and documents as well (see Fig. 5). Since the amount of linguistic output here is considerably larger, users will probably not interact with it in the same way as with the network. Instead, such lists can be seen as a semantic equivalent of Keyword in Context views, i.e. depicting descriptions of the same target entity (here: factuality as designated by the input word vera) in various contexts. This is especially useful for teachers who do not want to convey the meaning of one specific term, but rather of an entire concept or topic, which in turn is often essential to a deep understanding of ancient texts.
Another important aspect resides in the explainability of NLP results (Doran et al. 2017, 4f.). Especially in the case of semantics, finding relevant responses to a given query often includes complex modeling (Divjak et al. 2009, p. 274; Weale et al. 2009, p. 29), specific statistical measures (Hagiwara et al. 2009, p. 566) or searching in additional resources (Ono et al. 2015, pp. 984-988). Usually, explainability of the underlying algorithms is less important for teachers than for researchers, but from an epistemological and educational perspective, a lot is to be learned by applying introspection to the decision processes of Artificial Intelligence. This becomes apparent in cases where machines produce convincing visualizations using improper modeling. A good example is the stylometric study by Ochab et al. (2019), in which a seemingly simple decision process (i.e. authorship attribution) is made more complicated by many confounding variables, such as text length or topic (Golcher et al. 2011, pp. 31-33; Ochab et al. 2019, p. 141).

Conclusion
In the end, teachers have to be aware that NLP can solve some problems better than others. It is quite suitable to train learners' vocabulary or even syntactic expectations by confronting them with interactive, individualized exercises and materials that have been tailored to their current state of knowledge. However, a teacher cannot rely solely on software because human domain-specific expertise and social sensitivity are needed to provide elaborate advanced feedback to the students. Moreover, machines may retrieve and visualize relevant search results very efficiently, but they must not take interpretation and decision-making off the learners' hands because those processes constitute the core of consensus negotiation and, thus, knowledge acquisition. Besides, even in the easy cases, one always has to be aware of risks such as systematic bias, weak statistical measures or overly suggestive visualizations. Apart from such implicit or hidden weaknesses, some tasks are known to be too difficult for machines at the moment, e.g. reliable and highly accurate parsing of syntax for historical languages. Fortunately, 'at the moment' in this case is arguably a question of a few years rather than decades. Other, more complex tasks like Word Sense Disambiguation may need considerably more time to meet a similar milestone.