
Morphological analyser Morfeusz

Basic concepts

A word is a sequence of letters in a natural-language text, usually delimited by spaces or punctuation marks. A lexeme is an abstract unit of language, a word in the dictionary sense. A word form is a word that has been interpreted, that is, assigned to a particular lexeme and described in terms of its grammatical function.

Morphological analysis consists in determining, for a given word, all forms of all lexemes that the word can be an exponent of. The context in which the word appears is not taken into account in this process. In linguistics, the term morphological analysis refers rather to the segmentation of words into elementary morphological components (morphemes), so it would be more accurate to call the process inflectional analysis. Unfortunately, the former term has become established in computational linguistics.

Morphological disambiguation consists in determining, on the basis of context, which form is realized by a particular occurrence of a word.

Morphological analysis followed by disambiguation is referred to in jargon as tagging.

The aim of lemmatization is to determine, for each word in a text, the entry of a morphological dictionary (the lexeme) that describes it. It is therefore morphological analysis (or tagging) restricted to one part of the information on forms, namely the lemmata.

Approximate lemmatization is sometimes called stemming and consists in stripping from each word the part that changes under inflection. This method makes sense for languages with limited inflection but is insufficient for Polish. Hence, the analysis done by Morfeusz is proper lemmatization.
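The difference can be illustrated with a toy sketch in Python. The suffix list and the four-entry dictionary are invented for this illustration (the word forms and lemmata are those of the sample sentence analysed further on this page); this is not how Morfeusz itself works.

```python
# Toy illustration only: both the suffix list and the mini-dictionary
# are assumptions made up for this example.
SUFFIXES = ("ę", "y", "ej", "a")  # hypothetical inflectional endings

LEMMA_DICT = {  # hypothetical dictionary: word form -> possible lemmata
    "mam": ["mama", "mamić", "mieć"],
    "próbkę": ["próbka"],
    "analizy": ["analiza"],
    "morfologicznej": ["morfologiczny"],
}


def naive_stem(word):
    """Stemming: strip the longest matching ending, keep at least 2 letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Lemmatization: look the form up and return all candidate lemmata."""
    return LEMMA_DICT.get(word.lower(), [])


if __name__ == "__main__":
    for form in ("Mam", "próbkę", "analizy", "morfologicznej"):
        print(form, "| stem:", naive_stem(form), "| lemmata:", lemmatize(form))
```

The stem of 'próbkę' comes out as 'próbk', which is not a dictionary entry, and no suffix stripping can relate 'Mam' to 'mieć'; only a dictionary lookup recovers such lemmata.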

The inverse of morphological analysis is morphological synthesis: producing the exponent of an inflectional form specified by a lemma (the identifier of the lexeme) and the desired inflectional characteristics.

Program Morfeusz

The Morfeusz program performs morphological analysis of Polish. The present version has no module for guessing unknown words (so it can be said to be a morphological dictionary).

Here is a sample of the analysis for the text ‘Mam próbkę analizy morfologicznej.’:

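0 1   Mam              mama [mother]                   subst:pl:gen:f
                       mamić [to beguile]              impt:sg:sec:imperf
                       mieć [to have]                  fin:sg:pri:imperf
--------------------------------------------------------------------------------
1 2   próbkę           próbka [sample]                 subst:sg:acc:f
--------------------------------------------------------------------------------
2 3   analizy          analiza [analysis]              subst:sg:gen:f
                                                       subst:pl:nom.acc.voc:f
--------------------------------------------------------------------------------
3 4   morfologicznej   morfologiczny [morphological]   adj:sg:gen.dat.loc:f:pos
--------------------------------------------------------------------------------
4 5   .                .                               interp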

Each line of the table contains one morphological interpretation; the horizontal lines separate the groups of analyses for particular words. The input text was segmented into words (in particular, the full stop was separated from the word ‘morfologicznej’), and the two numbers at the start of each group give the positions between which the word lies. To the right of each word, the corresponding lemma (dictionary entry) is given with an English gloss in brackets. The last column presents tags describing the values of the grammatical categories of each form.

Three interpretations were assigned to the word ‘Mam’: the genitive plural of the noun ‘mama’, the imperative of the verb ‘mamić’, and the present-tense form of the verb ‘mieć’. The word ‘analizy’ was assigned unambiguously to the lemma ‘analiza’, but it can be interpreted either as a singular form (genitive) or as a plural form (nominative, accusative or vocative).
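The two kinds of ambiguity described above (several lemmata for one word versus several tags for one lemma) can be made explicit with a small Python sketch. The record type `Interpretation` and the helper names are hypothetical; the data are simply the rows of the sample analysis typed in by hand, not output obtained from the library.

```python
from collections import namedtuple

# Hypothetical record type: one row of the sample analysis.
Interpretation = namedtuple("Interpretation", "start end form lemma tag")

# The sample analysis of 'Mam próbkę analizy morfologicznej.', typed in by hand.
ANALYSIS = [
    Interpretation(0, 1, "Mam", "mama", "subst:pl:gen:f"),
    Interpretation(0, 1, "Mam", "mamić", "impt:sg:sec:imperf"),
    Interpretation(0, 1, "Mam", "mieć", "fin:sg:pri:imperf"),
    Interpretation(1, 2, "próbkę", "próbka", "subst:sg:acc:f"),
    Interpretation(2, 3, "analizy", "analiza", "subst:sg:gen:f"),
    Interpretation(2, 3, "analizy", "analiza", "subst:pl:nom.acc.voc:f"),
    Interpretation(3, 4, "morfologicznej", "morfologiczny",
                   "adj:sg:gen.dat.loc:f:pos"),
    Interpretation(4, 5, ".", ".", "interp"),
]


def group_by_segment(rows):
    """Collect the interpretations that share the same text segment."""
    groups = {}
    for row in rows:
        groups.setdefault((row.start, row.end, row.form), []).append(row)
    return groups


def ambiguity(rows):
    """Map each form to (number of distinct lemmata, number of rows)."""
    return {
        form: (len({r.lemma for r in group}), len(group))
        for (_, _, form), group in group_by_segment(rows).items()
    }


if __name__ == "__main__":
    for form, (n_lemmata, n_rows) in ambiguity(ANALYSIS).items():
        print(f"{form}: {n_lemmata} lemma(ta), {n_rows} interpretation(s)")
```

For ‘Mam’ the sketch reports three lemmata and three interpretations, for ‘analizy’ one lemma and two interpretations.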

The tags used in Morfeusz are positional. The first position defines the part of speech; the following ones give the values of the grammatical categories appropriate for that class. For instance, the tag subst stands for a noun and is followed by the values of number, case and gender. The tag names are usually abbreviations of the Latin names of the values. The tags are modeled on the tagset described in the article Morphological tagset in the IPI PAN corpus, published in Polonica XXII/XXIII, 2003, pp. 39-55 (in Polish).
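A minimal Python sketch of how such positional tags can be taken apart follows. It assumes, judging by the sample analysis, that positions are separated by colons and that a dot joins alternative values within one position; the function names are hypothetical, and only the noun (subst) positions are named, because only those are spelled out in the text.

```python
from itertools import product

# Category names for noun tags, in the order given in the text.
SUBST_CATEGORIES = ("number", "case", "gender")


def parse_tag(tag):
    """Split a tag into its part of speech and a list of value lists."""
    pos, *positions = tag.split(":")
    return pos, [position.split(".") for position in positions]


def expand_tag(tag):
    """Expand dotted alternatives into a list of fully specified tags."""
    pos, positions = parse_tag(tag)
    return [":".join((pos, *combo)) for combo in product(*positions)]


def describe_subst(tag):
    """Name the positions of a noun tag; other classes are not handled."""
    pos, positions = parse_tag(tag)
    if pos != "subst":
        raise ValueError("only subst tags are described in this sketch")
    return dict(zip(SUBST_CATEGORIES, positions))


if __name__ == "__main__":
    print(expand_tag("subst:pl:nom.acc.voc:f"))
    print(describe_subst("subst:sg:acc:f"))
```

Here `expand_tag("subst:pl:nom.acc.voc:f")` yields the three tags subst:pl:nom:f, subst:pl:acc:f and subst:pl:voc:f, while a tag without further positions, such as interp, is returned unchanged.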

Versions of the program Morfeusz

Three versions of linguistic data used with Morfeusz are available:

The data in Morfeusz SIaT were approximate by nature, since the Index compiled by Tokarski and Saloni describes Polish inflection as a potential. Some inflectional interpretations, which were generated automatically, are necessarily excessive and in some cases entirely incorrect. In SGJP, by contrast, the description of each lexeme was prepared individually, so it can be much more accurate. The dictionary in this version is also much more extensive. As a consequence, the SIaT version of the program has been abandoned and these pages concern above all Morfeusz SGJP.

Morfeusz as a program has two variants. The older one (version 1) was in use until 2013, when the program was reimplemented from scratch as Morfeusz 2. The new version is free of the known problems of version 1, has a modern object-oriented C++ interface, provides additional information about the analysed words (a classification of proper names and stylistic labels), and is equipped with a synthesis module.

Morfeusz is available as a dynamic-link library (compiled for Linux 32/64-bit, Windows and Mac OS X/Intel32). The distribution includes a simple command-line program that uses the library. The program reads text from standard input and writes the result to standard output. A Java-based graphical interface (GUI) is also available.

More interesting, in our view, is the possibility of using the Morfeusz library in the user’s own programs, e.g. ones written in a scripting language. This allows the analyser’s output to be used more flexibly (e.g. the analysis can easily be limited to lemmatization alone). The modules created so far allow Morfeusz to be used in programs written in C/C++, Java, Perl, Python, SWI Prolog, and PHP.
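As a hedged sketch of such post-processing in Python, the function below reduces a list of interpretations to lemmata only. It assumes that the analyser's output has already been converted to plain tuples of (start, end, form, lemma, tag); that tuple shape, the function name `lemmata_only` and the hand-typed sample rows are assumptions of this example, not the documented interface of any Morfeusz module.

```python
def lemmata_only(results):
    """Reduce (start, end, form, lemma, tag) rows to lemmata per word.

    Returns a list of (form, [lemmata]) pairs in text order, with
    duplicate lemmata (same lemma under different tags) removed.
    """
    by_segment = {}
    for start, end, form, lemma, _tag in results:
        _, lemmata = by_segment.setdefault((start, end), (form, []))
        if lemma not in lemmata:
            lemmata.append(lemma)
    return [by_segment[key] for key in sorted(by_segment)]


# Hand-typed rows in the assumed shape, not real library output.
SAMPLE = [
    (0, 1, "Mam", "mama", "subst:pl:gen:f"),
    (0, 1, "Mam", "mamić", "impt:sg:sec:imperf"),
    (0, 1, "Mam", "mieć", "fin:sg:pri:imperf"),
    (2, 3, "analizy", "analiza", "subst:sg:gen:f"),
    (2, 3, "analizy", "analiza", "subst:pl:nom.acc.voc:f"),
]

if __name__ == "__main__":
    for form, lemmata in lemmata_only(SAMPLE):
        print(form, "->", ", ".join(lemmata))
```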

Last modified: 10.05.2016 sgjpol@gmail.com