Finite-state morphology with lexc and xfst

This is a first hands-on introduction to developing a finite-state morphological analyzer with Xerox formalisms lexc and xfst. The examples below will use HFST, but they should work (almost without modification) using Foma. Both of these tools provide free implementation of lexc and xfst from Xerox, which was used for building morphological analyzers for many languages.

What to implement

Developing a wide-coverage, general-purpose morphological analyzer is a big undertaking. It often takes months of expert time to construct such analyzers. Here, we will only deal with the basics, with almost no complications at all.

The problem we are working on in this tutorial is to generate/analyze nouns and their plural forms in English. We will work with the following singular-plural pairs.

bat     bats
cat     cats
cow     cows
fox     foxes

There are many ways to structure your analyzers. Here, we will define three submodules.

The root lexicon (lex), will only enumerate the root forms
The morphotactics (morph) component will add the analysis symbols for the singular and plural we are interested in
The morphophonological (phon) rules will transform the analysis form to the surface form.

Our final generator will concatenate lex and morph and compose it with phon. We will keep lex and morph as FSA (FSTs with identity transitions), only using non-identity transductions in the phon component.

The idea of building a “generator” rather than an analyzer is very common. Actual components, and how they are combined may vary considerably.

XFST command line

Xerox xfst environment provides a command-line for experimenting with FSTs built using xfst/lexc. You can start typing hfst-xfst on the command line (on Unix-like systems). Once started there are a number of commands to interact with the system. We will use only a few, but you can get a list with brief descriptions using the help command.

Defining the lexicon

We will start with defining our lexicon, starting with a single “root” form.

regex cat;
view net

The first command regex defines a regular expression, compiles it as the current automaton in use. After the regex command you should see the number on the command line prompt in brackets increased (to 1). To indicate that there is an automaton on the xfst “stack”.

If the support programs are installed (mainly GraphViz and an image viewer), the system will present the following graph:

Single-word lexicon

Although this may look fine at first look, this takes the whole word “cat” as an atomic symbol. We will want to know certain aspects of the words’ orthography. So, we want each character as symbols of our automata.

regex c a t;

This forms cat as concatenation of three letters, and the corresponding automata is:

Single-word lexicon, take 2

Another way of obtaining the same result is placing the words that we want to break into their characters in curly braces. So the effect of regex {cat} would be the same.

To add the other words, we will, for now, use the union operator | to take the union of single-word automata like the one above. For all the words in our list,

regex {bat}|{cat}|{cow}|{fox};

Should display an automaton like Lexicon with all words .

Note that shared paths are represented only once.

Morphotactics

Now our root lexicon is ready. We will define it as a “variable” where we can make use of it later. The command

define lex

should define a automata lex with the current regular expression. Note that xfst environment removes this from its automata stack (the number on the command line should decrease by 1). We can now turn our attention to define our simple morphotactics. We want to be able to attach two symbols representing the number feature of the nouns. We will use <sg> and <pl> to represent singular and plural respectively. Placing analysis symbols in angle brackets is common, another popular approach is prefixing them with a +. The morphotactics automaton is really simple. We will just take the union of automata defining the above symbols:

define morph %<sg%> | %<pl%>;

The above expression defines the morph variable directly. Since the angle brackets are special symbols for xfst, we escape them using %. Another option would be using double quotes like "<sg>". Note that we define <sg> and <pl> as atomic symbols, there is no need for inspecting their individual characters. If you want to see the resulting automata, you can use:

regex morph;

Not very interesting, but here is the result for completeness.

Morphotactics .

Now we have both our lexicon and the morphotactics module, we can concatenate them:

regex lex morph;

which, not surprisingly, gives us the following automaton:

Morphotactics

Morpho-phonological alternations

The final module we are interested in maps these analysis symbols to their surface realizations. The singular case is easy: we want to map the symbol <sg> to the empty string. For plurals, we have a slight complication. Almost all words require the suffix -s, except for fox, for which we need -es. The following implements this as parallel application of three rules:

define phon "<sg>" -> 0 ,, "<pl>" -> s ,, [..] -> e || x _ "<pl>";

The first rule maps <sg> to the empty string, which is represented with 0 in xfst. The second maps <pl> to s, unconditionally. The third rule uses a special syntax, [..] -> s, for single insertion, that applies when the left context is an x and the right context is <pl>. This results in the following FST:

Morphotactics

Here, the empty string is represented with 00, and the symbol ?? represents any symbol not listed. You should study this to convince yourself that it does what it is supposed to do.

Putting all together and testing

Now we can put all the modules together according to our initial plan:

regex [ lex morph ] .o. phon;

We already know that lack of an operator between two regular expressions is interpreted as concatenation. The operator .o. is the composition operator. The square brackets, unlike in most regex implementations, mean grouping in xfst. Here, they are used only for demonstration, concatenation has higher priority than composition. So, the result without the brackets is the same. The resulting network is:

Morphotactics

Now our analyzer/generator is complete. We can test the result in the xfst environment. Remember that the above was defined as a generator.

hfst[1]: down cat<sg>
cat
hfst[1]: down cat<pl>
cats
hfst[1]: down fox<sg>
fox
hfst[1]: down fox<pl>
foxes
hfst[1]: down pig<sg>
???

The listing also includes the xfst prompt, we only type the commands starting with down, which is a shortcut for apply down command running the FST with the analysis string (upper string in Xerox terminology) and returning the surface string (lower string). The analyzer seems to be doing the generation correctly, and the last item shows a failure, which happens because of the missing root word pig in the lexicon.

The original xfst implementation (and Foma) also offers an apply up command. However, this is not implemented in HFST. If we want to run our FST in analysis mode, we need to invert it. Here are a few examples with the analysis:

hfst[1]: invert
? bytes. 11 states, 15 arcs, ? paths
hfst[1]: down cat
cat<sg>
hfst[1]: down cats
cat<pl>
hfst[1]: down fox
fox<sg>
hfst[1]: down foxes
fox<pl>
hfst[1]: down cates
???
hfst[1]: down foxs
???

Note that the analyzer analyzes the correct forms, while rejecting some obvious incorrect forms.

Using lexc

Although in principle we can do everything with xfst, lexicon and morphotactics can be defined and processed better using lexc. To do so, we will create a file english.lexc with the following content.

Multichar_Symbols %<sg%> %<pl%>

LEXICON Root
bat    NounNumber;
cat    NounNumber;
cow    NounNumber;
fox    NounNumber;

LEXICON NounNumber
%<sg%>          #;
%<pl%>          #;

A lexc file is organized as a set of continuation lexicons. The first lexicon is conventionally called Root. Each entry in a lexicon contains a surface part, and possible continuation lexicon. The above simple lexicon tells that all of the forms can be “suffixed” with what is specified in the NounNumber lexicon. The special continuation lexicon # means the word ends here, it is an accepted form (of course, we will compose this with our morphophonological alternations).

Unlike in xfst, we do not need curly braces to expand the words into individual characters. lexc will do this automatically. To prevent it, we need to list the symbols with more than one character using the Multichar_Symbols keyword.

Now, from xfst, we can run

read lexc english.lexc

This will read the lexc file and compile into an automaton. The result should be the same as concatenating lex and morph above.

Before closing this short tutorial, we organize the lexicon slightly to make it more general.

Multichar_Symbols %<sg%> %<pl%>

LEXICON Root
    Noun;

LEXICON Noun
bat    NounNumber;
cat    NounNumber;
cow    NounNumber;
fox    NounNumber;

LEXICON NounNumber
%<sg%>          #;
%<pl%>          #;

The slight modification above allows “calling” other lexicons from Root, which makes adding other word classes, or in general “inflection classes”. For example, to add adjectives, we could list adjectives and their continuation classes on separate LEXICON entries, and add to the Root like Noun.

Exercise

Irregular plurals

Extend the analyzer above to correctly analyze singular/plural forms of

finch (plural finches)
sheep (only sheep should be accepted)
fish (both fish and fishes should be accepted as plural)
mouse (plural mice)
butterfly (butterflies)

Note that there are multiple ways of introducing this semi-irregular words.

Adjectives

Extend the analyzer to analyzer to analyze adjectives and their comparative and superlative forms. The analyzer should produce the following analyses:

fast - fast<Adj>
comfortable - comfortable<Adj>
faster - fast<Adj><Comp>
fastest - fast<Adj><Sup>
later - late<Adj><Comp>
easier - easy<Adj><Comp>
bigger - big<Adj><Comp>

But it should not allow multi-syllable adjectives. For example, it should not analyze comfortabler and comfortablest.

Note that we introduced a POS tag symbol (<Adj>) just after the lemma. For completeness you should also introduce a similar symbol, <N> for nouns.