Finite-state morphology with lexc and xfst
This is a first hands-on introduction to developing a finite-state morphological analyzer with Xerox formalisms lexc and xfst. The examples below will use HFST, but they should work (almost without modification) using Foma. Both of these tools provide free implementation of lexc and xfst from Xerox, which was used for building morphological analyzers for many languages.
What to implement
Developing a wide-coverage, general-purpose morphological analyzer is a big undertaking. It often takes months of expert time to construct such analyzers. Here, we will only deal with the basics, with almost no complications at all.
The problem we are working on in this tutorial is to generate/analyze nouns and their plural forms in English. We will work with the following singular-plural pairs.
bat bats
cat cats
cow cows
fox foxes
There are many ways to structure your analyzers. Here, we will define three submodules.
- The root lexicon (
lex
), will only enumerate the root forms - The morphotactics (
morph
) component will add the analysis symbols for the singular and plural we are interested in - The morphophonological (
phon
) rules will transform the analysis form to the surface form.
Our final generator will concatenate lex
and morph
and compose it with phon
. We will keep lex
and morph
as FSA
(FSTs with identity transitions), only using non-identity
transductions in the phon
component.
The idea of building a “generator” rather than an analyzer is very common. Actual components, and how they are combined may vary considerably.
XFST command line
Xerox xfst environment provides a command-line for experimenting
with FSTs built using xfst/lexc.
You can start typing hfst-xfst
on the command line (on Unix-like systems).
Once started there are a number of commands to interact with the system.
We will use only a few, but you can get a list with brief descriptions
using the help
command.
Defining the lexicon
We will start with defining our lexicon, starting with a single “root” form.
regex cat;
view net
The first command regex
defines a regular expression,
compiles it as the current automaton in use.
After the regex
command you should see the number on the command line prompt
in brackets increased (to 1).
To indicate that there is an automaton on the xfst “stack”.
If the support programs are installed (mainly GraphViz and an image viewer), the system will present the following graph:
Although this may look fine at first look, this takes the whole word “cat” as an atomic symbol. We will want to know certain aspects of the words’ orthography. So, we want each character as symbols of our automata.
regex c a t;
This forms cat as concatenation of three letters, and the corresponding automata is:
Another way of obtaining the same result is placing
the words that we want to break into their characters in curly braces.
So the effect of regex {cat}
would be the same.
To add the other words, we will, for now, use the union operator |
to take the union of single-word automata like the one above.
For all the words in our list,
regex {bat}|{cat}|{cow}|{fox};
Should display an automaton like .
Note that shared paths are represented only once.
Morphotactics
Now our root lexicon is ready. We will define it as a “variable” where we can make use of it later. The command
define lex
should define a automata lex
with the current regular expression.
Note that xfst environment removes this from its automata stack
(the number on the command line should decrease by 1).
We can now turn our attention to define our simple morphotactics.
We want to be able to attach two symbols representing the number
feature of the nouns. We will use <sg>
and <pl>
to represent
singular and plural respectively.
Placing analysis symbols in angle brackets is common,
another popular approach is prefixing them with a +
.
The morphotactics automaton is really simple.
We will just take the union of automata defining the above symbols:
define morph %<sg%> | %<pl%>;
The above expression defines the morph
variable directly.
Since the angle brackets are special symbols for xfst,
we escape them using %
.
Another option would be using double quotes like "<sg>"
.
Note that we define <sg>
and <pl>
as atomic symbols,
there is no need for inspecting their individual characters.
If you want to see the resulting automata,
you can use:
regex morph;
Not very interesting, but here is the result for completeness.
.
Now we have both our lexicon and the morphotactics module, we can concatenate them:
regex lex morph;
which, not surprisingly, gives us the following automaton:
Morpho-phonological alternations
The final module we are interested in maps these analysis symbols
to their surface realizations.
The singular case is easy:
we want to map the symbol <sg>
to the empty string.
For plurals, we have a slight complication.
Almost all words require the suffix -s,
except for fox, for which we need -es.
The following implements this as parallel application
of three rules:
define phon "<sg>" -> 0 ,, "<pl>" -> s ,, [..] -> e || x _ "<pl>";
The first rule maps <sg>
to the empty string,
which is represented with 0
in xfst.
The second maps <pl>
to s
, unconditionally.
The third rule uses a special syntax, [..] -> s
, for single insertion,
that applies when the left context is an x
and the right context is <pl>
.
This results in the following FST:
Here, the empty string is represented with 00
,
and the symbol ??
represents any symbol not listed.
You should study this to convince yourself
that it does what it is supposed to do.
Putting all together and testing
Now we can put all the modules together according to our initial plan:
regex [ lex morph ] .o. phon;
We already know that lack of an operator between
two regular expressions is interpreted as concatenation.
The operator .o.
is the composition operator.
The square brackets, unlike in most regex implementations,
mean grouping in xfst.
Here, they are used only for demonstration,
concatenation has higher priority than composition.
So, the result without the brackets is the same.
The resulting network is:
Now our analyzer/generator is complete. We can test the result in the xfst environment. Remember that the above was defined as a generator.
hfst[1]: down cat<sg>
cat
hfst[1]: down cat<pl>
cats
hfst[1]: down fox<sg>
fox
hfst[1]: down fox<pl>
foxes
hfst[1]: down pig<sg>
???
The listing also includes the xfst prompt, we only type the commands
starting with down
, which is a shortcut for apply down
command
running the FST with the analysis string
(upper string in Xerox terminology)
and returning the surface string (lower string).
The analyzer seems to be doing the generation correctly,
and the last item shows a failure, which happens because of the
missing root word pig in the lexicon.
The original xfst implementation (and Foma) also offers an apply up
command.
However, this is not implemented in HFST.
If we want to run our FST in analysis mode, we need to invert it.
Here are a few examples with the analysis:
hfst[1]: invert
? bytes. 11 states, 15 arcs, ? paths
hfst[1]: down cat
cat<sg>
hfst[1]: down cats
cat<pl>
hfst[1]: down fox
fox<sg>
hfst[1]: down foxes
fox<pl>
hfst[1]: down cates
???
hfst[1]: down foxs
???
Note that the analyzer analyzes the correct forms, while rejecting some obvious incorrect forms.
Using lexc
Although in principle we can do everything with xfst,
lexicon and morphotactics can be defined and processed better using lexc.
To do so, we will create a file english.lexc
with the following content.
Multichar_Symbols %<sg%> %<pl%>
LEXICON Root
bat NounNumber;
cat NounNumber;
cow NounNumber;
fox NounNumber;
LEXICON NounNumber
%<sg%> #;
%<pl%> #;
A lexc file is organized as a set of continuation lexicons.
The first lexicon is conventionally called Root
.
Each entry in a lexicon contains a surface part,
and possible continuation lexicon.
The above simple lexicon tells that all of the forms can be “suffixed”
with what is specified in the NounNumber
lexicon.
The special continuation lexicon #
means the word ends here,
it is an accepted form
(of course, we will compose this with our morphophonological alternations).
Unlike in xfst,
we do not need curly braces to expand the words into individual characters.
lexc will do this automatically.
To prevent it, we need to list the symbols with more than one character
using the Multichar_Symbols
keyword.
Now, from xfst, we can run
read lexc english.lexc
This will read the lexc file and compile into an automaton.
The result should be the same as concatenating lex
and morph
above.
Before closing this short tutorial, we organize the lexicon slightly to make it more general.
Multichar_Symbols %<sg%> %<pl%>
LEXICON Root
Noun;
LEXICON Noun
bat NounNumber;
cat NounNumber;
cow NounNumber;
fox NounNumber;
LEXICON NounNumber
%<sg%> #;
%<pl%> #;
The slight modification above allows “calling” other lexicons from Root,
which makes adding other word classes, or in general “inflection classes”.
For example,
to add adjectives, we could list adjectives and their continuation classes
on separate LEXICON
entries, and add to the Root
like Noun
.
Exercise
Irregular plurals
Extend the analyzer above to correctly analyze singular/plural forms of
- finch (plural finches)
- sheep (only sheep should be accepted)
- fish (both fish and fishes should be accepted as plural)
- mouse (plural mice)
- butterfly (butterflies)
Note that there are multiple ways of introducing this semi-irregular words.
Adjectives
Extend the analyzer to analyzer to analyze adjectives and their comparative and superlative forms. The analyzer should produce the following analyses:
- fast -
fast<Adj>
- comfortable -
comfortable<Adj>
- faster -
fast<Adj><Comp>
- fastest -
fast<Adj><Sup>
- later -
late<Adj><Comp>
- easier -
easy<Adj><Comp>
- bigger -
big<Adj><Comp>
But it should not allow multi-syllable adjectives.
For example, it should not analyze comfortabler
and comfortablest
.
Note that we introduced a POS tag symbol (<Adj>
) just after the lemma.
For completeness you should also introduce a similar symbol, <N>
for nouns.