Conditional Random Field (CRF)
4. Features, Feature Types, and Feature Generator
The iitb.Model package contains an implementation of the Feature as
well as the FeatureGenerator interface. It also contains an abstract
FeatureTypes. The package also provides several implemented feature types which
are described below.
The FeatureImpl class is an implementation of the Feature interface.
It encapsulates feature id, current label(yend), previous label(s)(ystart), and a value for the
feature. You would not require to implement this interface unless
your implementation of feature types class requires.
FeatureTypes is a factory class that defines a class of similar kind of features. It
is an abstract class that provides an advanced user of this package the facility
to implement new features. A user is just required to extend this class in order
to add new features.
Following are the important methods of the class:
Few tips regrading how to implement your own FeatureTypes class is given at the end of this section.
In order to make this package easy to use in a
common application, this package provides implementation of some of the commonly
used features such as EdgeFeatures, WordFeatures and the like. Typically, an
application will define a number of features for capturing different properties
of the input data.
A brief description of each of these feature types is given below.
- startScanFeaturesAt(DataSequence data, int prevPos, int pos)
This method is used to instruct the FeatureTypes to generate all the features for label of the token at the position pos in the data sequence data.
The previous position indicates the previous label assigned to the current sequence.
Since, this implementation of the CRF is a one order markov model, a feature can have dependancy on previous label of the current sequence only.
This method initiates generation of the features in given FeatureTypes.
This method is used to check if there are any more features for the current scan intitaited in startScanFeaturesAt().
- next(FeatureImpl f)
An important function of this class, it
basically contains the code for generating the next feature i.e. this function
will contain the code for capturing the characteristic that the user wants
to capture in the form of a feature.
object passed to this function as an argument, will be assigned the values of the
newly generated feature. The features are generated on-the-fly as and when
needed for efficiency reasons.
These are transition features, solely dependent upon current
label and previous 'y' values. The feature encodes transition information and
allows sequential information to be included in the model.
These features fire when the current label in consideration is a start state.
This feature checks whether the current label can be a start
state or not, and fires accordingly.
These features fire when the current label in consideration is an end state.
This feature checks whether the current label can be an end state or not, and
These features simply check whether the current token is
present in the dictionary in the particular state under consideration or not. The dictionary is
created on-the-fly from the training set.
These return log of the ratio of current word with the
label y to the total words with label y.
These features fire when the current token is not observed at
the time of training.
These features fire for those states (labels) where the
token does not appear, but it appears in other states with greater
than RARE_THRESOLD frequency.
A Feature is identified by its name and a unique id assigned to it. Public Class FeatureIdentifier
is used to give an identification to a feature.
In order to generate the WordFeatures, UnknownFeatures, and WordScoreFeatures described above, a dictionary
of words is maintained. This dictionary basically gives a mapping of words seen during training with
their respective class labels. The WordsInTrain class provides this functionality. A brief description of this class is as under.
This class encapsulates dictionary or vocabulary built on-the-fly from the
training set. Basically, it is a set of data-structures which are described
This is a HashMap indexed by word to HEntry; HEntry is inner class
which is described below.
This array is a two dimensional array which stores for each word a count
of the number of occurrences of that word in all the states.
This array stores for each state a total count of all the words appearing in that state as observed during
This is the total count of all words.
HEntry: This class encapsulates information about a single word's count.
It includes 'index' (which is the index of the word in above arrays -- cntsArray,
cntsOverAllWords), 'cnt' (which is the count of the word), and stateArray[numStates]
(which stores the count of the word in each state).
FeatureGenImpl is a feature generator class which implements the FeatureGenerator interface.
As described earlier, a feature generator works as a repository for feature types objects.
These feature types objects are added to the feature generator using the addFeatures() method of the FeatureGenerator interface.
If you have created new features by extending the FeatureTypes class, you will need to add the new feature types in the feature generator.
To add your own feature types, you can extend the FeatureGenImpl
and override the addFeatures() method.
But, if your application needs the functionality not provided by the current
FeatureGenImpl class, you may want to create a new feature generator class by implementing the FeatureGenerator interface.
Typically, you would have one of two types of features to implement.
- A State feature that outputs a boolean feature for a pattern/property for each label. An example is the UnknownFeatures class.
- A transition feature that is a function of not just the current state but also the previous state. An example is the EdgeFeatures class.
Here are a few tips on creating your own state features.
You add different FeatureTypes to the FeatureGenImpl class using the .add() routine as follows:
Create a derived class of FeatureTypes for each class of pattern that you would like to detect. You want a feature to be associated for each label for each pattern in this type. There is a convenient class FeatureTypesEachLabel for outputing a feature for each label. So, you just need to worry about outputing every applicable pattern for each data position given in startScanFeatureAt() on every call to next(). On a next() call you need to fill in the value of the feature (typically 1 but can be any real-value) and set the id and name of the feature. The state where it is fired (the yend field) will be set by the outer FeatureTypesEachLabel wrapper. The id field is the most tricky since you have to make sure to assign a unique id to each distinct pattern that you generate. The name of the feature will be the string pattern that will be useful for viewing later. However, you are responsible for assigning a unique id for each pattern. The id-s you assign do not have to be contiguous and they only have to be unique within each FeatureType. The function setIdentifier is used to assign the id and this function adds a few bits for each FeatureType to make them unique overall. You want the generation of ids to be fast.
FeatureGenImpl fgen = new FeatureGenImpl(modelSpecs, numLabels, false); // last false to disable default addition of features.
fgen.add(new FeatureTypesEachLabel(new MyPatternFeature1());
fgen.add(new FeatureTypesEachLabel(new MyPatternFeature2());
The FeatureTypes class also supports some advanced methods like train() to enable you to collect any statistics you want from the trianing data and use that to drive the features that you output.
Copyright © 2004 KReSIT, IIT Bombay. All rights reserved