As described in the previous section, a user of this distribution needs to implement various interfaces defined in the iitb.CRF package of the code. The most basic interfaces are the DataIter and DataSequence interfaces; a user must implement these in order to use this package for an application. Optionally, a user may need to implement advanced interfaces like Feature and FeatureGenerator. Note that this section only describes the interfaces; for the detailed API, refer to the javadoc for the package.
A basic training/test instance needs to support the DataSequence interface. This interface defines a mechanism to access an instance of sequential data encapsulated by the implementing class. You need to implement this interface for both the training and the test instances, although a single class can serve both. The implementing class encapsulates a token (x) sequence and the corresponding label (y) sequence. The interface defines methods to access the token (x) and the label (y) at a particular position i. An x value in the sequence can be any object (a string or an object of a complex class); the CRF does not interpret this x value, which allows it to be used in a large number of diverse applications. A y value is an integer. So, if your training set is labeled using some other type (say, string labels), you have to map those labels to integer labels and then create the DataSequence object using the integer labels. Example implementations of the DataSequence interface are DCTrainRecord and DataRecord, which can be found in the Segment and MaxentClassifier modules respectively.
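The label mapping described above can be sketched as follows. This is a minimal illustration, not code from the distribution: the DataSequence interface is declared locally so that the block compiles on its own, and it contains only the x and y accessors mentioned in the text plus an assumed length() method. LabelMap and TokenSequence are hypothetical names; consult the javadoc for the actual method signatures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for the package's DataSequence interface (assumption:
// only the accessors described in the text, plus a length() method).
interface DataSequence {
    int length();      // number of positions in the sequence
    Object x(int i);   // token at position i; never interpreted by the CRF
    int y(int i);      // integer label at position i
}

// Hypothetical helper: maps string labels to contiguous integer labels,
// numbered from 0 in order of first appearance.
class LabelMap {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    int idOf(String label) {
        Integer id = ids.get(label);
        if (id == null) {
            id = names.size();
            ids.put(label, id);
            names.add(label);
        }
        return id;
    }

    String nameOf(int id) { return names.get(id); }

    int size() { return names.size(); }
}

// Hypothetical instance class usable for both training and test data.
class TokenSequence implements DataSequence {
    private final Object[] tokens;
    private final int[] labels;

    TokenSequence(Object[] tokens, String[] stringLabels, LabelMap map) {
        if (tokens.length != stringLabels.length) {
            throw new IllegalArgumentException("tokens and labels differ in length");
        }
        this.tokens = tokens.clone();
        this.labels = new int[stringLabels.length];
        for (int i = 0; i < stringLabels.length; i++) {
            this.labels[i] = map.idOf(stringLabels[i]); // string label -> integer label
        }
    }

    public int length() { return tokens.length; }
    public Object x(int i) { return tokens[i]; }
    public int y(int i) { return labels[i]; }
}
```

Sharing one LabelMap across all instances keeps the integer labels consistent between the training and test sets, and nameOf() recovers the original string label after inference.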
DataIter is an iterator interface for data instances. A class implementing this interface should encapsulate a set of training instances. An object of this class is passed to the training routine of the CRF package, which uses the interface to iterate over the training set encapsulated in the object. Thus, each of your training instances should be read into an object of a class implementing the DataSequence interface, and an object of the class implementing the DataIter interface is then used to iterate over the created DataSequence objects.
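A list-backed iterator of this kind might look like the sketch below. Both interfaces are declared locally as stand-ins so the block is self-contained. The method names startScan(), hasNext(), and next() are assumptions chosen for illustration, and ArraySequence and ListDataIter are hypothetical classes; the javadoc gives the actual interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Local stand-ins for the package interfaces (method names are assumptions).
interface DataSequence {
    int length();
    Object x(int i);
    int y(int i);
}

interface DataIter {
    void startScan();      // rewind to the first instance
    boolean hasNext();     // true if another instance is available
    DataSequence next();   // return the next instance
}

// Hypothetical instance class holding tokens and integer labels.
class ArraySequence implements DataSequence {
    private final Object[] tokens;
    private final int[] labels;

    ArraySequence(Object[] tokens, int[] labels) {
        this.tokens = tokens.clone();
        this.labels = labels.clone();
    }

    public int length() { return tokens.length; }
    public Object x(int i) { return tokens[i]; }
    public int y(int i) { return labels[i]; }
}

// Hypothetical iterator over an in-memory training set.
class ListDataIter implements DataIter {
    private final List<DataSequence> sequences;
    private int cursor = 0;

    ListDataIter(List<DataSequence> sequences) {
        this.sequences = new ArrayList<>(sequences); // defensive copy
    }

    public void startScan() { cursor = 0; }

    public boolean hasNext() { return cursor < sequences.size(); }

    public DataSequence next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more training instances");
        }
        return sequences.get(cursor++);
    }
}
```

Because a training routine typically makes several passes over the data, the sketch provides a rewind method so that the same object can be scanned repeatedly.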
The Feature interface captures the basic attributes of a feature, an important one being the feature identifier. The FeatureIdentifier class describes an ID assigned to a feature. The abstract class FeatureTypes (described in detail later) is a base class for creating new features. Whenever a feature is generated, it is assigned an ID by calling the setFeatureIdentifier() method of this class. In addition, there is a notion of a global feature ID (referred to as the index from now on), which uniquely identifies each feature. The feature generator (see the FeatureGenerator interface) is responsible for assigning these indices, which are contiguous. If you create new features, you must ensure that the features generated by your application are assigned unique IDs. Besides the FeatureIdentifier, another important attribute is the value of the feature, which is a measure of the importance of the feature as assigned by the user. An implementation of the Feature interface, namely the FeatureImpl class, is present in the iitb.Model package; you can either use this implementation or implement a new class from scratch.
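The idea of contiguous global indices can be illustrated with a small registry that hands out the next free integer the first time it sees a feature identifier. This is only a sketch of the concept: SimpleFeature and FeatureIndexer are hypothetical classes, a plain string stands in for FeatureIdentifier, and none of this mirrors the actual FeatureImpl or feature generator code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical feature: an identifier, a global index, and a value.
class SimpleFeature {
    final String identifier; // stand-in for a FeatureIdentifier
    final int index;         // global, contiguous feature index
    final float value;       // importance assigned by the user

    SimpleFeature(String identifier, int index, float value) {
        this.identifier = identifier;
        this.index = index;
        this.value = value;
    }
}

// Hypothetical registry that assigns indices 0, 1, 2, ... in order of
// first appearance, so every distinct identifier gets a unique index.
class FeatureIndexer {
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    int indexFor(String identifier) {
        Integer idx = indexOf.get(identifier);
        if (idx == null) {
            idx = indexOf.size(); // next unused contiguous index
            indexOf.put(identifier, idx);
        }
        return idx;
    }

    SimpleFeature make(String identifier, float value) {
        return new SimpleFeature(identifier, indexFor(identifier), value);
    }

    int numFeatures() { return indexOf.size(); }
}
```

Requesting the same identifier twice returns the same index, so a feature that fires at many positions still maps to a single index.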
The FeatureGenerator interface is an aggregator over all the feature types.
The iitb.CRF package uses this interface to access all the features to be used for learning and inference.
Various methods in the interface are described below in brief.
Copyright © 2004 KReSIT, IIT Bombay. All rights reserved.