Pattern-based text classifiers

This thesis proposal concerns the use of the top-k patterns extracted (with various methods) from binary data (representing a corpus of documents with associated class labels) as a feature extraction technique, applied before building an accurate text classifier. See this paper about feature extraction.

More specifically, we want to train a multiclass classifier, using an ensemble of binary ones (for an example, look at the paper).

So, for each class c in the set C of classes, we want to extract patterns from the training data set separately from the records with class c and from the records with any other class label (~c), and then use these patterns to extract the features from the training/test records. This extraction must be done for every class c, using different patterns/features for each.
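The per-class split described above can be sketched as follows. This is only an illustration under stated assumptions: the proposal does not fix a mining method, so `top_k_patterns` is a hypothetical, naive stand-in that ranks small itemsets by support, and the function names, the `max_len` parameter, and the toy records are invented for the example.

```python
from collections import Counter
from itertools import combinations


def top_k_patterns(transactions, k, max_len=2):
    """Naive stand-in miner (an assumption, not the proposal's method):
    return the k most frequent itemsets with at most max_len items."""
    support = Counter()
    for items in transactions:
        for size in range(1, max_len + 1):
            # Enumerate every itemset of this size contained in the transaction.
            for itemset in combinations(sorted(items), size):
                support[frozenset(itemset)] += 1
    # Highest support first; ties are broken by the sorted item names.
    ranked = sorted(support.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    return [pattern for pattern, _ in ranked[:k]]


def patterns_for_class(records, c, k):
    """Return (P, P_prime): patterns mined from the records labelled c
    and from the records with any other label (~c)."""
    positives = [items for items, label in records if label == c]
    negatives = [items for items, label in records if label != c]
    return top_k_patterns(positives, k), top_k_patterns(negatives, k)


# Toy binary data: each record is (set of terms, class label).
records = [
    ({"goal", "match", "team"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "party"}, "politics"),
    ({"vote", "party", "law"}, "politics"),
]
P, P_prime = patterns_for_class(records, "sport", k=3)
```

Calling `patterns_for_class` once per class in C gives each binary classifier its own pair of pattern sets. A real top-k pattern miner would replace `top_k_patterns`.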

We call P and P' the global sets of patterns extracted for c and ~c, respectively.

Every transaction in the training set must be mapped onto this global set of patterns, with one binary feature per approximate pattern indicating its presence or absence in the transaction. We can then train the binary classifier for class c.
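A minimal sketch of this mapping, assuming that a pattern is "present" when all of its items occur in the transaction. That exact-containment test is an assumption: approximate patterns may call for a looser matching rule. The function name and the toy data are hypothetical.

```python
def to_binary_features(transaction, patterns):
    """One 0/1 feature per pattern: 1 if every item of the pattern
    occurs in the transaction (exact containment, an assumption)."""
    items = set(transaction)
    return [1 if set(pattern) <= items else 0 for pattern in patterns]


# Hypothetical global pattern set for class c = "sport".
patterns = [{"goal"}, {"goal", "team"}, {"vote"}, {"vote", "party"}]

training = [
    ({"goal", "match", "team"}, "sport"),
    ({"vote", "law"}, "politics"),
]

c = "sport"
# Feature matrix and binary targets (1 for class c, 0 for ~c).
X = [to_binary_features(items, patterns) for items, _ in training]
y = [1 if label == c else 0 for _, label in training]
```

The pair `X`, `y` is what a binary classifier for class c would be trained on. Any standard learner could be used here, since the proposal does not prescribe one.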

Note that each test record must be mapped into the different feature/pattern space associated with each class before it is fed to the corresponding binary classifier in the ensemble.

For the test corpora, you can use the datasets at http://www.cs.umb.edu/~smimarog/textmining/datasets/index.html.

We also need to compare our method with some baselines, i.e. text classifiers proposed in the literature.
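The per-class mapping of a test record can be sketched as below. Several things here are assumptions made for illustration: the `ensemble` structure, the `scorer` callables standing in for trained binary classifiers, and the rule of picking the class with the highest score, which the proposal does not specify.

```python
def to_binary_features(transaction, patterns):
    """1 if every item of the pattern occurs in the transaction, else 0."""
    items = set(transaction)
    return [1 if set(pattern) <= items else 0 for pattern in patterns]


def predict(record, ensemble):
    """ensemble maps each class label to (patterns, scorer).
    The record is re-encoded in each class's own pattern space
    before being scored by that class's binary classifier."""
    scores = {}
    for c, (patterns, scorer) in ensemble.items():
        features = to_binary_features(record, patterns)
        scores[c] = scorer(features)
    # Assumed combination rule: choose the class with the highest score.
    return max(scores, key=scores.get)


# Hypothetical ensemble: the lambdas stand in for trained classifiers.
ensemble = {
    "sport": ([{"goal"}, {"vote"}], lambda f: f[0] - f[1]),
    "politics": ([{"vote"}, {"goal"}], lambda f: f[0] - f[1]),
}
label = predict({"goal", "team"}, ensemble)
```

The same record yields a different feature vector for each class, which is why the mapping has to be repeated once per binary classifier.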