Apache OpenNLP - Document Classification

Apache OpenNLP is a library for natural language processing using machine learning. For an introduction to Apache OpenNLP and its license details, refer to our previous article.
In this article, we will explore document (text) classification: we train a model with sample data and then run it to get results. We will first train a model with the default training parameters and then train a second model with the Naive Bayes algorithm.
Document Categorizer
We are going to classify documents by their software license. To do that, we first need to prepare a training file that contains text related to each license. For our example, we use just two license categories: BSD and GNU. The model is created by tokenizing the training text and building feature vectors from the tokens; the cutoff parameter is set to 0 so that no feature is discarded for being rare. The quality and content of the training data are important, because OpenNLP categorizes documents based on it, and better training data reduces false positives.
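To make the role of the cutoff more concrete, here is a minimal plain-Java sketch that does not use OpenNLP. It counts bag-of-words features over a few hypothetical token lists and keeps a feature only when its count is at least the cutoff. The class name, the sample tokens, and the "count >= cutoff" rule are assumptions made for illustration; OpenNLP's internal indexing may differ in detail.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: counts bag-of-words features and applies a cutoff. */
class CutoffSketch {

    /** Counts how often each token occurs across all samples. */
    static Map<String, Integer> countFeatures(List<String[]> samples) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] tokens : samples) {
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps only the features seen at least `cutoff` times (assumed rule). */
    static Map<String, Integer> applyCutoff(Map<String, Integer> counts, int cutoff) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        counts.forEach((feature, count) -> {
            if (count >= cutoff) {
                kept.put(feature, count);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical, already tokenized training samples.
        List<String[]> samples = Arrays.asList(
                new String[] {"redistribution", "in", "binary", "form"},
                new String[] {"free", "software", "foundation"},
                new String[] {"redistribution", "of", "source", "code"});
        Map<String, Integer> counts = countFeatures(samples);
        // With cutoff 0 every feature survives; with cutoff 2 only repeated ones do.
        System.out.println("cutoff 0 keeps " + applyCutoff(counts, 0).size() + " features");
        System.out.println("cutoff 2 keeps " + applyCutoff(counts, 2).keySet());
    }
}
```

With so few samples almost every word occurs only once, which is why a higher cutoff would throw away nearly all features.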
The training file is passed to MarkableFileInputStreamFactory, which feeds a line stream that is wrapped in a DocumentSampleStream. The sample stream is passed to the DocumentCategorizerME class, which trains the model; with the default training parameters it runs 100 iterations to estimate the likelihood of each category.
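The contents of licensecategory.txt are not shown in this article. To our understanding, the document sample stream expects one sample per line, with the category as the first whitespace-separated token and the document text after it. The plain-Java sketch below mimics that split on two hypothetical training lines; the class name and the sample lines are made up for illustration.

```java
import java.util.Arrays;

/** Illustrative only: splits a training line into category and text tokens. */
class TrainingLineSketch {

    /** Splits one training line on whitespace. */
    static String[] split(String line) {
        return line.trim().split("\\s+");
    }

    /** The first token is treated as the category label. */
    static String category(String line) {
        return split(line)[0];
    }

    /** The remaining tokens are treated as the document text. */
    static String[] text(String line) {
        String[] parts = split(line);
        return Arrays.copyOfRange(parts, 1, parts.length);
    }

    public static void main(String[] args) {
        // Hypothetical training lines, not taken from the real training file.
        String[] lines = {
            "BSD Redistribution and use in source and binary forms are permitted",
            "GNU This program is free software under the General Public License"
        };
        for (String line : lines) {
            System.out.println(category(line) + " -> " + Arrays.toString(text(line)));
        }
    }
}
```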
Once trained, it returns the document categorizer model (DoccatModel). The model is serialized to a binary file. Saving the trained model is helpful because later we can load the pre-trained model directly instead of training again, or retrain it with a new data set.
In our example, the document to be classified is read from the console. The user types the content, and the trained model identifies its software license category. Ideally, in production use, new documents would come from different data sources.
Tokenization: This is the process of breaking a sentence into words based on a delimiter, which is mostly whitespace. The program below tokenizes the input text before classifying it. It uses the tokenizer model "en-token.bin"; to generate "en-token.bin", refer to our previous article.
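As a rough illustration of what tokenization produces, the sketch below splits a sentence with a regular expression in plain Java. It is not the learned "en-token.bin" model and will not match it on every input; the class name and the pattern are assumptions chosen only to show words and punctuation becoming separate tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a rule-based stand-in for a trained tokenizer. */
class SimpleTokenizeSketch {

    // A token is either a run of letters/digits or a single non-space symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "C Shell is a 2BSD release of the Berkeley Software Distribution (BSD).";
        for (String token : tokenize(sentence)) {
            System.out.println("Tokens: " + token);
        }
    }
}
```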
Program for training and classifying the license category:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.model.ModelUtil;

public class OpenNLPDocumentCategorizerExample {

    public static void main(String[] args) throws Exception {
        /* Read human-readable training data and train a model. */
        /* The file holds sample sentences, each labeled with its category. */
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("licensecategory.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        /*
          Use a cutoff of zero because we train on very few samples.
          With few samples, each feature/word has a small count,
          so it would not reach a higher cutoff.
        */
        TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
        params.put(TrainingParameters.CUTOFF_PARAM, 0);
        DoccatFactory factory = new DoccatFactory(new FeatureGenerator[] { new BagOfWordsFeatureGenerator() });

        /* Train a model with the classifications from the file above. */
        DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);

        /*
          Serialize the model to a file so that next time we do not have to train
          again. Next time we can load this file directly into a model.
        */
        model.serialize(new File("documentcategorizer.bin"));

        /* Load the serialized model back from the file and categorize console input. */
        try (InputStream modelIn = new FileInputStream("documentcategorizer.bin");
                Scanner scanner = new Scanner(System.in)) {
            /* Initialize the document categorizer once, from the deserialized model */
            DocumentCategorizerME myCategorizer = new DocumentCategorizerME(new DoccatModel(modelIn));
            while (true) {
                /* Get inputs in loop */
                System.out.println("Enter a sentence:");
                /* Get the probabilities of all outcomes, i.e. BSD and GNU */
                double[] probabilitiesOfOutcomes = myCategorizer.categorize(getTokens(scanner.nextLine()));
                /* Get the name of the category with the highest probability */
                String category = myCategorizer.getBestCategory(probabilitiesOfOutcomes);
                System.out.println("Category: " + category);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Tokenize a sentence into tokens.
     *
     * @param sentence the raw input text
     * @return the tokens, or an empty array if tokenization fails
     */
    private static String[] getTokens(String sentence) {
        /*
          Use the tokenizer model that was created in the earlier tokenizer tutorial.
          It is read from the classpath at /models/en-token.bin.
        */
        try (InputStream modelIn =
                OpenNLPDocumentCategorizerExample.class.getResourceAsStream("/models/en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(modelIn));
            String[] tokens = tokenizer.tokenize(sentence);
            for (String t : tokens) {
                System.out.println("Tokens: " + t);
            }
            return tokens;
        } catch (Exception e) {
            e.printStackTrace();
        }
        /* Return an empty array rather than null so that the caller does not fail */
        return new String[0];
    }
}
Output:
Enter a sentence:
C Shell is a 2BSD release of the Berkeley Software Distribution (BSD).
Tokens: C
Tokens: Shell
Tokens: is
Tokens: a
Tokens: 2BSD
Tokens: release
Tokens: of
Tokens: the
Tokens: Berkeley
Tokens: Software
Tokens: Distribution
Tokens: (
Tokens: BSD
Tokens: )
Tokens: .
Category: BSD
Enter a sentence:
Naive Bayes Classifier
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is a probabilistic classifier and is well suited to supervised learning. An advantage of Naive Bayes is that it requires only a small amount of training data; it classifies a document by choosing the category with the maximum likelihood.
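The idea can be sketched in a few lines of plain Java, independent of OpenNLP. The class below is a minimal multinomial Naive Bayes with add-one smoothing: it scores each category as the log prior plus the sum of log word likelihoods and returns the highest-scoring category. The class name, the training words, and the choice of smoothing are assumptions for illustration; OpenNLP's own Naive Bayes trainer may compute its probabilities differently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a tiny multinomial Naive Bayes with add-one smoothing. */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerCategory = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labeled document (given as words) to the counts. */
    void train(String category, String... words) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
            wordsPerCategory.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** log P(category) + sum of log P(word | category), with add-one smoothing. */
    double logScore(String category, String... words) {
        double score = Math.log((double) docsPerCategory.get(category) / totalDocs);
        Map<String, Integer> counts = wordCounts.get(category);
        int total = wordsPerCategory.getOrDefault(category, 0);
        for (String word : words) {
            int count = counts.getOrDefault(word, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the trained category with the highest score. */
    String classify(String... words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            double score = logScore(category, words);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        // Hypothetical training data, one short document per category.
        nb.train("BSD", "redistribution", "source", "binary");
        nb.train("GNU", "free", "software", "foundation");
        System.out.println(nb.classify("binary", "redistribution")); // prints "BSD"
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps an unseen word from forcing a category's probability to zero.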
Movie Classification PreBuilt Model
OpenNLP's TrainingParameters class lets us choose the training algorithm. Here we select the Naive Bayes algorithm, set the cutoff to 0 so that every feature is kept, and set the number of iterations to 10. The training file (en-movie-category.train, available in the GitHub repository) contains the genre category followed by the movie description for each sample. The file is large, which gives the model more examples per genre and so improves the chance of predicting the right category. The training file is read as a plain-text line stream and wrapped in a document sample stream.
The DocumentCategorizerME.train method takes the language, the sample stream built from the training file, the training parameters, and a DoccatFactory object. The factory is used to create the new document categorizer model, which the train method returns.
The model is serialized to a temporary file as a binary (.bin) file. This file can later be loaded into a document categorizer, which computes the probability of each category for the given text. The category with the maximum probability is chosen as the best category and displayed.
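Picking the best category amounts to finding the largest entry in the probability array and returning the category at the same index. The plain-Java sketch below shows that step on its own; the class name, genre names, and probabilities are hypothetical values, not output from the trained model.

```java
/** Illustrative only: selects the category with the highest probability. */
class BestCategorySketch {

    /** Returns the category whose probability is the largest. */
    static String bestCategory(String[] categories, double[] probabilities) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        // Hypothetical genres and probabilities, index-aligned.
        String[] categories = {"Romantic", "Thriller", "Horror"};
        double[] probabilities = {0.7, 0.2, 0.1};
        System.out.println(bestCategory(categories, probabilities) + " : is the predicted category.");
    }
}
```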
package com.nagappans.apachenlp;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizer;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.ml.AbstractTrainer;
import opennlp.tools.ml.naivebayes.NaiveBayesTrainer;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

/**
 * OpenNLP version 1.7.2
 * Training of Document Categorizer using Naive Bayes Algorithm in OpenNLP for Document Classification
 */
public class DocClassificationNaiveBayesTrainer {

    public static void main(String[] args) throws Exception {
        try {
            /* read the training data */
            InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
                    new File(DocClassificationNaiveBayesTrainer.class.getResource(
                            "/models/en-movie-category" + ".train").getFile()));
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

            /* define the training parameters */
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, 10 + "");
            params.put(TrainingParameters.CUTOFF_PARAM, 0 + "");
            params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

            /* create a model from training data */
            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
            System.out.println("Model is successfully trained.");

            /* save the model to a temporary file; createTempFile takes a name prefix and a suffix */
            File trainedFile = File.createTempFile("en-movie-classifier-naive-bayes", ".bin");
            try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(trainedFile))) {
                model.serialize(modelOut);
            }
            System.out.println("Trained Model is saved locally at : " + trainedFile.getAbsolutePath());

            /* Test the model by subjecting it to prediction */
            DocumentCategorizer docCategorizer = new DocumentCategorizerME(model);
            String[] docWords = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold".replaceAll("[^A-Za-z]", " ").split("\\s+");
            double[] aProbs = docCategorizer.categorize(docWords);

            /* print the probabilities of the categories */
            System.out.println("---------------------------------\nCategory : Probability\n---------------------------------");
            for (int i = 0; i < docCategorizer.getNumberOfCategories(); i++) {
                System.out.println(docCategorizer.getCategory(i) + " : " + aProbs[i]);
            }
            System.out.println("---------------------------------");
            System.out.println("\n" + docCategorizer.getBestCategory(aProbs) + " : is the predicted category for the given sentence.");
        } catch (IOException e) {
            System.out.println("An exception in reading the training file. Please check.");
            e.printStackTrace();
        }
    }
}
Reference:
Source code - https://github.com/nagappan080810/apache_opennlp_workouts.git