Use Weka in your Java code

The following sections explain how to use Weka's classifiers, clusterers, and filters in your own Java code. A link to an example class can be found at the end of this page, under the Links section. The classifiers and filters always list their options in the Javadoc API specification (stable and developer versions).

A comprehensive source of information is the chapter Using the API of the Weka manual.

Packages#

Initialization#

In order to get your installed Weka packages initialized (and the internal MTJ and ARPACK libraries added to the classpath), call the loadPackages method of the weka.core.WekaPackageManager class before you instantiate any classifiers, clusterers, filters, etc.:

```java
import weka.core.WekaPackageManager;
...
WekaPackageManager.loadPackages(false);
```
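A minimal sketch of an application entry point that follows this rule (the Main class name is just an example):

```java
import weka.core.WekaPackageManager;

public class Main {
  public static void main(String[] args) throws Exception {
    // initialize installed packages (and MTJ/ARPACK) before any scheme is instantiated
    WekaPackageManager.loadPackages(false);
    // ... instantiate and use classifiers, clusterers, filters, etc. from here on
  }
}
```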

Management#

You can list all packages using:

```java
for (Package p : WekaPackageManager.getAllPackages())
  System.out.println("- " + p.getName() + "/" + p.getPackageMetaData().get("Version"));
```

Currently installed packages using:

```java
for (Package p : WekaPackageManager.getInstalledPackages())
  System.out.println("- " + p.getName() + "/" + p.getPackageMetaData().get("Version"));
```

And packages that are available for installation with:

```java
for (Package p : WekaPackageManager.getAvailablePackages())
  System.out.println("- " + p.getName() + "/" + p.getPackageMetaData().get("Version"));
```

The following installs the latest version of the alternatingModelTrees package (passing null as the version parameter):

```java
WekaPackageManager.installPackageFromRepository("alternatingModelTrees", null, System.out);
```

And this call uninstalls the package:

```java
WekaPackageManager.uninstallPackage("alternatingModelTrees", true, System.out);
```

You can also install a package directly from a URL, e.g.:

```java
java.net.URL url = new java.net.URL("https://sourceforge.net/projects/weka/files/weka-packages/DilcaDistance1.0.2.zip/download");
WekaPackageManager.installPackageFromURL(url, System.out);
```

Instantiation#

For instantiating classes from packages, you can use the forName method of the weka.core.Utils class.

The following example shows how to instantiate the (hypothetical) classifier com.example.FunkyClassifier , which is available from a Weka package that is currently installed:

```java
import weka.core.Utils;
import weka.classifiers.Classifier;
...
Classifier cls = (Classifier) Utils.forName(
    Classifier.class, "com.example.FunkyClassifier", new String[0]);
```
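The third argument is handed to the scheme's setOptions method if it implements OptionHandler; e.g., a sketch with a hypothetical -U flag:

```java
// hypothetical flag -U, passed on to the classifier's setOptions
Classifier cls = (Classifier) Utils.forName(
    Classifier.class, "com.example.FunkyClassifier", new String[]{"-U"});
```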

Instances#

Datasets#

The DataSource class is not limited to ARFF files. It can also read CSV files and other formats (basically all file formats that Weka can import via its converters; it uses the file extension to determine the associated loader).

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
...
DataSource source = new DataSource("/some/where/data.arff");
Instances data = source.getDataSet();
// setting class attribute if the data format does not provide this information
// For example, the XRFF format saves the class attribute information as well
if (data.classIndex() == -1)
  data.setClassIndex(data.numAttributes() - 1);
```
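For instance, loading a (hypothetical) CSV file works exactly the same way; the loader is chosen from the file extension:

```java
DataSource source = new DataSource("/some/where/data.csv"); // hypothetical CSV file
Instances data = source.getDataSet();
if (data.classIndex() == -1)
  data.setClassIndex(data.numAttributes() - 1);
```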

Database#

Reading from databases is slightly more complicated, but still very easy. First, you have to modify your DatabaseUtils.props file to reflect your database connection. Suppose you want to connect to a MySQL server running on the local machine on the default port 3306. The MySQL JDBC driver is called Connector/J (the driver class is org.gjt.mm.mysql.Driver). The database where your target data resides is called some_database. Since you're only reading, you can use the default user nobody without a password. Your props file must contain the following lines:

```
jdbcDriver=org.gjt.mm.mysql.Driver
jdbcURL=jdbc:mysql://localhost:3306/some_database
```

Second, your Java code needs to look like this to load the data from the database:

```java
import weka.core.Instances;
import weka.experiment.InstanceQuery;
...
InstanceQuery query = new InstanceQuery();
query.setUsername("nobody");
query.setPassword("");
query.setQuery("select * from whatsoever");
// You can declare that your data set is sparse
// query.setSparseData(true);
Instances data = query.retrieveInstances();
```

Notes:

* Don't forget to add the JDBC driver to your CLASSPATH.
* For MS Access, you must use the JDBC-ODBC bridge that is part of a JDK. The Windows databases article explains how to do this.
* InstanceQuery automatically converts VARCHAR database columns to NOMINAL attributes, and long TEXT database columns to STRING attributes. So if you use InstanceQuery to do text mining against text that appears in a VARCHAR column, Weka will regard such text as nominal values, and thus fail to tokenize and mine it. Use the NominalToString or StringToNominal filter (package weka.filters.unsupervised.attribute) to convert the attributes to the correct type (see the sketch after this list).
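For instance, a sketch of converting a nominal attribute back to a string attribute with NominalToString (assuming the text ended up in the first column; the -C option sets the attribute range):

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;
...
NominalToString nts = new NominalToString();
nts.setOptions(new String[]{"-C", "1"}); // hypothetical: the text attribute is in column 1
nts.setInputFormat(data);                // data is the Instances object retrieved above
Instances converted = Filter.useFilter(data, nts);
```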

Option handling#

Weka schemes that implement the weka.core.OptionHandler interface, such as classifiers, clusterers, and filters, offer the following methods for setting and retrieving options:

* `void setOptions(String[] options)`
* `String[] getOptions()`
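For example, to read back the options a scheme is currently configured with, you can retrieve the array and join it into a single command-line string (a minimal sketch using J48):

```java
import weka.classifiers.trees.J48;
import weka.core.Utils;
...
J48 tree = new J48();
String[] current = tree.getOptions();           // retrieve the currently set options
System.out.println(Utils.joinOptions(current)); // e.g. "-C 0.25 -M 2"
```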

There are several ways of setting the options:

* Manually creating the `String` array:

```java
String[] options = new String[2];
options[0] = "-R";
options[1] = "1";
```

* Using the `splitOptions` method of `weka.core.Utils` to turn a single command-line string into an array:

```java
String[] options = weka.core.Utils.splitOptions("-R 1");
```

* Using the `OptionsToCode.java` tool to turn a command line into code, e.g., the following call

```
java OptionsToCode weka.classifiers.functions.SMO
```

will generate output like this:

```java
// create new instance of scheme
weka.classifiers.functions.SMO scheme = new weka.classifiers.functions.SMO();
// set options
scheme.setOptions(weka.core.Utils.splitOptions("-C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0\""));
```

Also, the OptionTree.java tool allows you to view a nested options string (e.g., one used at the command line) as a tree. This can help you spot nesting errors.

Filter#

A filter has two different properties:

* supervised or unsupervised — i.e., whether it takes the class attribute into account or not
* attribute- or instance-based — i.e., whether it operates on columns (attributes) or rows (instances) of the dataset

Most filters implement the OptionHandler interface, which means you can set the options via a String array rather than setting each one manually via set-methods. For example, if you want to remove the first attribute of a dataset, you need this filter

```
weka.filters.unsupervised.attribute.Remove
```

with this option: `-R 1`

If you have an Instances object called data, you can create and apply the filter like this:

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R";            // "range"
options[1] = "1";             // first attribute
Remove remove = new Remove(); // new instance of filter
remove.setOptions(options);   // set options
remove.setInputFormat(data);  // inform filter about dataset **AFTER** setting options
Instances newData = Filter.useFilter(data, remove); // apply filter
```

Filtering on-the-fly#

The FilteredClassifier meta-classifier is an easy way of filtering data on the fly. It removes the necessity of filtering the data before the classifier can be trained. Also, the data need not be passed through the trained filter again at prediction time. The following is an example of using this meta-classifier with the Remove filter and J48 for getting rid of a numeric ID attribute in the data:

```java
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.Remove;
...
Instances train = ... // from somewhere
Instances test = ...  // from somewhere
// filter
Remove rm = new Remove();
rm.setAttributeIndices("1"); // remove 1st attribute
// classifier
J48 j48 = new J48();
j48.setUnpruned(true); // using an unpruned J48
// meta-classifier
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(rm);
fc.setClassifier(j48);
// train and make predictions
fc.buildClassifier(train);
for (int i = 0; i < test.numInstances(); i++) {
  double pred = fc.classifyInstance(test.instance(i));
  System.out.print("ID: " + test.instance(i).value(0));
  System.out.print(", actual: " + test.classAttribute().value((int) test.instance(i).classValue()));
  System.out.println(", predicted: " + test.classAttribute().value((int) pred));
}
```

Other handy meta-schemes in Weka:

* weka.associations.FilteredAssociator
* weka.attributeSelection.FilteredAttributeEval / weka.attributeSelection.FilteredSubsetEval

Batch filtering#

On the command line, you can enable a second input/output pair (via -r and -s) with the -b option, in order to process the second file with the same filter setup as the first one. This is necessary if you're using attribute selection or standardization; otherwise you end up with incompatible datasets. In code, this is fairly easy to achieve: you initialize the filter only once, via the setInputFormat(Instances) method with the training set, and then apply it to the training set and the test set in turn. The following example shows how to apply the Standardize filter to a train and a test set.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;
...
Instances train = ... // from somewhere
Instances test = ...  // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train); // initialize the filter once, with the training set
Instances newTrain = Filter.useFilter(train, filter); // configures the filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter);   // create new test set
```

Calling conventions#

The setInputFormat(Instances) method always has to be the last call before the filter is applied, e.g., with Filter.useFilter(Instances, Filter). Why? First, it is the convention for using filters; second, lots of filters generate the header of the output format in the setInputFormat(Instances) method based on the currently set options (setting options after this call no longer has any effect).
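As a minimal sketch of the resulting calling order (reusing the Remove filter from above):

```java
Remove remove = new Remove();
remove.setOptions(new String[]{"-R", "1"});         // 1. set all options first
remove.setInputFormat(data);                        // 2. then let the filter determine the output format
Instances newData = Filter.useFilter(data, remove); // 3. finally apply the filter
```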

Classification#

The necessary classes can be found in this package:

```
weka.classifiers
```

Building a Classifier#

Batch#

A Weka classifier is rather simple to train on a given dataset. For example, we can train an unpruned C4.5 tree (J48) on a given dataset data. The training is done via the buildClassifier(Instances) method.

```java
import weka.classifiers.trees.J48;
...
String[] options = new String[1];
options[0] = "-U";          // unpruned tree
J48 tree = new J48();       // new instance of tree
tree.setOptions(options);   // set the options
tree.buildClassifier(data); // build classifier
```

Incremental#

Classifiers implementing the weka.classifiers.UpdateableClassifier interface can be trained incrementally. This conserves memory, since the data doesn't have to be loaded into memory all at once. See the Javadoc of this interface to see which classifiers implement it.

The actual process of training an incremental classifier is fairly simple:

* Call buildClassifier(Instances) with the structure of the dataset (which may or may not contain any actual data rows).
* Subsequently call the updateClassifier(Instance) method to feed the classifier new Instance objects, one by one.

Here is an example using data from a weka.core.converters.ArffLoader to train weka.classifiers.bayes.NaiveBayesUpdateable :

```java
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import java.io.File;
...
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);

// train NaiveBayes
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
  nb.updateClassifier(current);
```

Evaluating#

Cross-validation#

If you only have a training set and no test set, you might want to evaluate the classifier using 10-fold cross-validation. This can easily be done via the Evaluation class. Here we seed the random selection of our folds for the CV with 1. Check out the Evaluation class for more information about the statistics it produces.

```java
import weka.classifiers.Evaluation;
import java.util.Random;
...
Evaluation eval = new Evaluation(newData);
eval.crossValidateModel(tree, newData, 10, new Random(1));
```

Note: The classifier (in our example tree) should not be trained when handed over to the crossValidateModel method. Why? If the classifier does not abide by the Weka convention that a classifier must be re-initialized every time the buildClassifier method is called (in other words, subsequent calls to buildClassifier always produce the same model from the same data), you will get inconsistent and worthless results. The crossValidateModel method takes care of training and evaluating the classifier. (It creates a copy of the classifier you hand over for each run of the cross-validation.)
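If you want an untouched copy of a configured classifier yourself, you can do what crossValidateModel does internally and copy it via AbstractClassifier.makeCopy. A minimal sketch, reusing tree and newData from above:

```java
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
...
// deep copy via serialization; the configured options are preserved
Classifier copy = AbstractClassifier.makeCopy(tree);
copy.buildClassifier(newData); // train the copy; the original tree stays untrained
```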

Train/test set#

In case you have a dedicated test set, you can train the classifier and then evaluate it on this test set. In the following example, a J48 is instantiated, trained and then evaluated. Some statistics are printed to stdout :

```java
import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
...
Instances train = ... // from somewhere
Instances test = ...  // from somewhere
// train classifier
Classifier cls = new J48();
cls.buildClassifier(train);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(train);
eval.evaluateModel(cls, test);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
```

Statistics#

Some methods for retrieving the results from the evaluation:

* correct(), incorrect(), pctCorrect(), pctIncorrect() — number and percentage of correctly/incorrectly classified instances (nominal classes)
* kappa() — the kappa statistic (nominal classes)
* correlationCoefficient() — the correlation coefficient (numeric classes)
* meanAbsoluteError(), rootMeanSquaredError() — error measures
* areaUnderROC(int classIndex) — the AUC for the given class label index
* toSummaryString(), toMatrixString() — textual summary statistics and confusion matrix
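For example, after the cross-validation above, a few of these can be printed like this (a minimal sketch reusing the eval object):

```java
System.out.println("Correct (%): " + eval.pctCorrect());
System.out.println("Incorrect (%): " + eval.pctIncorrect());
System.out.println("Kappa: " + eval.kappa());
System.out.println("MAE: " + eval.meanAbsoluteError());
System.out.println("RMSE: " + eval.rootMeanSquaredError());
```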