
In this article, we’ll look at how to build and evaluate an unsupervised model. We’ll also look at semi-supervised learning, the difference between unsupervised and semi-supervised learning, how to build a semi-supervised model, and how to make predictions using a semi-supervised model.

Working with k-means clustering

Let’s look at how to build a clustering model. We’ll be building an unsupervised model using k-means clustering.

We will use the Instances class and the DataSource class, just as we did in previous chapters. Since we are working with clustering, we will use the weka.clusterers package to import the SimpleKMeans class, as follows:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.clusterers.SimpleKMeans;

First, we’ll read our ARFF file into a DataSource object and load it into an Instances object. Unlike classification, where we also had to tell Weka which attribute is the class attribute (the target variable), clustering needs no class attribute, so this is all we have to do to prepare the data. Next, we’ll create an object for our k-means clustering and tell Weka how many clusters we want to create. Let’s suppose that we want three clusters: we’ll call setNumClusters(3) on our k-means object, build the clusterer by passing the dataset to buildClusterer, and then print the model, as follows:

public static void main(String[] args) {
    try {
        // Load the ARFF file into memory
        DataSource src = new DataSource("/Users/admin/Documents/NetBeansProjects/Datasets/weather.arff");
        Instances dt = src.getDataSet();
        // Configure k-means with three clusters and build it on the dataset
        SimpleKMeans model = new SimpleKMeans();
        model.setNumClusters(3);
        model.buildClusterer(dt);
        // Print the trained model
        System.out.println(model);
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
}

After running it, we will see the following output:

In the output, we can see that three clusters were first created with initial values. After the clustering has run, we get the final three clusters: Cluster 0 contains 7.0 instances, Cluster 1 contains 3.0 instances, and Cluster 2 contains 4.0 instances. Since we are not providing a class to our clustering algorithm, the algorithm simply tries to divide similar-looking data into groups (which we call clusters). This is how clustering is done.
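
To make the assign-and-update loop behind k-means concrete, here is a minimal plain-Java sketch. It is not Weka’s SimpleKMeans implementation: the class name KMeansSketch, the sample points, and the choice of the first k points as initial centroids are all assumptions made for illustration.

```java
import java.util.Arrays;

// Minimal k-means sketch; an illustration only, not Weka's SimpleKMeans.
class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Returns the cluster index assigned to each point.
    // Assumption: the first k points serve as the initial centroids.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(points[p], centroids[c]) < dist2(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point moved, so the clustering has converged
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave the centroid of an empty cluster where it is
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up 2-D points forming three obvious groups.
        double[][] points = {
            {1.0, 1.0}, {5.0, 5.0}, {9.0, 1.0},
            {1.2, 0.8}, {5.1, 4.9}, {8.8, 1.1}
        };
        System.out.println(Arrays.toString(cluster(points, 3, 100))); // prints [0, 1, 2, 0, 1, 2]
    }
}
```

The sketch handles only numeric attributes and uses a fixed initialization so that the result is reproducible.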

Evaluating a clustering model

Now, we’ll look at how to evaluate a clustering model that has been trained. Let’s look at the code and see how this is done.

We’ll be using the following classes:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.clusterers.SimpleKMeans;
import weka.clusterers.ClusterEvaluation;

We’ll use the ClusterEvaluation class from the weka.clusterers package for evaluation.

First, we will read our dataset into our DataSource object and assign it to the Instances object. Then, we’ll create our k-means object and specify the number of clusters that we want to create. Next, we will train our clustering algorithm using the buildClusterer method; then, we’ll print it using println. This is similar to what you saw earlier:

public static void main(String[] args) {
    try {
        DataSource src = new DataSource("/Users/admin/Documents/NetBeansProjects/ClusterEval/weather.arff");
        Instances dt = src.getDataSet();
        SimpleKMeans model = new SimpleKMeans();
        model.setNumClusters(3);
        model.buildClusterer(dt);
        System.out.println(model);

Next, we’ll create an object for the ClusterEvaluation class. Then, we’ll point a new DataSource object at a test dataset and load it into memory as an Instances object. We pass the trained clusterer to the evaluation object using setClusterer, and then evaluate it by passing the test dataset to the evaluateClusterer method. Finally, we print the resulting string, which reports the clusters and how the test instances fall into them:

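        ClusterEvaluation eval = new ClusterEvaluation();
        DataSource src1 = new DataSource("/Users/admin/Documents/NetBeansProjects/ClusterEval/weather.test.arff");
        Instances tdt = src1.getDataSet();
        eval.setClusterer(model);
        eval.evaluateClusterer(tdt);
        System.out.println(eval.clusterResultsToString()); // print the evaluation summary
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
}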

Running the preceding code will result in the following output:

The eval object reports how the test instances are distributed across the clusters: 22% fall into the first cluster, 33% into the second cluster, and 44% into the third cluster. The total number of clusters is 3.
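
One way to read these percentages is as the share of test instances assigned to each cluster. The sketch below works this out in plain Java. It assumes a hypothetical test set of nine instances split 2, 3, and 4 across the clusters (the article does not state the test set size), and the class name ClusterShareSketch is made up for illustration.

```java
import java.util.Arrays;

// Sketch: turn per-instance cluster assignments into per-cluster percentages.
class ClusterShareSketch {

    // Counts the instances in each cluster and converts the counts to rounded percentages.
    static long[] percentages(int[] assignments, int numClusters) {
        int[] counts = new int[numClusters];
        for (int cluster : assignments) {
            counts[cluster]++;
        }
        long[] pct = new long[numClusters];
        for (int c = 0; c < numClusters; c++) {
            pct[c] = Math.round(100.0 * counts[c] / assignments.length);
        }
        return pct;
    }

    public static void main(String[] args) {
        // Hypothetical assignments for nine test instances (2, 3, and 4 per cluster).
        int[] assignments = {0, 0, 1, 1, 1, 2, 2, 2, 2};
        System.out.println(Arrays.toString(percentages(assignments, 3))); // prints [22, 33, 44]
    }
}
```

Because each share is rounded separately, the percentages need not add up to exactly 100.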

An introduction to semi-supervised learning

Semi-supervised learning is a class of supervised learning that takes unlabeled data into consideration. If we have a very large amount of data, we most likely want to apply learning to it. However, training on that data with supervised learning is a problem, because a supervised learning algorithm always requires a target variable: a class that has been assigned to every instance in the dataset.

Suppose that we have millions of instances of a particular type of data. Assigning a class to these instances would be a very big problem. Therefore, we’ll take a small set from that particular data and manually tag the data (meaning that we’ll manually provide a class for the data). Once we have done this, we’ll train our model with it, so that we can work with the unlabeled data (because we now have a small set of labelled data, which we created). Typically, a small amount of labelled data is used with a large amount of unlabeled data. Semi-supervised learning falls between supervised and unsupervised learning because we are taking a small amount of data that has been labelled and training our model with it; we are then trying to assign classes by using the trained model on the unlabeled data.

Many machine learning researchers have found that unlabeled data, when used in conjunction with a small amount of labelled data, can produce considerable improvements in learning accuracy. This is how semi-supervised learning works: with a combination of supervised learning and unsupervised learning, wherein we take a very small amount of data, label it, try to classify it, and then try to fit the unlabeled data into the labelled data.
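
The idea of labelling a small set, training on it, and then fitting the unlabeled data can be sketched as a simple self-training loop. The example below is a plain-Java illustration under stated assumptions, not the method used by the Weka package discussed later: it uses a nearest-mean classifier on one-dimensional data, and the class name SelfTrainingSketch, the sample values, and the minMargin confidence threshold are all hypothetical.

```java
import java.util.Arrays;

// Self-training sketch with a nearest-mean classifier on 1-D data.
class SelfTrainingSketch {

    static final int UNLABELED = -1;

    // Mean of all values that currently carry the given label.
    // Assumes each class has at least one labelled example.
    static double classMean(double[] x, int[] labels, int label) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (labels[i] == label) {
                sum += x[i];
                n++;
            }
        }
        return sum / n;
    }

    // Repeatedly labels the unlabeled points that the current model is confident about,
    // then refits the class means on the enlarged labelled set.
    static int[] selfTrain(double[] x, int[] initialLabels, double minMargin) {
        int[] labels = initialLabels.clone();
        boolean added = true;
        while (added) {
            added = false;
            double m0 = classMean(x, labels, 0);
            double m1 = classMean(x, labels, 1);
            for (int i = 0; i < x.length; i++) {
                if (labels[i] != UNLABELED) {
                    continue;
                }
                double d0 = Math.abs(x[i] - m0);
                double d1 = Math.abs(x[i] - m1);
                // Confidence here is simply the gap between the two distances.
                if (Math.abs(d0 - d1) >= minMargin) {
                    labels[i] = d0 < d1 ? 0 : 1;
                    added = true;
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 9.0, 2.0, 8.0, 3.0, 7.0};
        // Only the first two points are labelled by hand.
        int[] labels = {0, 1, UNLABELED, UNLABELED, UNLABELED, UNLABELED};
        System.out.println(Arrays.toString(selfTrain(x, labels, 1.0))); // prints [0, 1, 0, 1, 0, 1]
    }
}
```

Points whose confidence never reaches the threshold stay unlabeled, which is one common way to keep uncertain guesses from contaminating the training set.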

The difference between unsupervised and semi-supervised learning

In this section, we’ll look at the differences between unsupervised learning and semi-supervised learning.

Unsupervised learning develops a model based on unlabeled data, whereas semi-supervised learning employs both labelled and unlabeled data.

We use expectation maximization, hierarchical clustering, and k-means clustering algorithms in unsupervised learning, whereas in semi-supervised learning, we apply either active learning or bootstrapping algorithms.

In Weka, we can perform semi-supervised learning using the collective-classification package. We will look at installing the collective-classification package later in this chapter, and you’ll see how you can perform semi-supervised learning using collective classification.

Self-training and co-training machine learning models

The very first thing that we’ll do is download a package for semi-supervised learning, then we will create a classifier for a semi-supervised model.

Downloading a semi-supervised package

Go to https://github.com/fracpete/collective-classification-weka-package to get the collective-classification Weka package (it’s an old repository that I found on GitHub, but it will work with our model). This is a semi-supervised learning package that is available for Weka.

There are two ways to install the package, as follows:

  • Download the source from GitHub and compile it, then create a JAR file
  • Go to the Weka package manager, and install the collective classification from there

After performing one of the preceding methods, you’ll have a JAR file. You will need this JAR file to train the classifier. The source code that we’ll be getting will provide the JAR file with the code. Let’s look at how this is done.

Creating a classifier for semi-supervised models

Let’s start with the following code:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.collective.functions.LLGC;

The very first things that we need are the Instances and DataSource classes, which we have been using since the beginning. The third class that we need is an LLGC class, which is available in the functions package of the collective-classification JAR file.

Therefore, we need to import two JAR files into the project; one is the conventional weka.jar file that we have already been using, and the second one is the semi-supervised learning file, the collective-classification-<date>.jar file, as seen in the following screenshot:

Now, we will create a DataSource object, and we’ll assign our ARFF file to the DataSource object, as follows:

try {
    // Load the dataset and mark the last attribute as the class attribute
    DataSource src = new DataSource("weather.arff");
    Instances dt = src.getDataSet();
    dt.setClassIndex(dt.numAttributes() - 1);
    // Build the LLGC model and print the kinds of data it can handle
    LLGC model = new LLGC();
    model.buildClassifier(dt);
    System.out.println(model.getCapabilities());
} catch (Exception e) {
    System.out.println("Error!!!!\n" + e.getMessage());
}

Then, we will create an Instances object, and we will assign the ARFF file to this Instances object and get our data into the memory. Once our dataset is available in the memory, we’ll tell Weka which attribute is the class attribute that we have been using in the classification. Next, we will initialize the LLGC object. LLGC is a class for performing semi-supervised learning. We will use model.buildClassifier(dt), and we will print the capabilities of the classifier.

The capabilities will be printed, as shown in the following screenshot:

As you can see in the preceding screenshot, these are the attribute and class types that the LLGC class can handle when it builds a semi-supervised model. This is how we will build a semi-supervised model.

Making predictions with semi-supervised machine learning models

Now, we’ll look into how to make predictions using our trained model. Consider the following code:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.collective.functions.LLGC;
import weka.classifiers.collective.evaluation.Evaluation;

We will be importing two JAR libraries, as follows:

  • The weka.jar library
  • The collective-classification-<date>.jar library

Therefore, we will take the two base classes, Instances and DataSource, and we will use the LLGC class (since we have trained our model using LLGC) from the collective-classification package, as well as the Evaluation class from the same package.

We will first assign an ARFF file to our DataSource object; we’ll read it into the memory, in an Instances object. We’ll assign a class attribute to our Instances object, and then, we will build our model:

public static void main(String[] args) {
    try {
        DataSource src = new DataSource("weather.arff");
        Instances dt = src.getDataSet();
        dt.setClassIndex(dt.numAttributes() - 1);
        LLGC model = new LLGC();
        model.buildClassifier(dt);
        System.out.println(model.getCapabilities());
        Evaluation eval = new Evaluation(dt);
        DataSource src1 = new DataSource("weather.test.arff");
        Instances tdt = src1.getDataSet();
        tdt.setClassIndex(tdt.numAttributes() - 1);
        eval.evaluateModel(model, tdt);
        System.out.println(eval.toSummaryString("Evaluation results:\n", false));
        System.out.println("Correct % = " + eval.pctCorrect());
        System.out.println("Incorrect % = " + eval.pctIncorrect());
        System.out.println("AUC = " + eval.areaUnderROC(1));
        System.out.println("kappa = " + eval.kappa());
        System.out.println("MAE = " + eval.meanAbsoluteError());
        System.out.println("RMSE = " + eval.rootMeanSquaredError());
        System.out.println("RAE = " + eval.relativeAbsoluteError());
        System.out.println("RRSE = " + eval.rootRelativeSquaredError());
        System.out.println("Precision = " + eval.precision(1));
        System.out.println("Recall = " + eval.recall(1));
        System.out.println("fMeasure = " + eval.fMeasure(1));
        System.out.println("Error Rate = " + eval.errorRate());
        // the confusion matrix
        System.out.println(eval.toMatrixString("=== Overall Confusion Matrix ===\n"));
    } catch (Exception e) {
        System.out.println("Error!!!!\n" + e.getMessage());
    }
}

Once we have done this, we will create an object for our Evaluation class, and we’ll specify which dataset we want to perform the evaluation on. Hence, we’ll pass our dataset to the Evaluation class constructor. Then, we will create a new object for the DataSource class, and we will take the weather.test.arff file for testing. We will create an Instances object, tdt, and assign the dataset to the test dataset, tdt.

Then, we will need to inform Weka which attribute in the tdt object is our class attribute; therefore, we will call the setClassIndex method. Then, we will use the evaluateModel method of our Evaluation class and pass in the model and our test dataset.

Once this is done, we will print the Evaluation results all at once; or, if you want, you can individually print the results, as we did in the semi-supervised learning exercise.

Let’s run the code. We will get the following output:

Our model was built successfully. Once the model was built, we printed the entire results, and then we individually printed the results and the confusion matrix. This is how a model is built with semi-supervised data.
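
To show where several of the individually printed figures come from, the following plain-Java sketch derives accuracy, precision, recall, F-measure, and kappa from a confusion matrix using their standard definitions. It is an illustration rather than Weka’s own code: the class name MetricsSketch and the 2x2 counts are made up, and the matrix is assumed to be indexed as matrix[actual][predicted].

```java
// Sketch: derive common evaluation metrics from a square confusion matrix.
// Assumption: m[actual][predicted] holds instance counts.
class MetricsSketch {

    // Fraction of instances on the diagonal (correctly classified).
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int a = 0; a < m.length; a++) {
            for (int p = 0; p < m.length; p++) {
                total += m[a][p];
                if (a == p) {
                    correct += m[a][p];
                }
            }
        }
        return (double) correct / total;
    }

    // Of everything predicted as cls, the fraction that really is cls.
    static double precision(int[][] m, int cls) {
        int predicted = 0;
        for (int a = 0; a < m.length; a++) {
            predicted += m[a][cls];
        }
        return predicted == 0 ? 0.0 : (double) m[cls][cls] / predicted;
    }

    // Of everything that really is cls, the fraction predicted as cls.
    static double recall(int[][] m, int cls) {
        int actual = 0;
        for (int p = 0; p < m.length; p++) {
            actual += m[cls][p];
        }
        return actual == 0 ? 0.0 : (double) m[cls][cls] / actual;
    }

    // Harmonic mean of precision and recall.
    static double fMeasure(int[][] m, int cls) {
        double p = precision(m, cls);
        double r = recall(m, cls);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Cohen's kappa: agreement beyond what the row and column totals predict by chance.
    static double kappa(int[][] m) {
        int n = m.length;
        double total = 0;
        double[] rowSum = new double[n];
        double[] colSum = new double[n];
        for (int a = 0; a < n; a++) {
            for (int p = 0; p < n; p++) {
                total += m[a][p];
                rowSum[a] += m[a][p];
                colSum[p] += m[a][p];
            }
        }
        double expected = 0;
        for (int i = 0; i < n; i++) {
            expected += rowSum[i] * colSum[i];
        }
        expected /= total * total;
        return (accuracy(m) - expected) / (1 - expected);
    }

    public static void main(String[] args) {
        int[][] m = {{3, 1}, {2, 4}}; // made-up counts for a two-class problem
        System.out.println("Accuracy  = " + accuracy(m));
        System.out.println("Precision = " + precision(m, 1));
        System.out.println("Recall    = " + recall(m, 1));
        System.out.println("fMeasure  = " + fMeasure(m, 1));
        System.out.println("kappa     = " + kappa(m));
    }
}
```

The class index 1 passed to precision, recall, and fMeasure mirrors the argument used in the article’s calls such as eval.precision(1).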

Source: Artificial Intelligence on Medium