Iris Flower Species Prediction using Semi-Supervised Classification
As an assistant professor of botany at Washington University, Edgar Anderson was interested in researching phenotypical variations found in flowers, specifically in the Iris genus. He hypothesized that the slight variations in measurements found within the Iris versicolor species were indicative of a new, undiscovered species. To test his theory, he traveled to the Gaspé Peninsula to pick flowers “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus” [1]. In addition to picking Iris versicolor, he also collected a similar species from the same genus, Iris setosa, in order to compare differences in measurements. Indeed, thanks to this work, he discovered the existence of a new species, Iris virginica, which the scientific community had previously mistaken for one and the same species.
This dataset has since become a popular tool for testing various classification techniques in the fields of machine learning and statistical analysis. What makes this dataset particularly interesting is that the most commonly used modern, unsupervised inference techniques perform exceptionally poorly when applied to it. The reason is the same one for which the scientific community previously mistook versicolor and virginica for a single species of flower. The following plot (showing setosa in cyan, versicolor in pink, and virginica in gray) helps visualize why distinguishing these two species without predefined labels is particularly difficult.
In this article, we want to investigate whether an accurate “middle-ground” technique can exist between unsupervised and supervised classification. To this end, we will remove some of the labels attributed to the given data points and split the set into labeled and unlabeled data, using a 75-25 split. To acquire the dataset, we import the ‘sklearn’ Python package.
import sklearn.datasets
import sklearn.model_selection

dataset = sklearn.datasets.load_iris()
X, y = dataset['data'], dataset['target']
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.25)
After splitting the data, we next need to write a few functions to carry out our classification of the data points in X_test. To do this, we first need to think about how we will perform this classification. Each data point consists of four features, namely sepal length/width and petal length/width. We can see in the above scatter plot that although the three clusters share some overlap, the purity of almost all clusters remains well over 50%. Therefore, given a point with these four features and no label, the best way to classify this unknown point is to look at its nearest neighbors. For each vector in X_test, we will look at its k-nearest neighbors, where k is some fixed number which we will keep relatively small in order to estimate worse-case performance. To assist us in this computation, we will define and use the following four functions.
def euclidean_distance(x1, x2)
def get_neighbors_labels(X_train, y_train, X_test, k)
def get_response(neighbors_labels, num_classes=3)
def compute_accuracy(y_pred, y_test)
The code for these functions, as well as for the whole project, can be found in the link at the bottom of the page. The first function computes pairwise Euclidean distances between all row vectors of two tensors. The second retrieves the labels of the k-nearest neighbors for each unlabeled data point in X_test. The third uses the labels retrieved by the previous function to predict the unknown labels. And finally, the fourth computes the total accuracy of our predictions by comparing the predicted labels to the true (previously deleted) labels.
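Since the exact implementation lives in the linked repository, the following is only a minimal sketch of how these four functions might look using NumPy; the signatures match the ones above, but the bodies are my own illustrative assumptions rather than the repository's code.

import numpy as np

def euclidean_distance(x1, x2):
    # Pairwise Euclidean distances between the rows of x1 and the rows of x2;
    # the result has shape (len(x1), len(x2)).
    return np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=-1)

def get_neighbors_labels(X_train, y_train, X_test, k):
    # For each test point, collect the labels of its k nearest training points.
    distances = euclidean_distance(X_test, X_train)
    nearest = np.argsort(distances, axis=1)[:, :k]
    return y_train[nearest]

def get_response(neighbors_labels, num_classes=3):
    # Majority vote: the predicted class is the most frequent neighbor label.
    counts = np.apply_along_axis(
        lambda row: np.bincount(row, minlength=num_classes), 1, neighbors_labels)
    return np.argmax(counts, axis=1)

def compute_accuracy(y_pred, y_test):
    # Fraction of predictions that match the withheld true labels.
    return np.mean(y_pred == y_test)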
In order to carry out our predictions, we now simply execute the four functions and analyze our results.
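Concretely, a run along the following lines produces the output below (the print statements are my own formatting assumptions; k = 3 is the small value used throughout).

k = 3
neighbors_labels = get_neighbors_labels(X_train, y_train, X_test, k)
y_pred = get_response(neighbors_labels)
print(f">> Training set: {len(X_train)} samples")
print(f">> Test set: {len(X_test)} samples")
print(f">> Accuracy = {compute_accuracy(y_pred, y_test)}")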
>> Training set: 112 samples
>> Test set: 38 samples
>> Accuracy = 0.9473684210526316
As we can see, a relatively high accuracy can be attained using a relatively low value of k (here, k = 3). If we instead consider the worst-case scenario and use the 1-nearest neighbor, we find that the accuracy drops to the following.
>> Training set: 112 samples
>> Test set: 38 samples
>> Accuracy = 0.8947368421052632
In machine learning, we almost always want to keep the computational complexity as low as possible, given that we often deal with massive datasets. Since we can afford the complexity in this case, we would additionally like to investigate a higher value of k, namely 16. We expect to see a further increase in accuracy.
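Only the neighborhood size changes; a sketch of the rerun, reusing the functions from above:

y_pred = get_response(get_neighbors_labels(X_train, y_train, X_test, k=16))
print(f">> Accuracy = {compute_accuracy(y_pred, y_test)}")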
>> Training set: 112 samples
>> Test set: 38 samples
>> Accuracy = 0.9736842105263158
We are also interested in the effect of the training/test split in less favorable scenarios, where less labeled data is available. In the following, we execute the program for a 50-50 labeled-to-unlabeled split and return to our value of k = 3.
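Only the split needs to change; note that without a fixed random_state the exact split (and hence the reported accuracy) will vary from run to run:

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.5)
y_pred = get_response(get_neighbors_labels(X_train, y_train, X_test, k=3))
print(f">> Accuracy = {compute_accuracy(y_pred, y_test)}")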
>> Training set: 75 samples
>> Test set: 75 samples
>> Accuracy = 0.9466666666666667
This is higher than I personally anticipated, and very close to the 75-25 split. Lastly, we try a 10-90 split.
>> Training set: 15 samples
>> Test set: 135 samples
>> Accuracy = 0.8962962962962963
Surprisingly high. We note that this accuracy is higher than with our 75-25 split and k = 1. We can conclude that even under unfavorable conditions, this semi-supervised classification technique performs well in comparison to the best unsupervised techniques, which fail to find the clusters intrinsic to the unlabeled data.
Git: https://github.com/ivanvgreiff/PersonalProjects/tree/main/1.%20kNN%20over%20the%20Iris%20Dataset