
Co-Training 2-View Semi-Supervised Classification

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

from mvlearn.semi_supervised import CTClassifier
from mvlearn.datasets import load_UCImultifeature

Load the UCI Multiple Digit Features Dataset as an Example for Semi-Supervised Learning

To simulate a semi-supervised learning scenario, randomly remove 98% of the labels.

data, labels = load_UCImultifeature(select_labeled=[0,1])

# Use only the first 2 views as an example
View0, View1 = data[0], data[1]

# Split both views into testing and training
View0_train, View0_test, labels_train, labels_test = train_test_split(View0, labels, test_size=0.33, random_state=42)
View1_train, View1_test, labels_train, labels_test = train_test_split(View1, labels, test_size=0.33, random_state=42)

# Randomly remove roughly 98% of the labels (all but 4 in the run shown here)
remove_idx = np.random.rand(len(labels_train),) < 0.98
labels_train[remove_idx] = np.nan
not_removed = np.where(remove_idx==False)
print("Remaining labeled sample labels: " + str(labels_train[not_removed]))
Remaining labeled sample labels: [1. 0. 1. 0.]
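Because the mask above is drawn from an unseeded random generator, the number and identity of the remaining labels change from run to run. The following is a minimal sketch of a reproducible variant, assuming a seeded NumPy generator and synthetic labels; the helper name `mask_labels` and its parameters are hypothetical and not part of mvlearn.

```python
import numpy as np

def mask_labels(labels, keep_fraction, seed):
    """Return a float copy of `labels` with roughly (1 - keep_fraction) set to NaN.

    Hypothetical helper for illustration; also returns a boolean mask of kept labels.
    """
    rng = np.random.default_rng(seed)
    masked = np.asarray(labels, dtype=float).copy()
    # Each sample is independently dropped with probability 1 - keep_fraction
    remove = rng.random(masked.shape[0]) >= keep_fraction
    masked[remove] = np.nan
    return masked, ~remove

# Synthetic binary labels standing in for labels_train
toy_labels = np.tile([0, 1], 100)
masked, kept = mask_labels(toy_labels, keep_fraction=0.02, seed=0)
print("Labeled samples kept:", int(kept.sum()))
print("Kept labels:", masked[kept])
```

With so few labels retained, it is worth checking that both classes are still represented before training; otherwise a classifier fitted on the labeled subset only ever sees one class.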

Co-Training on 2 Views vs. Single View Semi-Supervised Learning

Here, we use the default co-training classifier, which uses Gaussian naive Bayes classifiers for both views. We compare its performance to the single-view setting, in which the same base classifier is trained on only the labeled examples of one view, and to the naive technique of concatenating the two views and performing single-view learning.

In this case, concatenating the two views does not improve on the better single view (both reach an accuracy of 0.591), while multiview co-training outperforms them all (0.992).

############## Single view semi-supervised learning ##############
gnb0 = GaussianNB()
gnb1 = GaussianNB()
gnb2 = GaussianNB()

# Train on only the examples with labels
gnb0.fit(View0_train[not_removed,:].squeeze(), labels_train[not_removed])
y_pred0 = gnb0.predict(View0_test)
gnb1.fit(View1_train[not_removed,:].squeeze(), labels_train[not_removed])
y_pred1 = gnb1.predict(View1_test)
# Concatenate the 2 views for naive "multiview" learning
View01_train = np.hstack((View0_train[not_removed,:].squeeze(), View1_train[not_removed,:].squeeze()))
View01_test = np.hstack((View0_test, View1_test))
gnb2.fit(View01_train, labels_train[not_removed])
y_pred2 = gnb2.predict(View01_test)

print("Single View Accuracy on First View: {0:.3f}\n".format(accuracy_score(labels_test, y_pred0)))
print("Single View Accuracy on Second View: {0:.3f}\n".format(accuracy_score(labels_test, y_pred1)))
print("Naive Concatenated View Accuracy: {0:.3f}\n".format(accuracy_score(labels_test, y_pred2)))

######### Multi-view co-training semi-supervised learning #########
# Train a CTClassifier on all the labeled and unlabeled training data
ctc = CTClassifier()
ctc.fit([View0_train, View1_train], labels_train)
y_pred_ct = ctc.predict([View0_test, View1_test])

print("Co-Training Accuracy on 2 Views: {0:.3f}".format(accuracy_score(labels_test, y_pred_ct)))
Single View Accuracy on First View: 0.568

Single View Accuracy on Second View: 0.591

Naive Concatenated View Accuracy: 0.591

Co-Training Accuracy on 2 Views: 0.992
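The single-view code indexes with the tuple returned by np.where and then calls squeeze(), which can look odd at first. The sketch below uses small synthetic arrays (not the UCI data) to show why the squeeze is needed and how np.hstack builds the concatenated view; the names `X0`, `X1`, `y`, and `labeled` are illustrative only.

```python
import numpy as np

X0 = np.arange(12, dtype=float).reshape(6, 2)   # toy view 0: 6 samples, 2 features
X1 = np.arange(18, dtype=float).reshape(6, 3)   # toy view 1: 6 samples, 3 features
y = np.array([0.0, np.nan, 1.0, np.nan, np.nan, 1.0])

labeled = np.where(~np.isnan(y))     # a tuple holding one index array

# Indexing with the tuple plus a slice adds a leading axis of length 1
print(X0[labeled, :].shape)            # (1, 3, 2)
print(X0[labeled, :].squeeze().shape)  # (3, 2)

# Indexing with the index array itself avoids the extra axis
print(X0[labeled[0]].shape)            # (3, 2)

# Naive "multiview" features: stack the columns of both views side by side
X01 = np.hstack((X0[labeled[0]], X1[labeled[0]]))
print(X01.shape)                       # (3, 5)
```

One caveat of the squeeze() approach: if only a single labeled sample remains, squeeze() also removes the sample axis and returns a 1-D array, so indexing with the index array directly is the more robust choice.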

Select Different Base Classifiers for the Views and Change the CTClassifier Parameters

Now, we use a random forest classifier for each view, configured with different attributes per view. Furthermore, we manually select the number of positive (p) and negative (n) examples chosen each round in the co-training process, the size of the unlabeled pool they are drawn from, and the number of training iterations to perform.

In this case, concatenating the two views outperforms the single-view methods, but multiview co-training still performs the best.

############## Single view semi-supervised learning ##############
rfc0 = RandomForestClassifier(n_estimators=100, bootstrap=True)
rfc1 = RandomForestClassifier(n_estimators=6, bootstrap=False)
rfc2 = RandomForestClassifier(n_estimators=100, bootstrap=False)

# Train on only the examples with labels
rfc0.fit(View0_train[not_removed,:].squeeze(), labels_train[not_removed])
y_pred0 = rfc0.predict(View0_test)
rfc1.fit(View1_train[not_removed,:].squeeze(), labels_train[not_removed])
y_pred1 = rfc1.predict(View1_test)
# Concatenate the 2 views for naive "multiview" learning
View01_train = np.hstack((View0_train[not_removed,:].squeeze(), View1_train[not_removed,:].squeeze()))
View01_test = np.hstack((View0_test, View1_test))
rfc2.fit(View01_train, labels_train[not_removed])
y_pred2 = rfc2.predict(View01_test)

print("Single View Accuracy on First View: {0:.3f}\n".format(accuracy_score(labels_test, y_pred0)))
print("Single View Accuracy on Second View: {0:.3f}\n".format(accuracy_score(labels_test, y_pred1)))
print("Naive Concatenated View Accuracy: {0:.3f}\n".format(accuracy_score(labels_test, y_pred2)))

######### Multi-view co-training semi-supervised learning #########
rfc0 = RandomForestClassifier(n_estimators=100, bootstrap=True)
rfc1 = RandomForestClassifier(n_estimators=6, bootstrap=False)
ctc = CTClassifier(rfc0, rfc1, p=2, n=2, unlabeled_pool_size=20, num_iter=100)
ctc.fit([View0_train, View1_train], labels_train)
y_pred_ct = ctc.predict([View0_test, View1_test])

print("Co-Training Accuracy: {0:.3f}".format(accuracy_score(labels_test, y_pred_ct)))
Single View Accuracy on First View: 0.902

Single View Accuracy on Second View: 0.871

Naive Concatenated View Accuracy: 0.977

Co-Training Accuracy: 0.992
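To make the roles of p, n, the unlabeled pool size, and the number of iterations more concrete, here is a simplified sketch of a two-view co-training loop on synthetic data. It is not mvlearn's implementation: the function `co_train`, its parameter names, and the way conflicting pseudo-labels are resolved (the second view simply overwrites the first) are assumptions made for illustration, and it assumes binary labels 0/1 with NaN marking unlabeled samples.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X0, X1, y, p=2, n=2, pool_size=20, num_iter=10, seed=0):
    """Simplified two-view co-training sketch; NaN in y marks an unlabeled sample."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float).copy()
    clf0, clf1 = GaussianNB(), GaussianNB()
    for _ in range(num_iter):
        labeled = ~np.isnan(y)
        clf0.fit(X0[labeled], y[labeled])
        clf1.fit(X1[labeled], y[labeled])
        unlabeled = np.flatnonzero(~labeled)
        if unlabeled.size == 0:
            break
        # Draw a small random pool of unlabeled samples for this round
        pool = rng.choice(unlabeled, size=min(pool_size, unlabeled.size), replace=False)
        for clf, X in ((clf0, X0), (clf1, X1)):
            proba_pos = clf.predict_proba(X[pool])[:, 1]  # P(class 1) in this view
            order = np.argsort(proba_pos)
            y[pool[order[-p:]]] = 1.0  # p most confident positives become pseudo-labels
            y[pool[order[:n]]] = 0.0   # n most confident negatives become pseudo-labels
    # Final fit on the original labels plus all pseudo-labels
    labeled = ~np.isnan(y)
    clf0.fit(X0[labeled], y[labeled])
    clf1.fit(X1[labeled], y[labeled])
    return clf0, clf1

# Synthetic two-view data: each view is a noisy 2-D blob per class
rng = np.random.default_rng(0)
y_true = np.repeat([0.0, 1.0], 100)
X0 = rng.normal(loc=y_true[:, None] * 3.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=y_true[:, None] * 3.0, scale=1.0, size=(200, 2))

# Keep only two labels per class
y = np.full(200, np.nan)
y[[0, 1, 100, 101]] = y_true[[0, 1, 100, 101]]

clf0, clf1 = co_train(X0, X1, y, p=2, n=2, pool_size=20, num_iter=10)

# Combine the views by averaging their class probabilities
proba = (clf0.predict_proba(X0) + clf1.predict_proba(X1)) / 2
y_pred = clf0.classes_[np.argmax(proba, axis=1)]
print("Accuracy on the synthetic data: {0:.3f}".format(np.mean(y_pred == y_true)))
```

In this sketch each round adds at most 2 * (p + n) pseudo-labels, so larger p and n grow the labeled set faster at the cost of admitting less confident predictions, while a larger pool gives each classifier more candidates to choose its most confident examples from.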

Get the prediction probabilities for all the examples

y_pred_proba = ctc.predict_proba([View0_test, View1_test])
print("Full y_proba shape = " + str(y_pred_proba.shape))
print("\nFirst 10 class probabilities:\n")
print(y_pred_proba[:10])
Full y_proba shape = (132, 2)

First 10 class probabilities:

[[1.         0.        ]
 [0.945      0.055     ]
 [0.005      0.995     ]
 [0.09       0.91      ]
 [0.16833333 0.83166667]
 [0.995      0.005     ]
 [0.955      0.045     ]
 [0.955      0.045     ]
 [0.28       0.72      ]
 [0.925      0.075     ]]
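A common way to use such probabilities is to take the most probable class per row and to flag low-confidence predictions. The sketch below applies this to the ten rows printed above, assuming the columns correspond to classes 0 and 1 in that order; it illustrates the general technique and is not a statement about how ctc.predict works internally. The 0.9 threshold is arbitrary.

```python
import numpy as np

# The ten probability rows printed above
proba = np.array([
    [1.0, 0.0],
    [0.945, 0.055],
    [0.005, 0.995],
    [0.09, 0.91],
    [0.16833333, 0.83166667],
    [0.995, 0.005],
    [0.955, 0.045],
    [0.955, 0.045],
    [0.28, 0.72],
    [0.925, 0.075],
])
classes = np.array([0, 1])  # assumed column order

hard_labels = classes[np.argmax(proba, axis=1)]  # most probable class per row
confidence = proba.max(axis=1)                   # probability of that class

print(hard_labels)                        # [0 0 1 1 1 0 0 0 1 0]
print(np.flatnonzero(confidence < 0.9))   # [4 8]
```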