ValueError: The number of class labels must be greater than one in Passive Aggressive Classifier
I am trying to implement an online classifier using the PassiveAggressiveClassifier in scikit-learn with the 20 newsgroups dataset. I am very new to this, so I am not sure whether I have implemented it properly. I wrote a small script, but when I execute it I keep getting the error:
Traceback (most recent call last):
  File "/home/suleka/Documents/RNN models/passiveagressive.py", line 100, in <module>
    clf.fit(X, y)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/passive_aggressive.py", line 225, in fit
    coef_init=coef_init, intercept_init=intercept_init)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py", line 444, in _fit
    classes, sample_weight, coef_init, intercept_init)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py", line 407, in _partial_fit
    raise ValueError("The number of class labels must be "
ValueError: The number of class labels must be greater than one.
I checked most of the related posts on Stack Overflow, and they suggested that this error appears when there is only one unique class. So I ran np.unique(labels) and it showed 20 classes (the 20 newsgroups):
np.unique(labels)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
Can anyone help me out with this error and please let me know if I have implemented it wrong.
My code is shown below:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.datasets import make_classification
from string import punctuation
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from collections import Counter
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

nltk.download('stopwords')

seed = 42
np.random.seed(seed)


def preProcess():
    newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features = vectorizer.fit_transform(newsgroups_data.data)
    labels = newsgroups_data.target
    return features, labels


if __name__ == '__main__':
    features, labels = preProcess()
    X_train, y_train = shuffle(features, labels, random_state=seed)
    clf = PassiveAggressiveClassifier(random_state=seed)
    n, d = X_train.shape
    print(np.unique(labels))
    error = 0
    iteration = 0
    for i in range(n):
        print(iteration)
        X, y = X_train[i:i + 1], y_train[i:i + 1]
        clf.fit(X, y)
        pred = clf.predict(X)
        print(pred)
        print(y)
        if y - pred != 0:
            error += 1
        iteration += iteration
    print(error)
    print(np.divide(error, n, dtype=np.float))
Thank you in advance!
1 Answer
The issue lies in this line:
X, y = X_train[i:i + 1], y_train[i:i + 1]
which is inside your for loop, i.e. after you have called np.unique(labels) and comfortably found that you do indeed have all 20 labels...
Looking closely, you will realize that this line results in an X and a y of only one element each (X_train[i] and y_train[i], respectively; in fact, since the error arguably happens in the very first iteration, for i=0, you end up with only X_train[0] and y_train[0]), which of course should not be the case when fitting a model; hence, the error message correctly points out that you have only one label in your set (because you have only one sample, that is)...
To convince yourself that this is indeed the case, just insert a print(np.unique(y)) before your clf.fit() call; it will print only one label.
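As a quick illustration of that check, here is a minimal sketch that uses a made-up label array instead of the real 20 newsgroups targets (the array and its size are assumptions chosen only for the demonstration):

```python
import numpy as np

# Hypothetical stand-in for the shuffled y_train: 1000 labels, 20 classes.
rng = np.random.RandomState(42)
y_train = rng.permutation(np.arange(1000) % 20)

i = 0
y = y_train[i:i + 1]  # a slice of length 1, as in the question's loop

print(np.unique(y_train).size)  # distinct labels in the full array
print(np.unique(y).size)        # distinct labels in the one-sample slice
```

The full array contains every class, but any slice of length one can only ever contain a single label, which is exactly what fit() complains about.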
It is quite unclear what exactly you are trying to achieve with your for loop; if you are trying to train your classifier on successive pieces of your dataset, you could try changing the [i:i+1] indices to [i:i+k] for some large enough k, but for a 20-label dataset this is not so simple, as you have to ensure that all 20 labels are present in each call to clf.fit(), otherwise you will end up comparing apples to oranges...
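To see why the choice of k matters, the sketch below counts how many chunks of size k lack at least one class. The label array and the helper chunks_missing_classes are hypothetical, introduced only for this illustration:

```python
import numpy as np

# Hypothetical shuffled labels: 2000 samples, 20 classes, 100 samples per class.
rng = np.random.RandomState(42)
y_train = rng.permutation(np.arange(2000) % 20)
all_classes = np.unique(y_train)


def chunks_missing_classes(y, k):
    """Return the start indices of the chunks y[i:i+k] that lack at least one class."""
    missing = []
    for start in range(0, len(y), k):
        if np.unique(y[start:start + k]).size < all_classes.size:
            missing.append(start)
    return missing


# A chunk of 10 samples can never hold 20 distinct labels, so every chunk fails.
print(len(chunks_missing_classes(y_train, k=10)))
# Larger chunks are far more likely, though not guaranteed, to cover all classes.
print(len(chunks_missing_classes(y_train, k=500)))
```

Any chunk reported by the helper would make fit() train on a different label set than its neighbours, which is the apples-to-oranges situation described above.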
I strongly suggest starting simple: remove the for loop, fit your classifier to the whole of your training set (clf.fit(X_train, y_train)), and check the scikit-learn documentation for the available performance metrics you can use...
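A minimal sketch of that simpler batch setup follows. It assumes synthetic data from make_classification (so that it runs without downloading 20 newsgroups), a held-out test split, and accuracy_score as the metric; none of these choices come from the question itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 5-class problem standing in for the TF-IDF features and targets.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=20,
                           n_classes=5, random_state=42)

# Keep a stratified held-out set so the score is not measured on training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = PassiveAggressiveClassifier(random_state=42)
clf.fit(X_train, y_train)  # one call on the whole training set, no loop

acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

With the real data, X and y would be the features and labels returned by preProcess() in the question.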
EDIT I just noticed the detail:
I am trying to implement an online classifier
Well, what you are trying to do is certainly not online training (which is a huge topic by itself), as your for loop simply retrains (or tries to, at least) a new classifier from scratch during each iteration.
As I already said, start simple; try to firmly grasp the principles of simple batch training first, before moving to the much more advanced topic of online training, which is definitely not a beginner's one...
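For reference only, and not as part of the question's code: scikit-learn's incremental interface is partial_fit, which updates the existing model instead of refitting it, and takes the complete list of classes so that a single-sample call is valid. The sketch below assumes synthetic data and a simple predict-then-update error count; treat it as an illustration of the mechanism rather than a recommended evaluation protocol:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

# Synthetic 5-class stream standing in for the real features and targets.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=5, random_state=42)
classes = np.unique(y)  # every label the model will ever see

clf = PassiveAggressiveClassifier(random_state=42)

errors = 0
for i in range(X.shape[0]):
    xi, yi = X[i:i + 1], y[i:i + 1]
    if i > 0:
        # Predict before updating, so each sample is scored while still unseen.
        errors += int(clf.predict(xi)[0] != yi[0])
    # Update the existing model with one sample; classes tells it the full label set.
    clf.partial_fit(xi, yi, classes=classes)

print(errors / (X.shape[0] - 1))  # online error rate over the stream
```

The first sample is skipped when counting errors because the model cannot predict before its first update.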
@Suleka_28 you are very welcome; and since the answer resolved the particular error the question was about, kindly accept it - thanks
– desertnaut
Sep 14 '18 at 7:38
Thank you, I was guessing that might be the problem. But I am trying to compare this with an RNN online training model, so I wanted to compare its performance with that of a machine learning algorithm. That is why I went for this approach. If I feed the whole dataset at once, won't that defeat the whole purpose? I couldn't find any proper material on this online.
– Suleka_28
Sep 14 '18 at 3:34