ValueError: The number of class labels must be greater than one in Passive Aggressive Classifier




I am trying to implement an online classifier using the PassiveAggressiveClassifier in scikit-learn with the 20 newsgroups dataset. I am very new to this, so I am not sure whether I have implemented it properly. That being said, I wrote a small piece of code, but when I execute it I keep getting this error:



Traceback (most recent call last):
  File "/home/suleka/Documents/RNN models/passiveagressive.py", line 100, in <module>
    clf.fit(X, y)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/passive_aggressive.py", line 225, in fit
    coef_init=coef_init, intercept_init=intercept_init)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py", line 444, in _fit
    classes, sample_weight, coef_init, intercept_init)
  File "/home/suleka/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py", line 407, in _partial_fit
    raise ValueError("The number of class labels must be "
ValueError: The number of class labels must be greater than one.



I checked most of the related posts on Stack Overflow, and they suggest that this error means the labels contain only one unique class. So I ran np.unique(labels), and it showed all 20 classes (the 20 newsgroups):


np.unique(labels)


[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]



Can anyone help me out with this error and please let me know if I have implemented it wrong.



My code is shown below:


from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.datasets import make_classification
from string import punctuation
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from collections import Counter
from sklearn.preprocessing import MinMaxScaler, LabelBinarizer
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')



seed = 42
np.random.seed(seed)

def preProcess():

    newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')

    features = vectorizer.fit_transform(newsgroups_data.data)
    labels = newsgroups_data.target

    return features, labels


if __name__ == '__main__':

    features, labels = preProcess()

    X_train, y_train = shuffle(features, labels, random_state=seed)

    clf = PassiveAggressiveClassifier(random_state=seed)

    n, d = X_train.shape

    print(np.unique(labels))

    error = 0
    iteration = 0
    for i in range(n):
        print(iteration)
        X, y = X_train[i:i + 1], y_train[i:i + 1]

        clf.fit(X, y)
        pred = clf.predict(X)

        print(pred)
        print(y)

        if y - pred != 0:
            error += 1
        iteration += iteration


    print(error)
    print(np.divide(error, n, dtype=np.float))



Thank you in advance!




1 Answer



The issue lies in this line:


X, y = X_train[i:i + 1], y_train[i:i + 1]



which is inside your for loop, i.e. after you have called np.unique(labels) and comfortably confirmed that all 20 labels are indeed present...





Looking closely, you will realize that this line results in an X and a y of only one element each (X_train[i] and y_train[i], respectively; in fact, since the error arguably happens in the very first iteration, for i=0, you end up with only X_train[0] and y_train[0]), which of course should not be the case when fitting a model. Hence, the error message correctly points out that you have only one label in your set (because you have only one sample, that is)...





To convince yourself that this is the case indeed, just insert a print(np.unique(y)) before your clf.fit() - it will print only one label.
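A minimal sketch of that check, using a small synthetic dataset from make_classification as a stand-in for the 20 newsgroups features (the dataset, its sizes, and the helper variable names are assumptions for illustration, not the original data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

# Synthetic stand-in for the real features/labels (assumption: 4 classes, not 20).
X_train, y_train = make_classification(
    n_samples=200, n_features=20, n_informative=10,
    n_classes=4, random_state=42,
)

# The full label array contains every class...
print(np.unique(y_train))

# ...but the slice used inside the loop holds exactly one sample, hence one label.
X, y = X_train[0:1], y_train[0:1]
n_labels_in_slice = len(np.unique(y))
print(np.unique(y))

clf = PassiveAggressiveClassifier(random_state=42)
try:
    clf.fit(X, y)  # fit() only sees the labels present in y
    fit_error = None
except ValueError as exc:
    fit_error = str(exc)

# The exact wording of the message may differ between scikit-learn versions.
print(fit_error)
```

The point is only that fit() derives the set of classes from the y it is given, so a one-sample slice always looks like a one-class problem.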





It is quite unclear what exactly you are trying to achieve with your for loop; if you are trying to train your classifier on successive pieces of your dataset, you could try changing the [i:i+1] indices to [i:i+k] for some large enough k, but for a 20-label dataset this is not so simple, as you have to ensure that all 20 labels are present in each call to clf.fit(), otherwise you will end up comparing apples to oranges...





I strongly suggest starting simple: remove the for loop, fit your classifier to the whole of your training set (clf.fit(X_train, y_train)), and check the scikit-learn documentation for the available performance metrics you can use...
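As a rough sketch of that batch approach, here is one way it could look with a held-out test split and accuracy_score as the metric. The synthetic data, the 75/25 split, and the choice of accuracy are assumptions made for the example; with the real data you would pass the TF-IDF features and newsgroup labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

seed = 42

# Synthetic stand-in for (features, labels); assumption: 4 classes instead of 20.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=4, random_state=seed,
)

# Hold out part of the data so the score is not measured on the training samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=seed, stratify=y,
)

clf = PassiveAggressiveClassifier(random_state=seed)
clf.fit(X_train, y_train)  # a single call on the whole training set

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(accuracy)
```

Note that the original loop predicted on the very sample it had just been fitted on, which says little about generalization; scoring on a separate split avoids that.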





EDIT: I just noticed this detail:



I am trying to implement an online classifier



Well, what you are trying to do is certainly not online training (which is a huge topic by itself), as your for loop simply retrains (it tries to, at least) a new classifier from scratch during each iteration.
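For reference, scikit-learn estimators that support incremental updates expose partial_fit() for this, and PassiveAggressiveClassifier is one of them. The following is only a hedged sketch of a predict-then-update loop, not a recommended evaluation protocol: the synthetic data, the one-sample-at-a-time stream, and the error counting are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.utils import shuffle

seed = 42

# Synthetic stand-in for the shuffled (features, labels); assumption: 4 classes.
X_all, y_all = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=4, random_state=seed,
)
X_train, y_train = shuffle(X_all, y_all, random_state=seed)

# partial_fit() cannot infer the classes from one sample, so the full label
# set is passed explicitly on the first call.
all_classes = np.unique(y_train)

clf = PassiveAggressiveClassifier(random_state=seed)
clf.partial_fit(X_train[0:1], y_train[0:1], classes=all_classes)

errors = 0
n = X_train.shape[0]
for i in range(1, n):
    X, y = X_train[i:i + 1], y_train[i:i + 1]
    pred = clf.predict(X)   # predict first, with the model learned so far
    if pred[0] != y[0]:
        errors += 1
    clf.partial_fit(X, y)   # then update the same model with this sample

error_rate = errors / (n - 1)
print(error_rate)
```

Unlike fit(), each partial_fit() call keeps the weights learned so far, so the same classifier is updated across iterations instead of being rebuilt.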





As I already said, start simple; try to firmly grasp the principles of simple batch training first, before moving to the much more advanced topic of online training, which is definitely not a beginner's one...






Thank you, I was guessing that might be the problem. But I am trying to compare this with an RNN online training model, so I wanted to compare its performance with that of a machine learning algorithm. That is why I went for this approach. If I feed the whole dataset at once, will it not defeat the whole purpose? I couldn't find any proper material on this online.

– Suleka_28
Sep 14 '18 at 3:34






@Suleka_28 you are very welcome; and since the answer resolved the particular error the question was about, kindly accept it - thanks

– desertnaut
Sep 14 '18 at 7:38


