LeaveOneOutEncoder in sklearn.pipeline
I am making a pipeline with LeaveOneOutEncoder. Of course I use a toy example. Leave-one-out encoding is for transforming categorical variables.
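(For context, a minimal standalone sketch of what leave-one-out encoding produces, assuming category_encoders is installed; the names demo_X, demo_y and cat are just for illustration:)

import pandas as pd
from category_encoders import LeaveOneOutEncoder

# Hypothetical toy data: one categorical column and a numeric target.
demo_X = pd.DataFrame({'cat': ['a', 'b', 'a', 'b']})
demo_y = pd.Series([1.0, 2.0, 3.0, 4.0])

# Each category value is replaced by the mean of the target over the other
# rows of that category (the current row is "left out" when y is supplied).
enc = LeaveOneOutEncoder(cols=['cat'])
print(enc.fit_transform(demo_X, demo_y))

My actual code: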
import pandas as pd
import numpy as np
from sklearn import preprocessing
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from category_encoders import LeaveOneOutEncoder
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin
df = pd.DataFrame({'y': [1, 2, 3, 4, 5, 6, 7, 8],
                   'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'b': [5, 5, 3, 4, 8, 6, 7, 3]})
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(X), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self
X = df[['a', 'b']]
y = df['y']
regressor = linear_model.SGDRegressor()
pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
            # categorical
            ('categorical', Pipeline([
                ('selector', ItemSelector(key='a')),
                ('one_hot', MyLEncoder())
            ])),
            # year
        ])),
    # Use a regression
    ('model_fitting', linear_model.SGDRegressor()),
])
pipeline.fit(X, y)
pipeline.predict(X)
That all works correctly when I use it on the train and test data! But when I try to predict on new data I get an error:
pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))
Help me find the mistake! It must be something simple, but my eyes are swimming. The problem must be in the MyLEncoder class. What should I change?
2 Answers
You are calling

encc = enc.fit(np.asarray(X), y)

in the transform() method of MyLEncoder.
So there are a couple of problems here:
1) Your LeaveOneOutEncoder only remembers the last data passed to the transform of MyLEncoder and forgets the previous data.
2) During fitting, LeaveOneOutEncoder requires y to be present. But y is not available during prediction, when MyLEncoder's transform() is called.
3) Currently your line

pipeline.predict(X)

works just by luck, because your X is the same, and when MyLEncoder's transform() is called you have already defined y, so it is used. But that's just wrong (see the sketch after this list).
4) An unrelated thing (one may not call this an error). When you do this:

pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))

pipeline.predict() requires only X, not y. But you are sending y as well. Currently that is not a problem, because in the pipeline you only use the a column and throw everything else away, but in more complex setups this may slip through, and the data in the y column could be used as features (X data), which would then give you wrong results.
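A minimal sketch of why re-fitting inside transform() breaks at predict time (the names train_a, y_global and new_a below are hypothetical, and it assumes category_encoders is installed):

import pandas as pd
from category_encoders import LeaveOneOutEncoder

# Stand-ins for the globals in the question: 8 training rows and 8 targets.
train_a = pd.DataFrame({'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})
y_global = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# This mirrors what the original MyLEncoder.transform() does during training:
# it only works because train_a and y_global happen to have the same 8 rows.
LeaveOneOutEncoder().fit(train_a, y_global)

# At predict time the pipeline passes new data with a different number of rows,
# while the global y still has 8 values, so re-fitting inside transform() fails
# (typically with a length-mismatch error).
new_a = pd.DataFrame({'a': ['a', 'b']})
try:
    LeaveOneOutEncoder().fit(new_a, y_global)
except Exception as exc:
    print(exc)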
To solve this, change your MyLEncoder as follows:
class MyLEncoder(BaseEstimator, TransformerMixin):

    # Save the enc during fitting
    def fit(self, X, y, **fit_params):
        enc = LeaveOneOutEncoder()
        self.enc = enc.fit(np.asarray(X), y)
        return self

    # Here, no new learning should be done, so never call fit() inside this.
    # Only use the already saved enc here.
    def transform(self, X, **fit_params):
        enc_data = self.enc.transform(np.asarray(X))
        return enc_data

    # No need to define this function if you are not doing any optimisation in it;
    # it will be automatically inherited from TransformerMixin.
    # I have only kept it here because you kept it.
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
Now when you do this:
pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))
You will not get any error. But still, as said in point 4, I would like you to do something like this:
new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})
new_X = new_df[['a', 'b']]
new_y = new_df['y']
pipeline.predict(new_X)
so that the X used at training time and the new_X used at prediction time look the same.
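Putting the pieces together, a self-contained sketch of the whole flow with the fixed MyLEncoder (same toy data as in the question, assuming category_encoders is installed) could look like this:

import numpy as np
import pandas as pd
from category_encoders import LeaveOneOutEncoder
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline

df = pd.DataFrame({'y': [1, 2, 3, 4, 5, 6, 7, 8],
                   'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'b': [5, 5, 3, 4, 8, 6, 7, 3]})

class ItemSelector(BaseEstimator, TransformerMixin):
    # Picks a single column out of the incoming DataFrame.
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class MyLEncoder(BaseEstimator, TransformerMixin):
    # The encoder is fitted once, here, and only reused in transform().
    def fit(self, X, y, **fit_params):
        self.enc = LeaveOneOutEncoder().fit(np.asarray(X), y)
        return self

    def transform(self, X, **fit_params):
        return self.enc.transform(np.asarray(X))

X = df[['a', 'b']]
y = df['y']

pipeline = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('categorical', Pipeline([
            ('selector', ItemSelector(key='a')),
            ('one_hot', MyLEncoder()),
        ])),
    ])),
    ('model_fitting', linear_model.SGDRegressor()),
])

pipeline.fit(X, y)

# New data: keep only the feature columns, exactly as at training time.
new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})
new_X = new_df[['a', 'b']]
print(pipeline.predict(new_X))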
I've done it as below:
lb = df['a']

class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(lb), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self
So in the line

encc = enc.fit(np.asarray(lb), y)

I changed X to lb.
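If the pipeline from the question is rebuilt with this modified MyLEncoder, predictions on new rows should go through, because transform() always re-fits the encoder on the full training column lb and the training target y. A short usage sketch, reusing X, y and pipeline from above:

# Assuming `pipeline` was rebuilt with the modified MyLEncoder above:
pipeline.fit(X, y)

new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})
print(pipeline.predict(new_df[['a', 'b']]))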
I came up with this solution myself. Am I not right?
– Edward
Sep 19 '18 at 12:18