LeaveOneOutEncoder in sklearn.pipeline



I am building a pipeline with LeaveOneOutEncoder, using a toy example of course. Leave-one-out encoding is for transforming categorical variables.
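As a quick illustration of what leave-one-out encoding computes (a hand-rolled sketch, not the category_encoders implementation): each row's category is replaced by the mean of the target over all *other* rows of the same category.

```python
# Hand-rolled sketch of leave-one-out target encoding (illustration only,
# not the category_encoders implementation): each row's category becomes
# the mean of y over all OTHER rows sharing that category.
a = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
y = [1, 2, 3, 4, 5, 6, 7, 8]

def loo_encode(cats, target):
    encoded = []
    for i, c in enumerate(cats):
        # target values of the other rows with the same category
        others = [t for j, t in enumerate(target) if j != i and cats[j] == c]
        encoded.append(sum(others) / len(others))
    return encoded

loo_encode(a, y)[0]  # 5.0: mean of y over the other 'a' rows (3, 5, 7)
```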


import pandas as pd
import numpy as np
from sklearn import preprocessing
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from category_encoders import LeaveOneOutEncoder
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin

df = pd.DataFrame({'y': [1, 2, 3, 4, 5, 6, 7, 8],
                   'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'b': [5, 5, 3, 4, 8, 6, 7, 3]})

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        return data_dict[self.key]

class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(X), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
    def fit(self, X, y, **fit_params):
        return self

X = df[['a', 'b']]
y = df['y']

regressor = linear_model.SGDRegressor()

pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
            # categorical
            ('categorical', Pipeline([
                ('selector', ItemSelector(key='a')),
                ('one_hot', MyLEncoder())
            ])),
        ])),
    # Use a regression
    ('model_fitting', linear_model.SGDRegressor()),
])

pipeline.fit(X, y)
pipeline.predict(X)



This all works correctly when I use it on train and test data! But when I try to predict on new data, I get an error:


pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))



Help me find the mistake! It must be something simple, but my eyes are swimming. The problem must be in the MyLEncoder class. What should I change?




2 Answers



You are calling

encc = enc.fit(np.asarray(X), y)

in the transform() method of MyLEncoder.



So there are a couple of problems here:



1) Your LeaveOneOutEncoder is re-fitted on whatever data is passed to MyLEncoder's transform, so it only remembers the last batch and forgets everything it saw before.


2) During fitting, LeaveOneOutEncoder requires y to be present. But y is not available during prediction, when MyLEncoder's transform() is called.



3) Currently your line:

pipeline.predict(X)

works only by luck: your X is the same, and when MyLEncoder's transform() is called, the y you defined globally is picked up and used. But that's just wrong.



4) An unrelated point (maybe not an error). When you do this:

pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))

pipeline.predict() requires only X, not y, yet you are passing y as well. It is not a problem right now, because the pipeline uses only the a column and throws the rest away, but in more complex setups this may slip through, and the data in the y column could end up being used as features (X data), giving you wrong results.
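The contract the points above describe can be sketched without category_encoders at all. Here, MeanEncoder is a hypothetical hand-rolled target-mean encoder standing in for LeaveOneOutEncoder (the sklearn base classes are omitted for brevity): everything that needs y happens in fit(), while transform() only reuses the saved state, so it also works at predict time when no y exists.

```python
# Minimal sketch of the fit/transform contract. MeanEncoder is a
# hypothetical stand-in for LeaveOneOutEncoder, not part of any library.
class MeanEncoder:
    def fit(self, X, y):
        # all learning that needs y happens here; the result is stored on self
        groups = {}
        for cat, target in zip(X, y):
            groups.setdefault(cat, []).append(target)
        self.mapping_ = {c: sum(v) / len(v) for c, v in groups.items()}
        return self

    def transform(self, X):
        # no y needed here: only the state learned in fit() is used
        return [[self.mapping_[cat]] for cat in X]

enc = MeanEncoder().fit(['a', 'b', 'a', 'b'], [1, 2, 3, 4])
enc.transform(['b', 'a'])  # [[3.0], [2.0]]
```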



To solve this, change your MyLEncoder as:

class MyLEncoder(BaseEstimator, TransformerMixin):

    # Save the enc during fitting
    def fit(self, X, y, **fit_params):
        enc = LeaveOneOutEncoder()
        self.enc = enc.fit(np.asarray(X), y)
        return self

    # Here, no new learning should be done, so never call fit() inside this.
    # Only use the already saved enc here.
    def transform(self, X, **fit_params):
        enc_data = self.enc.transform(np.asarray(X))
        return enc_data

    # No need to define this method if you are not doing any optimisation in it;
    # it will be inherited automatically from TransformerMixin.
    # I have only kept it here because you kept it.
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)



Now when you do this:


pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))



You will not get any error, but as noted in point 4, I would still encourage you to do something like this:


new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})

new_X = new_df[['a', 'b']]
new_y = new_df['y']

pipeline.predict(new_X)



so that the X used at training time and the new_X used at prediction time look the same.
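One way to keep the two frames aligned (a sketch; feature_cols is just an illustrative name, not something from the question) is to select the training feature columns explicitly for both:

```python
import pandas as pd

# Sketch: build new_X from exactly the training feature columns, so the
# frame passed to predict() has the same shape as the frame used in fit().
feature_cols = ['a', 'b']  # the columns X was built from at training time

df = pd.DataFrame({'y': [1, 2], 'a': ['a', 'b'], 'b': [5, 5]})
new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})

X = df[feature_cols]
new_X = new_df[feature_cols]

# both frames now expose the same columns in the same order
assert list(X.columns) == list(new_X.columns)
```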






I came to my own solution. Am I right?

– Edward
Sep 19 '18 at 12:18



I've done it as below:


lb = df['a']

class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(lb), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self



So I changed X to lb in the line encc = enc.fit(np.asarray(lb), y).


