LeaveOneOutEncoder in sklearn.pipeline
I am making a pipeline with LeaveOneOutEncoder. Of course I use a toy example. Leave-one-out encoding is for transforming categorical variables.
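(For context, a minimal standalone sketch of what leave-one-out encoding produces, assuming category_encoders is installed; the names demo_X, demo_y and cat are just for illustration:)

import pandas as pd
from category_encoders import LeaveOneOutEncoder

# Hypothetical toy data: one categorical column and a numeric target.
demo_X = pd.DataFrame({'cat': ['a', 'b', 'a', 'b']})
demo_y = pd.Series([1.0, 2.0, 3.0, 4.0])

# Each category value is replaced by the mean of the target over the other
# rows of that category (the current row is "left out" when y is supplied).
enc = LeaveOneOutEncoder(cols=['cat'])
print(enc.fit_transform(demo_X, demo_y))

My actual code: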
import pandas as pd
import numpy as np
from sklearn import preprocessing
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from category_encoders import LeaveOneOutEncoder
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin
df = pd.DataFrame({'y': [1, 2, 3, 4, 5, 6, 7, 8],
                   'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'b': [5, 5, 3, 4, 8, 6, 7, 3]})
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(X), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self
X = df[['a', 'b']]
y = df['y']
regressor = linear_model.SGDRegressor()
pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
            # categorical
            ('categorical', Pipeline([
                ('selector', ItemSelector(key='a')),
                ('one_hot', MyLEncoder())
            ])),
            # year
        ])),
    # Use a regression
    ('model_fitting', linear_model.SGDRegressor()),
])
pipeline.fit(X, y)
pipeline.predict(X)
That all works correctly when I use it on the train and test data! But when I try to predict on new data I get an error:
pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))
Help me find the mistake! It must be something simple, but my eyes are swimming. The problem must be in the MyLEncoder class. What should I change?
2 Answers
You are calling

encc = enc.fit(np.asarray(X), y)

in the transform() method of MyLEncoder.
So there are a couple of problems here:
1) Your LeaveOneOutEncoder only remembers the last data passed to the transform of MyLEncoder and forgets the previous data.
2) During fitting, LeaveOneOutEncoder requires y to be present. But y is not available during prediction, when MyLEncoder's transform() is called.
3) Currently your line

pipeline.predict(X)

works just by luck, because your X is the same, and when MyLEncoder's transform() is called you have already defined y, so it is used. But that's just wrong (see the sketch after this list).
4) An unrelated thing (one may not call this an error). When you do this:

pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))

pipeline.predict() requires only X, not y. But you are sending y as well. Currently that is not a problem, because in the pipeline you only use the a column and throw everything else away, but in more complex setups this may slip through, and the data in the y column could be used as features (X data), which would then give you wrong results.
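A minimal sketch of why re-fitting inside transform() breaks at predict time (the names train_a, y_global and new_a below are hypothetical, and it assumes category_encoders is installed):

import pandas as pd
from category_encoders import LeaveOneOutEncoder

# Stand-ins for the globals in the question: 8 training rows and 8 targets.
train_a = pd.DataFrame({'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']})
y_global = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# This mirrors what the original MyLEncoder.transform() does during training:
# it only works because train_a and y_global happen to have the same 8 rows.
LeaveOneOutEncoder().fit(train_a, y_global)

# At predict time the pipeline passes new data with a different number of rows,
# while the global y still has 8 values, so re-fitting inside transform() fails
# (typically with a length-mismatch error).
new_a = pd.DataFrame({'a': ['a', 'b']})
try:
    LeaveOneOutEncoder().fit(new_a, y_global)
except Exception as exc:
    print(exc)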
To solve this, change your MyLEncoder as follows:
class MyLEncoder(BaseEstimator, TransformerMixin):

    # Save the enc during fitting
    def fit(self, X, y, **fit_params):
        enc = LeaveOneOutEncoder()
        self.enc = enc.fit(np.asarray(X), y)
        return self

    # Here, no new learning should be done, so never call fit() inside this.
    # Only use the already saved enc here.
    def transform(self, X, **fit_params):
        enc_data = self.enc.transform(np.asarray(X))
        return enc_data

    # No need to define this function if you are not doing any optimisation in it;
    # it will be automatically inherited from TransformerMixin.
    # I have only kept it here because you kept it.
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
Now when you do this:
pipeline.predict(pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]}))
You will not get any error. But still, as said in point 4, I would like you to do something like this:
new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})
new_X = new_df[['a', 'b']]
new_y = new_df['y']
pipeline.predict(new_X)
so that the X used at training time and the new_X used at prediction time look the same.
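Putting the pieces together, a self-contained sketch of the whole flow with the fixed MyLEncoder (same toy data as in the question, assuming category_encoders is installed) could look like this:

import numpy as np
import pandas as pd
from category_encoders import LeaveOneOutEncoder
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline

df = pd.DataFrame({'y': [1, 2, 3, 4, 5, 6, 7, 8],
                   'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'b': [5, 5, 3, 4, 8, 6, 7, 3]})

class ItemSelector(BaseEstimator, TransformerMixin):
    # Picks a single column out of the incoming DataFrame.
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class MyLEncoder(BaseEstimator, TransformerMixin):
    # The encoder is fitted once, here, and only reused in transform().
    def fit(self, X, y, **fit_params):
        self.enc = LeaveOneOutEncoder().fit(np.asarray(X), y)
        return self

    def transform(self, X, **fit_params):
        return self.enc.transform(np.asarray(X))

X = df[['a', 'b']]
y = df['y']

pipeline = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('categorical', Pipeline([
            ('selector', ItemSelector(key='a')),
            ('one_hot', MyLEncoder()),
        ])),
    ])),
    ('model_fitting', linear_model.SGDRegressor()),
])

pipeline.fit(X, y)

# New data: keep only the feature columns, exactly as at training time.
new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})
new_X = new_df[['a', 'b']]
print(pipeline.predict(new_X))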
I've done it as below:
lb = df['a']

class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(lb), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self
So in the line

encc = enc.fit(np.asarray(lb), y)

I changed X to lb.
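If the pipeline from the question is rebuilt with this modified MyLEncoder, predictions on new rows should go through, because transform() always re-fits the encoder on the full training column lb and the training target y. A short usage sketch, reusing X, y and pipeline from above:

# Assuming `pipeline` was rebuilt with the modified MyLEncoder above:
pipeline.fit(X, y)

new_df = pd.DataFrame({'y': [3, 8], 'a': ['a', 'b'], 'b': [3, 6]})
print(pipeline.predict(new_df[['a', 'b']]))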
I came up with this solution myself. Am I not right?
– Edward
Sep 19 '18 at 12:18