The gen_features function builds one feature definition per column from a list of column names and a list of transformer classes. Here, feature_def1 covers the numeric columns in mean_columns: each is imputed with SimpleImputer (strategy 'mean') and then scaled with StandardScaler. feature_def2 covers the boolean columns in bool_cols, which are only imputed, using SimpleImputer with the strategy set to 'most_frequent'.
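
To make the shape of a feature definition concrete, here is a minimal sketch of a helper that mimics the behavior described above. build_feature_defs is a hypothetical stand-in written for illustration, not the library's implementation, and the column names are placeholders. It assumes a transformer is specified either as a bare class or as a dict holding the class under 'class' plus its keyword arguments.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def build_feature_defs(columns, classes):
    """Hypothetical stand-in: return one (column, transformers) pair per column."""
    feature_defs = []
    for column in columns:
        transformers = []
        for spec in classes:
            if isinstance(spec, dict):
                # A dict carries the class under 'class' plus its keyword arguments.
                params = dict(spec)
                cls = params.pop("class")
                transformers.append(cls(**params))
            else:
                # A bare class is instantiated with its default arguments.
                transformers.append(spec())
        feature_defs.append((column, transformers))
    return feature_defs


# Placeholder column names, each wrapped in a list so transformers get 2-D input.
mean_columns_2d = [["age"], ["income"]]

feature_def1 = build_feature_defs(
    mean_columns_2d,
    [{"class": SimpleImputer, "strategy": "mean"}, StandardScaler],
)
```

Each column gets its own freshly created transformer instances, so the statistics learned for one column never leak into another.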

The line feature_def_all = feature_def1 + feature_def2 concatenates the two lists into a single list of feature definitions.

The DataFrameMapper takes the feature definition list as input and creates a mapper object that can be used to transform the data.

Here, mapper.fit_transform(df_train) applies the transformations specified in feature_def_all to the df_train DataFrame, fitting the transformers and transforming the data. The resulting transformed data is stored in x_train. Similarly, mapper.transform(df_val) applies the same transformations to the df_val DataFrame (without re-fitting the transformers) and stores the transformed data in x_val.
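
The difference between fitting and transforming can be seen in isolation with a single StandardScaler. The sketch below uses made-up numbers: the scaler learns its mean and standard deviation from the training array only, and the validation array is then scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration.
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[10.0], [20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
val_scaled = scaler.transform(val)          # reuses the statistics learned from train

# The validation data is centered on the training mean, not on its own mean.
train_mean = scaler.mean_[0]
```

Because the validation values are far from the training mean, val_scaled does not average to zero. That is expected: re-fitting on the validation data would hide this shift and leak validation statistics into preprocessing.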

The result is that the original DataFrame columns are transformed according to the specified feature definitions, such as imputation and scaling, and the transformed data is stored in x_train and x_val.
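
For comparison, roughly the same preprocessing can be expressed with scikit-learn alone by using ColumnTransformer instead of DataFrameMapper. This is a swapped-in alternative, not the approach described above, and the toy DataFrames and column names are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are placeholders.
df_train = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "is_member": [1.0, 1.0, np.nan],
})
df_val = pd.DataFrame({
    "age": [np.nan, 30.0],
    "is_member": [np.nan, 0.0],
})
mean_columns = ["age"]
bool_cols = ["is_member"]

preprocessor = ColumnTransformer([
    # Numeric columns: mean imputation followed by standardization.
    ("mean", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), mean_columns),
    # Boolean columns: fill gaps with the most frequent value.
    ("bool", SimpleImputer(strategy="most_frequent"), bool_cols),
])

# Fit on the training frame, then apply the fitted transformers to validation.
x_train = preprocessor.fit_transform(df_train).astype("float32")
x_val = preprocessor.transform(df_val).astype("float32")
```

Missing values in df_val are filled with the mean and most frequent value learned from df_train, mirroring the fit-then-transform behavior of the mapper.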

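from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper, gen_features

# Wrap each column name in a list so every transformer receives a 2-D input,
# which SimpleImputer and StandardScaler require.
mean_columns_2d = [[f] for f in mean_columns]
bool_columns_2d = [[f] for f in bool_cols]

# Numeric columns: impute with the column mean, then standardize.
feature_def1 = gen_features(
    columns=mean_columns_2d,
    classes=[{'class': SimpleImputer, 'strategy': 'mean'}, StandardScaler],
)
# Boolean columns: impute with the most frequent value.
feature_def2 = gen_features(
    columns=bool_columns_2d,
    classes=[{'class': SimpleImputer, 'strategy': 'most_frequent'}],
)

feature_def_all = feature_def1 + feature_def2

mapper = DataFrameMapper(feature_def_all)

# Fit on the training data only, then reuse the fitted transformers on validation data.
x_train = mapper.fit_transform(df_train).astype('float32')
x_val = mapper.transform(df_val).astype('float32')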