The `fit()` method in `sklearn` appears to serve different purposes within the same interface.

When applied to the training set, like so:

```python
model.fit(X_train, y_train)
```

`fit()` is used to learn parameters that will later be used on the test set with `predict(X_test)`.

However, there are cases where there is no 'learning' involved in `fit()`, only some normalization to transform the data, like so:

```python
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
```

which will simply scale feature values between, say, 0 and 1, so that features with higher variance do not have a disproportionate influence on the model.
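For instance, on a small toy array (my own, purely for illustration), `fit()` merely records each feature's minimum and maximum, and `transform()` then rescales using those recorded statistics:

```python
import numpy as np
from sklearn import preprocessing

# toy data: first feature spans 0..10, second spans 100..300
X_train = np.array([[0.0, 100.0],
                    [5.0, 200.0],
                    [10.0, 300.0]])

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)  # only records per-feature min and max

print(min_max_scaler.data_min_)   # [  0. 100.]
print(min_max_scaler.data_max_)   # [ 10. 300.]

# the actual rescaling happens here, using the stored statistics
print(min_max_scaler.transform(X_train))  # each feature now spans [0, 1]
```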

To make things even less intuitive, sometimes the `fit()` method that scales (and already appears to be transforming) needs to be followed by a further `transform()` call, before `fit()` is called again on the model that actually learns, like so:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train2 = min_max_scaler.transform(X_train)
X_test2 = min_max_scaler.transform(X_test)

# the model being used
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

# learn parameters
knn.fit(X_train2, y_train)

# predict
y_pred = knn.predict(X_test2)
```

Could someone please clarify the use, or multiple uses, of `fit()`, as well as the difference between scaling and transforming the data?

## Answer

The `fit()` method provides a common interface that is shared among all scikit-learn objects.

This method takes `X` (and sometimes `y`) as arguments to compute the object's statistics. For example, calling `fit` on a `MinMaxScaler` transformer will compute its statistics (`data_min_`, `data_max_`, `data_range_`, …).

Therefore, we should see the `fit()` method as one that computes the necessary statistics of an object.

This common interface is really helpful, as it allows combining transformers and estimators using a `Pipeline`. This lets you fit and predict through all steps in one go, as follows:

```python
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000)
model = make_pipeline(MinMaxScaler(), NearestNeighbors())
model.fit(X, y)
```

This also offers the possibility to serialize the whole model into a single object.
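As a minimal sketch of this (using `joblib` for persistence, and a `KNeighborsClassifier` instead of `NearestNeighbors` so that the reloaded pipeline has a `predict()` method), the fitted scaler and estimator round-trip together as one object:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=100, random_state=0)

# scaler and classifier are fitted together through a single fit() call
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

# the whole pipeline (scaling statistics + learned parameters) is one object
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")

# scaling is applied automatically before predicting
y_pred = restored.predict(X)
```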

Without this composition mechanism, I agree with you that it would not be very practical to work with independent transformers and estimators.