Scikit-learn

Installation

The package is installed under the name scikit-learn, even though it is imported in code as sklearn.
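The difference between the install name and the import name can be checked from Python itself. The snippet below is a minimal sketch that uses only the standard library; the helper name sklearn_status is made up for illustration, and it reports rather than assumes whether the package is present.

```python
from importlib import metadata, util


def sklearn_status():
    """Return (importable, version) for scikit-learn without importing it."""
    # The import name is "sklearn" ...
    importable = util.find_spec("sklearn") is not None
    # ... while the installed distribution is named "scikit-learn".
    try:
        version = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


importable, version = sklearn_status()
print("importable as sklearn:", importable)
print("installed scikit-learn version:", version)
```

If the version comes back as None, the package still needs to be installed under the name scikit-learn.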

Linear regression

Scikit-learn provides SGDRegressor, a linear regression model trained with stochastic gradient descent, which works best when the input features are normalized. StandardScaler performs the z-score normalization we learnt earlier.

# import module
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
# Scale training set
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
# Create and fit model
sgdr = SGDRegressor(max_iter=1000)
sgdr.fit(X_norm, y_train)
# View parameters
b_norm = sgdr.intercept_
w_norm = sgdr.coef_
# Make predictions
y_pred = sgdr.predict(X_norm)
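To make the scaling step concrete, here is a minimal pure-Python sketch of z-score normalization, the computation StandardScaler applies to each feature column. The function name zscore_columns and the toy data are assumptions for illustration. The sketch divides by the population standard deviation and assumes no column is constant.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each column of a list-of-rows table to mean 0 and std 1."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation (divide by n, not n - 1).
    # A constant column would give 0 here and break the division below.
    stds = [pstdev(col) for col in columns]
    return [
        [(value - mu) / sigma for value, mu, sigma in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical toy data: two features on very different scales.
X_toy = [[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]
X_toy_norm = zscore_columns(X_toy)

for row in X_toy_norm:
    print(row)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither feature dominates the gradient descent updates.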

Besides linear regression with gradient descent, scikit-learn also provides LinearRegression, which fits ordinary least squares with a direct solver instead of iterating, arriving at the same closed-form solution that the normal equation describes.

# import module
from sklearn.linear_model import LinearRegression
# Create and fit model
linear_model = LinearRegression()
# X must be 2-D: reshape a 1-D feature array into a single column
linear_model.fit(X_train.reshape(-1, 1), y_train)
# View parameters
b = linear_model.intercept_
w = linear_model.coef_
# Make predictions
y_pred = linear_model.predict(X_train.reshape(-1, 1))
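For a single feature, the closed-form least-squares solution can be written out by hand. The sketch below is a pure-Python illustration of that formula, not scikit-learn's internal implementation. The function name fit_line and the toy data are assumptions.

```python
from statistics import fmean


def fit_line(x, y):
    """Closed-form least-squares fit of y = w * x + b for one feature."""
    x_mean, y_mean = fmean(x), fmean(y)
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    w = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - w * x_mean
    return w, b


# Hypothetical toy data generated from y = 2x + 1.
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [3.0, 5.0, 7.0, 9.0]
w_fit, b_fit = fit_line(x_toy, y_toy)
print(w_fit, b_fit)
```

No learning rate or iteration count is involved, which is the practical difference from the gradient descent model.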

Logistic regression

The logistic regression model in scikit-learn is LogisticRegression.

# import module
from sklearn.linear_model import LogisticRegression
# Create and fit the model
lr_model = LogisticRegression()
lr_model.fit(X, y)
# Make predictions
y_pred = lr_model.predict(X)
# Calculate accuracy
accuracy = lr_model.score(X, y)  # Mean accuracy: fraction of correct predictions, between 0 and 1
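The prediction and accuracy steps can also be spelled out by hand. The following is a minimal sketch under stated assumptions: the parameters w_toy and b_toy are made-up values rather than ones fitted by scikit-learn, and the 0.5 threshold is the usual textbook convention.

```python
import math


def sigmoid(z):
    """Logistic function; fine for the moderate z values used here."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(rows, w, b, threshold=0.5):
    """Label a row 1 when the predicted probability reaches the threshold."""
    labels = []
    for row in rows:
        z = sum(wi * xi for wi, xi in zip(w, row)) + b
        labels.append(1 if sigmoid(z) >= threshold else 0)
    return labels


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (0 to 1)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical parameters and toy data for illustration only.
w_toy, b_toy = [1.0, 1.0], -3.0
X_toy = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y_toy = [0, 0, 0, 1, 1, 0]

y_hat = predict(X_toy, w_toy, b_toy)
acc = accuracy(y_toy, y_hat)
print(y_hat, acc)
```

The accuracy is a fraction rather than a percentage, so multiply it by 100 to report a percentage.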

Datasets and dataset partition

The sklearn.datasets module includes utilities for loading datasets, which are handy for practicing model training. See the sklearn.datasets documentation for more information.

The train_test_split function in sklearn.model_selection splits a dataset into subsets. For example, calling it twice gives a 60/20/20 split into training, cross-validation, and test sets:

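from sklearn.model_selection import train_test_split
X_train, X_, y_train, y_ = train_test_split(X, y, test_size=0.4, random_state=1)  # 60% train, 40% held out
X_cv, X_test, y_cv, y_test = train_test_split(X_, y_, test_size=0.5, random_state=1)  # held-out half each: 20% cv, 20% test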
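To see where the 60/20/20 proportions come from, the two-step split can be imitated with the standard library alone. This is a rough sketch, not a reimplementation of train_test_split: the helper name split is made up, it handles a single list rather than X and y together, and its shuffling order will not match scikit-learn's.

```python
import random


def split(items, test_size, seed):
    """Shuffle a copy of items and split off a `test_size` fraction at the end."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 10 sample indices.
samples = list(range(10))

# First split: 60% training, 40% held out.
train, rest = split(samples, test_size=0.4, seed=1)
# Second split: the held-out part is halved into cross-validation and test sets.
cv, test = split(rest, test_size=0.5, seed=1)

print(len(train), len(cv), len(test))
```

Taking half of the held-out 40% is what gives the cross-validation and test sets 20% of the original data each.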

More information