`RandomizedSearchCV` is a class provided by scikit-learn for hyperparameter tuning and model selection through cross-validation. It's similar to `GridSearchCV`, but instead of exhaustively searching through all possible combinations of hyperparameters, it randomly samples a fixed number of hyperparameter settings from specified distributions.
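The distributions can be anything with an `rvs` sampling method, which in practice usually means `scipy.stats` objects; categorical choices are given as plain lists. As a minimal sketch (the variable names here are illustrative, not part of the scikit-learn API), this is what sampling from such distributions looks like:

from scipy.stats import randint, uniform, loguniform

# Integer-valued hyperparameter: uniform integers in [10, 100)
n_estimators_dist = randint(10, 100)
print(n_estimators_dist.rvs(size=5, random_state=0))

# Continuous hyperparameter: uniform floats in [0.1, 0.9)
subsample_dist = uniform(loc=0.1, scale=0.8)
print(subsample_dist.rvs(size=5, random_state=0))

# Log-uniform spacing is common for regularization strengths
alpha_dist = loguniform(1e-4, 1e-1)
print(alpha_dist.rvs(size=5, random_state=0))

`RandomizedSearchCV` draws its `n_iter` candidate settings from such objects in exactly this way, calling `rvs` once per sampled combination.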
Here's a basic overview of how `RandomizedSearchCV` works:
- Define a parameter grid or a distribution for each hyperparameter you want to tune.
- Specify the number of iterations (random samples) you want to perform.
- Pass the estimator (model), parameter grid/distributions, and number of iterations to `RandomizedSearchCV`.
- `RandomizedSearchCV` performs cross-validation for each randomly sampled combination of hyperparameters and selects the best combination based on the scoring metric.
- After the search is complete, you can access attributes like `best_params_`, `best_score_`, and `best_estimator_` to retrieve information about the best-performing model.
Here's a basic example of how to use `RandomizedSearchCV`:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Define parameter distributions
param_dist = {
    'n_estimators': randint(10, 100),
    'max_depth': randint(1, 10),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': ['sqrt', 'log2', None]  # 'auto' was removed in scikit-learn 1.3
}

# Create a RandomForestClassifier instance
clf = RandomForestClassifier()

# Create RandomizedSearchCV instance
random_search = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=10, cv=5)

# Fit the model (X_train and y_train are assumed to be defined)
random_search.fit(X_train, y_train)

# Get the best parameters
best_params = random_search.best_params_
In this example, `RandomizedSearchCV` is used to search for the best hyperparameters for a `RandomForestClassifier` by randomly sampling from the specified parameter distributions. The `n_iter` parameter controls the number of random combinations to try, and `cv` specifies the number of folds for cross-validation.
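The snippet above assumes `X_train` and `y_train` already exist. For a fully self-contained run, you could stand in a toy dataset; the following sketch assumes nothing beyond scikit-learn itself (the `make_classification` data and the 80/20 split are illustrative choices, not part of the original example), and also shows the `best_score_` and `best_estimator_` attributes mentioned earlier:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Toy data standing in for X_train / y_train (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_dist = {
    'n_estimators': randint(10, 100),
    'max_depth': randint(1, 10),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    random_state=42,  # makes the sampled combinations reproducible
)
random_search.fit(X_train, y_train)

print(random_search.best_score_)        # mean cross-validated score of the best setting
best_model = random_search.best_estimator_  # refit on the full training set (refit=True by default)
print(best_model.score(X_test, y_test))     # held-out accuracy of the tuned model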
Here is another example, this time using recall as the scoring metric:
%%time
import numpy as np
from sklearn import metrics

# Choose the type of classifier.
rf2 = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [150, 200, 250],
    "min_samples_leaf": np.arange(5, 10),
    "max_features": np.arange(0.2, 0.7, 0.1),
    "max_samples": np.arange(0.3, 0.7, 0.1),
    "max_depth": np.arange(3, 6),  # the original np.arange(3, 4, 5) yields only [3]
    "class_weight": ['balanced', 'balanced_subsample'],
    "min_impurity_decrease": [0.001, 0.002, 0.003],
}

# Type of scoring used to compare parameter combinations (recall, not accuracy)
recall_scorer = metrics.make_scorer(metrics.recall_score)

# Run the random search
# Using n_iter=30, so randomized search will try 30 different combinations
# of hyperparameters (by default, n_iter=10)
grid_obj = RandomizedSearchCV(rf2, parameters, n_iter=30, scoring=recall_scorer,
                              cv=5, random_state=1, n_jobs=-1, verbose=2)
grid_obj.fit(X_train, y_train)

# Print the best combination of parameters
print(grid_obj.best_params_)
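Because `refit=True` is the default, `grid_obj` ends up refit on the whole training set with the best parameters, so it can be used directly for prediction. A short follow-up sketch, assuming held-out `X_test` and `y_test` exist alongside the training data (they are not defined in the example above):

# grid_obj predicts like any fitted estimator after the search
y_pred = grid_obj.predict(X_test)

# Evaluate with the same metric that drove the search
print("test recall:", metrics.recall_score(y_test, y_pred))

# Best mean cross-validated recall found during the search
print("cv recall:", grid_obj.best_score_)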