Using GridSearchCV

  • Writer: Andrew Cole
  • May 21, 2020
  • 3 min read

Learning a reliable tool to reduce ambiguity in machine learning

As someone who is still at the beginning of their Data Science career, there is one element of working with data that is constantly hanging over my head regardless of the type of model that I am building, the size of the dataset I am using, or the deliverables which I am required to complete: ambiguity. Data, in actuality, is nothing more than just numbers and letters stored in a confusing matrix of files, folders, repositories, and more. Where we go and what we do with that data is entirely unscripted (literally..ha..ha). However, that is also the beauty of Data. We can do whatever we want with it to help make sense of it! As my skills have progressed and SOME sense of comfort has set in, I have realized that there is absolutely always a solution or route to help mitigate the confusion of data.

Classification Modeling

Classification Machine Learning is a very powerful type of algorithm in Data Science and one which I have spent countless hours scratching my head over. To refresh, Classification is a machine learning technique which allows for us to intake large amounts of data and data features and output some sort of categorization. For example, if we had a dataset giving all weather features for the past week (temperature, humidity levels, UV index, etc.) we could use a classification model to predict whether or not we would have a rainy day tomorrow or a sunny day.
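To make the weather example concrete, here is a minimal sketch of a classifier predicting rainy vs. sunny. The numbers and feature choices are hypothetical, not real weather data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical weather features: [temperature, humidity, UV index]
X = np.array([
    [18, 90, 2],   # cool, humid, low UV
    [30, 40, 8],   # hot, dry, high UV
    [20, 85, 3],
    [28, 35, 9],
])
y = np.array(['rainy', 'sunny', 'rainy', 'sunny'])

clf = LogisticRegression()
clf.fit(X, y)

# Classify tomorrow given its forecast features
print(clf.predict([[19, 88, 2]]))
```

Any classification model (decision trees, logistic regression, boosting) follows this same fit-then-predict pattern; the differences lie in how each one learns from the features.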

There are many types of classification models, from logistic regression to boosting methods, but every one of them carries its own ambiguity because they are not one-size-fits-all. Each has elements, parameters and hyperparameters, which need to be tuned to fit the needs of your data. Fortunately, as we said before, there are always ways to mitigate the ambiguity! Friends and strangers, meet the Grid Search.

Photo by Armand Khoury on Unsplash


Parameters Vs. Hyperparameters

Before we actually get into the Grid Search, it is very important that we understand the distinction between a parameter and a hyperparameter. The two are quite often used interchangeably, but in reality they are not the same. A hyperparameter is a setting the model cannot learn from the data directly: it is a user input, chosen before training, which shapes how the model learns and therefore how it performs. A parameter, by contrast, is a value the model learns on its own from the data during the training stage (the split thresholds of a decision tree, or the coefficients of a logistic regression, for example). In short: a hyperparameter is set by us; a parameter is learned by the model.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative feature matrix and target (hypothetical values)
X = pd.DataFrame({'Feat1': [1, 2, 3, 4], 'Feat2': [9, 8, 7, 6]})
y = pd.Series([0, 1, 0, 1], name='target')

dt = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=6,
    min_samples_split=10,
    random_state=33
)
dt.fit(X, y)

In the above code, we instantiate a Decision Tree classifier and pass several arguments into it. criterion, max_depth, and min_samples_split are all hyperparameters: they change the way the model learns, and none of them can be learned from the training data itself.

Using the Grid Search

Hyperparameters, being the inputs to our model, are now extremely important. Every model exposes multiple hyperparameters, and each hyperparameter has many possible values. If we allow the model above too much depth, or too many nodes, it will overfit; if we don't allow enough, it will underfit. And how does max_depth affect accuracy when we increase min_samples_split? Hyperparameters are not independent of each other. The only way to really know is to try out combinations of all of them! A combinatorial grid search is the best way to navigate these questions and find the best combination of hyperparameters for our model and its data.

GridSearchCV comes from the sklearn library (sklearn.model_selection) and gives us the ability to grid search our hyperparameters. It works by combining K-Fold Cross-Validation with a grid of candidate hyperparameter values for a model: every combination in the grid is fit and scored across the folds. We structure the grid as a dictionary (keys = hyperparameter names, values = lists of candidate values) and pass it in along with our estimator object.
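Putting it all together, here is a minimal sketch of grid searching the Decision Tree from above. The synthetic data and the candidate values in the grid are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real feature matrix and target
X, y = make_classification(n_samples=200, n_features=5, random_state=33)

# The grid: keys are hyperparameter names, values are candidates to try
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 6, 9],
    'min_samples_split': [2, 10],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=33),
    param_grid=param_grid,
    cv=5,                  # 5-fold cross-validation per combination
    scoring='accuracy'
)
grid.fit(X, y)

print(grid.best_params_)   # the best-performing combination
print(grid.best_score_)    # its mean cross-validated accuracy
```

Note that the search fits one model per combination per fold (here, 2 x 3 x 2 x 5 = 60 fits), so grids grow expensive quickly; after fitting, `grid.best_estimator_` is a ready-to-use model refit on the full data with the winning combination.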
