Feature Engineering: Combination & Polynomial Features
- Andrew Cole
- May 1, 2020
- 6 min read
Improving a Linear Regression
Last week, I published a blog which walked through every step of the linear regression modeling process. In this post, we will manipulate the data slightly in order to weaken our model's result metrics, and then walk through the most critical step in any linear regression: Feature Engineering. All code can be found in this notebook. This blog begins halfway through the notebook, at the “Feature Engineering” heading.

Feature Engineering is the process of taking variables (features) from our dataset and transforming them for use in a predictive model. Essentially, we will be manipulating single variables and combinations of variables in order to engineer new features. By creating these new features, we increase the likelihood that one of the new variables has more predictive power over our outcome variable than the original, un-transformed variables.
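As a quick illustration of the idea (using made-up numbers, not our carseat data), "engineering" a new feature can be as simple as deriving a new column from existing ones:
import pandas as pd
# Toy data, purely illustrative
df = pd.DataFrame({'Price': [120, 95, 131], 'Income': [73, 48, 35]})
# A new candidate feature built from two existing columns
df['price_x_income'] = df['Price'] * df['Income']
print(df)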
Original Model Recap
To recap, we set out with the goal of trying to predict carseat sales given generated data with several features. We pre-processed our data to account for nominal and categorical data, scaled and normalized our data, removed outliers, built our model, and verified our linear regression assumptions. Our features looked as such:
Sales (target): unit sales at each location
CompPrice: price charged by the nearest competitor at each location
Income: community income level
Advertising: local advertising budget for the company at each location
Population: population size in the region (in thousands)
Price: price charged for a car seat at each site
ShelveLoc: quality of shelving location at site (Good | Bad | Medium)
Age: average age of the local population
Education: education level at each location
Urban: whether the store is in an urban or rural location
US: whether the store is in the US or not
The approach that we took allowed us to piece together a well-performing model that returned an r-squared of 0.88 and a root mean squared error of 0.96. Both of these metrics indicate a very strong model, but we must keep in mind that this data was generated solely for the purpose of the exercise and is not real-world data. It would be nice, and would save a ton of headaches, but real-world data very rarely produces such a strong result on the first iteration through the model.
A Small Step Back
Before we get into our feature engineering, let’s first take a look at our previous regression data. Because we had such a strong-performing regression, for the sake of this blog’s purpose, we want to remove one of the most influential variables. Covariance measures how two variables vary with respect to each other (unlike correlation, it is not bounded to a fixed scale such as [-1, 1]). If the covariance between variable X and variable Y is large in magnitude, the two variables move together strongly. The code block below shows the covariance among all variables in our model training set:
X_train_prep.cov()
This simple method outputs a DataFrame of pairwise covariances, which is extremely informative but also a lot to digest. We read a variable on the y-axis and find its corresponding covariance scores underneath each of the x-axis variables. What we are trying to identify is which variable has the strongest relationships with the other variables, so that we can remove it from this model. To reiterate, we are only identifying and removing a significant variable in order to reduce our model’s performance for the sake of demonstrating feature engineering. In a real-world model, doing this without justification from domain knowledge would be absolutely ridiculous.
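If you would rather not eyeball the full matrix, one way (a sketch, assuming pandas 1.1+ for the key argument to sort_values) to surface the largest-magnitude pairwise covariances is:
import numpy as np
cov = X_train_prep.cov()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(cov.shape, dtype=bool), k=1)
pairs = cov.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs.head())  # the strongest pairwise relationships by absolute covariance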
We can make the covariance output a bit more digestible by visualizing it in a seaborn heatmap.
sns.heatmap(X_train_prep.cov())
The heatmap shows the magnitude of each covariance via color intensity: the lighter the color, the higher the covariance between the two variables (per the scale on the right hand side of the map). We see that ‘CompPrice’ has a high variance (its covariance with itself) and a strong covariance with ‘Price’. This makes intuitive sense: a change in a competitor’s price would pressure the original business to change its own price in response. Again, this is where domain knowledge is so important. Because ‘CompPrice’ and ‘Price’ are two very similar predictors, and ‘CompPrice’ is not native to the business selling the car seats, we will remove ‘CompPrice’ to worsen our model.
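As an aside, covariance depends on each variable's scale, so a correlation heatmap (bounded between -1 and 1) can be easier to read at a glance; one possible variant:
import seaborn as sns
import matplotlib.pyplot as plt
# Same idea, but with correlations, value annotations, and a diverging palette centered at 0
sns.heatmap(X_train_prep.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.show()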
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Drop the strong competing predictor and refit on the remaining features
X_train_prep.drop('CompPrice', axis=1, inplace=True)
lr2 = LinearRegression()
lr2.fit(X_train_prep, y_train)
y_hat_train = lr2.predict(X_train_prep)
r2_score(y_train, y_hat_train)
R-Squared Score = 0.7343
We see that simply removing one strong predictor drops our R-squared to a significantly worse 0.73. In the real world, this would actually be a somewhat promising score after an initial iteration of the model. However, we would not be satisfied with it, and would move on to feature engineering to try to improve the model.
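Since the original model was also judged on root mean squared error, a quick sketch (reusing y_hat_train from the block above) lets us compare that metric as well:
import numpy as np
from sklearn.metrics import mean_squared_error
# RMSE of the weakened model on the training set
rmse = np.sqrt(mean_squared_error(y_train, y_hat_train))
print(round(rmse, 4))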
Feature Engineering — Bivariate Combinations
Now that we have dropped ‘CompPrice’ from our DataFrame, we are left with the remaining features (already scaled and normalized).
During feature engineering, we want to try a wide variety of interactions between variables in order to create new ones: for example, multiplying price by income, or advertising by urban-specific population. By combining variables, we create opportunities for new features which could have more influence on our target variable than either variable alone, thus ‘engineering’ our features. Here, we will create every bivariate combination of our predictor variables using the ‘combinations’ function from the itertools library.
from itertools import combinations
# Every unique pair of feature names (order does not matter)
column_list = X_train_prep.columns
interactions = list(combinations(column_list, 2))
For now, we only need the column names in order to generate all of the possible pairings. The result is a list of 45 possible bivariate combinations of our features.
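As a quick sanity check, 45 is exactly "n choose 2" for the number of remaining feature columns (a sketch, assuming Python 3.8+ for math.comb):
from math import comb
print(len(interactions))          # 45
print(comb(len(column_list), 2))  # also 45, since C(10, 2) = 45 for 10 columns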
Testing every single combination in its own hand-written regression would be incredibly tedious and time consuming. Instead, we will fit a regression for each combination inside a loop and store each result in a dictionary, keyed by its r-squared score:
interaction_dict = {}

for interaction in interactions:
    # Work on a copy so the original training set is not modified
    X_train_int = X_train_prep.copy()
    # New feature: the product of the two columns in this pair
    X_train_int['int'] = X_train_int[interaction[0]] * X_train_int[interaction[1]]
    lr3 = LinearRegression()
    lr3.fit(X_train_int, y_train)
    # Key: training r-squared; value: the pair of columns that produced it
    interaction_dict[lr3.score(X_train_int, y_train)] = interaction
In the code block above, we copy the training set, establish a new feature, ‘int’, by multiplying the two features in each pair together, and then fit a new LinearRegression on that DataFrame, which now contains all of the original features plus the interaction term. Each model’s r-squared score becomes a dictionary key, with the interaction pair as its value.
We now actually have all of our new regressions completed, thanks to the iteration, but we can’t see them because they are all stored in the ‘interaction_dict’. Let’s sort the dictionary to return the five best performing (r-squared score) combinations.
top_5 = sorted(interaction_dict.keys(), reverse=True)[:5]

for interaction in top_5:
    print(interaction_dict[interaction])
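To see each score next to the pair that produced it, a small variant of the same loop works:
for score in top_5:
    print(interaction_dict[score], round(score, 4))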
Our top performing combinations are:
Advertising x Education
Advertising x US
US x Advertising
Price x Age
Good x Age
At this point, domain knowledge kicks it into an even higher gear. The variables returned here should start to move some cranks and gears in your brain as to why certain variables are more impactful. Get creative with it, we are engineering after all. For the sake of this blog, let’s say that our domain knowledge already assumes any categorical variable to be irrelevant because location doesn’t matter. We will only be looking at ‘Advertising x Education’ and ‘Price x Age’. Let’s run a new regression including both these combinations. First, we add the features to our original DataFrame that contained the univariate data.
X_train_int = X_train_prep.copy()
# Add both chosen interaction terms ('price_age' is an illustrative column name)
X_train_int['ad_ed'] = X_train_int['Advertising'] * X_train_int['Education']
X_train_int['price_age'] = X_train_int['Price'] * X_train_int['Age']

Note the two features on the far right
Now we run another regression with our new features in the DataFrame.
lr4 = LinearRegression()
lr4.fit(X_train_int, y_train)
lr4.score(X_train_int, y_train)
R-Squared Score = 0.74
We can see that our regression score increased from 0.73 to 0.74. This is a minimal increase, and not a hugely significant one, but it shows how creating new bivariate feature terms can help improve a model.
Feature Engineering — Polynomials
Again, there are no clear-cut guidelines for which variable manipulations to test. We have just seen how to make two variables interact, but what is another way we can engineer new features? Polynomials! Another often strong option for new features is raising a single variable to a higher power. For our purposes, we will test whether any of the existing variables, including our bivariate combinations, can improve our regression when raised to a higher power.
scikit-learn’s sklearn.preprocessing module offers a ‘PolynomialFeatures’ transformer for exactly this kind of expansion, but since we want to test one new feature at a time, we will raise each column to a power manually. We create a new, empty dictionary to store our new feature possibilities (same as with the bivariates), then iterate through our X_train_int feature set and create a new feature for each respective column raised to the second through fourth power. We then fit a linear regression with each new individual feature and choose the best performing feature for use in our final regression.
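For reference, the transformer route would look roughly like the sketch below; it expands every column at once (squares plus pairwise products at degree 2), which is why the one-feature-at-a-time loop that follows suits our purpose better:
from sklearn.preprocessing import PolynomialFeatures
# Expand every column into polynomial and interaction terms in one shot
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly_full = poly.fit_transform(X_train_int)
print(X_train_poly_full.shape)  # many more columns than the original feature set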
from sklearn.preprocessing import PolynomialFeatures  # transformer alternative; the loop below raises powers directly

poly_dict = {}

for feature in X_train_int.columns:
    for p in range(2, 5):
        # Work on a copy so the power column does not accumulate in X_train_int
        X_train_poly = X_train_int.copy()
        X_train_poly['sq'] = X_train_poly[feature] ** p
        lr = LinearRegression()
        lr.fit(X_train_poly, y_train)
        # Key: training r-squared; value: the feature and power that produced it
        poly_dict[lr.score(X_train_poly, y_train)] = [feature, p]

# The best performing single polynomial feature
poly_dict[max(poly_dict.keys())]
R-Squared Score = 0.743
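If we wanted to carry the winning term into a final model, we could unpack it from the dictionary like this (a small follow-up sketch):
best_score = max(poly_dict.keys())
best_feature, best_power = poly_dict[best_score]
print(best_feature, best_power, round(best_score, 4))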
Once again, we see a tiny yet positive increase in our R-squared metric. There are numerous other methods for engineering new features and improving a regression, but these are a great starting point for any new data scientist!