Importance of Scaling Your Data, with an Example: Log Transformations

Harshit Sati
Jun 7, 2021

Fun fact: algorithms built on Euclidean distance, such as k-nearest neighbours and k-means, generally work better on scaled data.
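To make the fun fact concrete, here is a minimal sketch on made-up numbers (the three points and the `euclidean` helper are purely illustrative, not part of any dataset used in this post). It shows how a feature with a large range can swamp a Euclidean distance until the columns are standardised.

```python
import numpy as np

# Made-up toy data: column 0 is a review count, column 1 is a star rating.
points = np.array([
    [1500.0, 2.0],  # a: many reviews, low rating
    [1400.0, 5.0],  # b: many reviews, high rating
    [10.0, 2.0],    # c: few reviews, same rating as a
])


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On the raw features, the review count dominates the distance.
raw_ab = euclidean(points[0], points[1])
raw_ac = euclidean(points[0], points[2])

# Standardise each column to zero mean and unit variance.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"raw:    a-b = {raw_ab:.2f}, a-c = {raw_ac:.2f}")
print(f"scaled: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

On the raw data the three-star gap between a and b barely registers next to the difference in review counts. After standardising, both features contribute on a comparable scale.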


Yelp Dataset

The Yelp dataset released for the academic challenge contains information on 11,537 businesses, along with 8,282 check-in sets, 43,873 users, and 229,907 reviews for these businesses.

For simplicity, we will use only the column “review_count” to predict the number of stars awarded to a business.

import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Each line of the file is one JSON object describing a business.
with open(r'Datasets\yelp_academic_dataset_business.json', encoding='utf-8') as biz_file:
    biz_df = pd.DataFrame([json.loads(line) for line in biz_file])
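As an aside, pandas can also parse this one-JSON-object-per-line layout directly. The sketch below uses two made-up records in an in-memory buffer instead of the real file, so the field values are only placeholders.

```python
import io

import pandas as pd

# Two made-up records in the same one-JSON-object-per-line layout.
sample = io.StringIO(
    '{"business_id": "a1", "review_count": 3, "stars": 4.5}\n'
    '{"business_id": "b2", "review_count": 1200, "stars": 3.0}\n'
)

# lines=True tells pandas to treat every line as a separate JSON object.
sample_df = pd.read_json(sample, lines=True)
print(sample_df[["review_count", "stars"]])
```

Passing the path of the real file in place of the buffer should work the same way.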

Unscaled Feature

Here is how the untouched feature is distributed:

fig, ax = plt.subplots(2, 1)
fig.set_figheight(15)
fig.set_figwidth(15)
# Top panel: histogram of the raw review counts.
biz_df['review_count'].hist(ax=ax[0], bins=100)
ax[0].tick_params(labelsize=14)
ax[0].set_xlabel("review counts", fontsize=14)
ax[0].set_ylabel("occurrence", fontsize=14)

A few businesses have thousands of reviews, far more than a typical small business, yet a small business can still have a higher star rating than they do.

Scaled Feature

We add 1 to the review count before taking the log, because log(0) is undefined (NumPy returns -inf for it):

# Continuing the code above: the bottom panel shows the log-transformed counts.
biz_df["log_rc"] = np.log(biz_df["review_count"] + 1)
biz_df['log_rc'].hist(ax=ax[1], bins=100)
ax[1].tick_params(labelsize=14)
ax[1].set_xlabel("log of review counts", fontsize=14)
ax[1].set_ylabel("occurrence", fontsize=14)

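After the log transformation the data is clearly better distributed: the long right tail is compressed and the values are spread more evenly across the range.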
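Here is a small sketch of both points, the need for the +1 and the effect on skew. The counts are made up for illustration, and `np.log1p` is NumPy's built-in for log(1 + x), shown as an equivalent alternative to the expression used above.

```python
import numpy as np
import pandas as pd

# Made-up review counts with one very large value.
counts = pd.Series([0, 1, 3, 8, 25, 120, 4000])

# Without the +1, a count of 0 becomes -inf (the warning is silenced here).
with np.errstate(divide="ignore"):
    plain_log = np.log(counts)

shifted_log = np.log(counts + 1)  # same form as in the article
log1p = np.log1p(counts)          # equivalent built-in: log(1 + x)

print("log(0) gives:", plain_log.iloc[0])
print("skew before:", counts.skew())
print("skew after: ", shifted_log.skew())
```

The skewness drops after the transform, which is the numeric counterpart of what the two histograms show.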

Online News Popularity Dataset

In this well-known dataset, we’ll take the number of words in an article (“n_tokens_content”) as the only feature of a linear regression model that predicts the number of shares the article received.
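The step that builds the log-transformed column isn't shown in this post. A minimal sketch, assuming it mirrors the Yelp example (natural log of the word count plus one), could look like this. `toy_df` and its values are made up, and the leading space in the column names copies how the columns are referenced in this post.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset (hypothetical values).
toy_df = pd.DataFrame({
    " n_tokens_content": [0, 219, 531, 1072, 8474],
    " shares": [593, 711, 1500, 1200, 505],
})

# Assumption: log_tc is the log of the word count plus one,
# so that articles with zero words do not produce -inf.
toy_df["log_tc"] = np.log(toy_df[" n_tokens_content"] + 1)
print(toy_df)
```

With the real DataFrame, the same line applied to `df` would create the `log_tc` column. For a linear regression the base of the logarithm does not change the R squared score, since switching base only rescales the feature by a constant.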

Model Accuracy

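from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# df holds the Online News Popularity data; its column names start with a space.
# log_tc is the log-transformed " n_tokens_content" column.
model = LinearRegression()
test_score = cross_val_score(model, df[[" n_tokens_content"]], df[" shares"], cv=10)
print(f"R squared score for token content is {test_score.mean():.5f} +/- {test_score.std():.5f}")
test_score = cross_val_score(model, df[["log_tc"]], df[" shares"], cv=10)
print(f"R squared score for log(token content) is {test_score.mean():.5f} +/- {test_score.std():.5f}")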

Conclusion

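Hence we notice that the R squared score of the model improves when the feature column is log-transformed.

The score is still negative, as the number of words written is surely not a good measure of how many shares an article might’ve had. A negative R squared means the model does worse than simply predicting the mean number of shares every time.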
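If a negative R squared looks odd, this short sketch may help. `cross_val_score` reports R squared here because that is the default score of scikit-learn regressors, and the hand-written `r_squared` helper below (a hypothetical name) follows the usual definition, 1 minus the residual sum of squares over the total sum of squares. The targets and predictions are made up.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R squared = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # made-up targets

perfect = y.copy()                                # predicts every value exactly
mean_baseline = np.full_like(y, y.mean())         # always predicts the mean
bad_model = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # worse than the mean

print(r_squared(y, perfect))        # 1.0
print(r_squared(y, mean_baseline))  # 0.0
print(r_squared(y, bad_model))      # -3.0
```

So a score just below zero, as in this post, says the single feature predicts shares about as well as the mean does, and slightly worse on held-out folds.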

Don’t forget to 👏. It would encourage me to write more! :)

GitHub

https://github.com/HarshitSati/Feature_Engineering
