Importance of Scaling Your Data, with an Example: Log Transformations
Fun fact: algorithms built on Euclidean distance (k-NN, k-means, and the like) all benefit from scaled data.
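To see why, here is a quick sketch with made-up numbers: when one feature lives on a much larger scale than the others, it dominates the Euclidean distance almost entirely.

import numpy as np

# Two hypothetical businesses: similar star ratings, wildly different review counts
a = np.array([4.5, 12])    # [stars, review_count]
b = np.array([4.0, 3500])

# The distance is driven almost entirely by the unscaled review_count;
# the 0.5-star difference contributes essentially nothing
print(np.linalg.norm(a - b))  # ~3488.0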
Yelp Dataset
The Yelp dataset released for the academic challenge contains information on 11,537 businesses, along with 8,282 check-in sets, 43,873 users, and 229,907 reviews for these businesses.
In this case, for simplicity, we will use the column “review_count” to predict the number of stars awarded to a business.
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns

# Read the line-delimited JSON file, one business record per line
biz_file = open('Datasets/yelp_academic_dataset_business.json', encoding="utf-8")
biz_df = pd.DataFrame([json.loads(x) for x in biz_file.readlines()])
biz_file.close()
Unscaled Feature
First, let's look at the distribution of the untouched feature before feeding it into a model:
fig, ax = plt.subplots(2, 1)
fig.set_figheight(15)
fig.set_figwidth(15)

# Histogram of the raw review counts
biz_df['review_count'].hist(ax=ax[0], bins=100)
ax[0].tick_params(labelsize=14)
ax[0].set_xlabel("review counts", fontsize=14)
ax[0].set_ylabel("occurrence", fontsize=14)
The distribution is heavily skewed: a handful of businesses have thousands of reviews while most have only a few, and a small business can still earn a higher star rating than a heavily reviewed one.
Scaled Feature
We add one before taking the log so that the function does not blow up when a business has 0 reviews (log(0) is undefined); NumPy's np.log1p does exactly this in one step.
biz_df["log_rc"] = np.log(biz_df["review_count"] +1)#continuing the above code
biz_df['log_rc'].hist(ax= ax[1], bins = 100)
ax[1].tick_params(labelsize = 14)
ax[1].set_xlabel("log of review", fontsize = 14)
ax[1].set_ylabel("occurance", fontsize = 14)
The log-transformed feature is clearly much better distributed: less concentrated at the low end, with far fewer extreme outliers.
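To see the effect on a model, here is a minimal sketch (assuming scikit-learn) that fits a linear regression predicting stars from the raw and the log-scaled review counts, comparing cross-validated R² scores:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
# 10-fold cross-validated R² with the raw and the log-scaled feature
raw_score = cross_val_score(model, biz_df[["review_count"]], biz_df["stars"], cv=10)
log_score = cross_val_score(model, biz_df[["log_rc"]], biz_df["stars"], cv=10)
print(f"R squared (raw): {raw_score.mean():.5f} +/- {raw_score.std():.5f}")
print(f"R squared (log): {log_score.mean():.5f} +/- {log_score.std():.5f}")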
Online News Popularity Dataset
In this famous dataset, we'll look at the number of words in an article (“n_tokens_content”) and select it as our only feature for a linear regression model that predicts the number of shares the article received.
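Assuming the dataset's CSV sits at Datasets/OnlineNewsPopularity.csv (adjust the path to wherever you downloaded it), we can load it and add the log-transformed column:

# Column names in this CSV carry a leading space, hence " n_tokens_content"
df = pd.read_csv('Datasets/OnlineNewsPopularity.csv')
df["log_tc"] = np.log(df[" n_tokens_content"] + 1)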
Model Accuracy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
test_score = cross_val_score(model, df[[" n_tokens_content"]], df[" shares"], cv=10)
print(f"R squared score for n_tokens_content is {test_score.mean():.5f} +/- {test_score.std():.5f}")
test_score = cross_val_score(model, df[["log_tc"]], df[" shares"], cv=10)
print(f"R squared score for log(n_tokens_content) is {test_score.mean():.5f} +/- {test_score.std():.5f}")
Conclusion
Hence we notice that the model's accuracy increased with a log transformation of the feature column.
Both R² scores are negative, which is expected: the number of words alone is simply not a good predictor of how many shares an article gets. The point is that the log-transformed feature performs noticeably less badly.
Don’t forget to 👏, it would encourage me to write more! :)
Github