Simple Machine Learning project

Saulo Toledo Pereira
Published in Geek Culture
Jul 17, 2021


Github
https://github.com/saulotp/project_how-can-we-increase-revenue

Project_name: How can we increase our revenue?

Objective: In this project, a promising start-up asks us for help in increasing the company's revenue. We, data science enthusiasts, accepted the challenge, seeing a great opportunity to increase our knowledge and test our skills in data analysis.

What_we_have: The company sent us a single dataset containing data about where it invests its capital and how much comes back in the form of sales.

Step by step

First, we have to import the Python library that will help us open our file as a DataFrame.

import pandas as pd

Now we can open the CSV file to see how the data are structured:

main_df = pd.read_csv('/content/drive/MyDrive/DataScience/Projetos/How can we increase our revenue?/advertising.csv')
main_df

Our DataFrame has 200 rows and 4 columns. Let’s see if we have any null data:

main_df.info()

Well, every column has 200 non-null values across the 200 rows, so nothing is missing. The data type is float for the whole DataFrame.
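The same null check can also be done by counting missing cells directly. The snippet below is a minimal sketch on a small made-up table (toy_df and its values are hypothetical, not the company's data), using the pandas isnull() method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the advertising file,
# with one missing TV value added on purpose.
toy_df = pd.DataFrame({
    "TV": [230.1, 44.5, np.nan, 151.5],
    "Radio": [37.8, 39.3, 45.9, 41.3],
    "Sales": [22.1, 10.4, 9.3, 18.5],
})

# Count the missing values in each column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# True only when no cell in the whole table is missing.
has_no_missing = not toy_df.isnull().values.any()
print(has_no_missing)
```

On a table like ours, where info() reports 200 non-null values in every column, each count would be zero.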

With the describe() command we can see summary statistics for each column, such as the count, mean, standard deviation, and min/max values.

main_df.describe()

Now we can plot some data to get a better view. For this, we can import the seaborn library.
PS. The ‘warnings’ library is only used to silence seaborn's plot warnings; it is not required to carry out the analysis.

import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sns.pairplot(main_df)

The pairplot chart shows us some interesting data about TV and Sales. There is a possible correlation between them, because we can see that as TV investment increases, sales increase too. For the other variables, there doesn't seem to be a clear pattern. Let’s plot another type of chart to try to extract more information.

sns.heatmap(main_df.corr(), cmap='Wistia', annot=True)

The image above is a heatmap of the correlation coefficients: the closer a value is to 1, the stronger the positive relationship between the two variables. Once again, TV and Sales appear to be correlated.
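To read the numbers behind a heatmap like this one, it can help to look only at the column of interest. The sketch below uses a tiny made-up table (toy_df is hypothetical, and its Sales column is built as an exact linear function of TV on purpose) and ranks the other columns by their correlation with Sales:

```python
import pandas as pd

# Hypothetical toy data: Sales is exactly 3 + 0.2 * TV,
# while Newspaper is only loosely related to Sales.
toy_df = pd.DataFrame({
    "TV": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Newspaper": [30.0, 10.0, 50.0, 20.0, 40.0],
    "Sales": [5.0, 7.0, 9.0, 11.0, 13.0],
})

# corr() returns the Pearson correlation matrix; keep only the
# 'Sales' column and drop its trivial correlation with itself.
corr_with_sales = (
    toy_df.corr()["Sales"].drop("Sales").sort_values(ascending=False)
)
print(corr_with_sales)
```

A perfectly linear relationship gives a coefficient of 1, which is why TV sits at the top in this toy case. Keep in mind that the coefficient only measures linear association; it does not prove that one variable causes the other.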

With this data, we could already answer the question by showing that more investment in TV can increase the company's revenue. But let’s play a little with machine learning and see what we can do.

First, we will select the data the algorithm will be trained on.

# importing the function that splits the data
from sklearn.model_selection import train_test_split
# selecting the feature columns (x) and the target column (y);
# 'Vendas' is the sales column
x = main_df.drop('Vendas', axis=1)
y = main_df['Vendas']
# splitting the data into 'data to train' and 'data to test'
# test_size=0.30 means the algorithm is trained on 70% of the data
# and the remaining 30% is held out for testing.
# Why not train on 100% of the data and create an ultimate, powerful AI?
# Because we would have no unseen data left to check it: the algorithm
# could simply CTRL+C / CTRL+V the rows it has seen (very good at
# copy and paste) without being good at predicting new values.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
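One detail worth knowing: train_test_split shuffles the rows at random, so every run of the notebook produces a different split and slightly different scores. A minimal sketch of how to make the split repeatable, assuming we are free to pass the random_state parameter (the value 42 and the toy arrays are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 feature columns.
features = np.arange(40).reshape(20, 2)
target = np.arange(20)

# random_state fixes the shuffle, so the split is the same on every run.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.30, random_state=42
)
print(x_tr.shape, x_te.shape)

# Splitting again with the same seed gives the same test rows.
_, x_te_again, _, _ = train_test_split(
    features, target, test_size=0.30, random_state=42
)
same_split = np.array_equal(x_te, x_te_again)
print(same_split)
```

Without a fixed seed, the metrics reported further down will change a little each time the code is executed.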

In this case, we want to predict a numeric value for sales, so we need a regression algorithm. Since we have 3 variables to predict our sales value, we will use decision-tree-based algorithms. (If we had only 2 variables with some correlation, TV and Sales for example, we could use a simple linear regression; with multiple variables, we chose tree-based models, which handle several features at once.)
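For comparison, this is roughly what the simple linear regression mentioned above looks like for a single TV-to-Sales relationship. It is only a sketch on made-up numbers (the tv and sales arrays are hypothetical and built to follow an exact line), not part of the analysis in this project:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: sales = 3 + 0.2 * tv, with no noise.
tv = np.array([[10.0], [20.0], [30.0], [40.0]])
sales = np.array([5.0, 7.0, 9.0, 11.0])

# Fit a straight line through the points.
line = LinearRegression().fit(tv, sales)
slope = line.coef_[0]
intercept = line.intercept_
print(slope, intercept)

# Predict sales for a TV investment of 50.
predicted = line.predict(np.array([[50.0]]))[0]
print(predicted)
```

The slope reads directly as "extra sales per extra unit of TV investment", which is something a tree-based model does not give us so plainly.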

## importing the algorithms that will be used
## We will use 2 kinds of decision-tree ensembles and see which brings us
## a better result (random forest and extra trees)
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
randomforest = RandomForestRegressor()
extratree = ExtraTreesRegressor()
## algorithm training
randomforest.fit(x_train, y_train)
extratree.fit(x_train, y_train)

After training, we can measure how accurate our algorithms are.

from sklearn import metrics
test_random = randomforest.predict(x_test)
test_extra = extratree.predict(x_test)
r2_random = metrics.r2_score(y_test, test_random)
r2_extra = metrics.r2_score(y_test, test_extra)
print('R²')
print(f'randomforest = {r2_random:.2%}')
print(f'extratree = {r2_extra:.2%}')
print('='*20)
erro_random = metrics.mean_squared_error(y_test, test_random)
erro_extratree = metrics.mean_squared_error(y_test, test_extra)
print('error')
print(f'randomforest = {erro_random:.2f}')
print(f'extratree = {erro_extratree:.2f}')
print('='*20)

In this run, the extratree algorithm performed better, with an R² of 93.80% and a mean squared error of 1.86. So we will use extratree to make our predictions.
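To make these two numbers less mysterious, here is a small sketch that computes both metrics by hand and checks them against scikit-learn. The y_true and y_pred arrays are made up for illustration:

```python
import numpy as np
from sklearn import metrics

# Hypothetical real values and predictions.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 16.0])

# Mean squared error: the average of the squared differences.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R²: 1 minus (squared error of the model) divided by
# (squared error of always predicting the mean of y_true).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, metrics.mean_squared_error(y_true, y_pred))
print(r2_manual, metrics.r2_score(y_true, y_pred))
```

Note that the mean squared error is expressed in squared sales units; taking its square root gives an error in the same unit as the sales column.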

Plotting the algorithm performance.

import matplotlib.pyplot as plt
comparative_table = pd.DataFrame()
comparative_table['Real Sales values'] = y_test
comparative_table['Prediction'] = test_extra
## plot
plt.figure(figsize=(10,5))
sns.lineplot(data=comparative_table)

If we want to compare the predicted with the real data, we can print the dataframe.

display(comparative_table.head())
## to measure which variable has more importance in our algorithm we can use the attribute 'feature_importances_'
print(extratree.feature_importances_)

Our columns are:
1º TV
2º Radio
3º NewsPaper
For now, we know that TV has the most importance for the sales result. Radio and Newspaper are almost the same, but it looks like investments in Newspaper return slightly more value in sales. What happens if we make a prediction simulating more investment in TV and Newspaper, leaving only a small value for Radio?
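The raw array printed by feature_importances_ follows the column order of the training data, so it is easy to mix up which number belongs to which channel. A minimal sketch of labeling the values, using synthetic data (toy_x, toy_y and the coefficients are hypothetical, chosen so that TV drives the target):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: the target depends strongly on TV,
# weakly on Radio, and not at all on Newspaper.
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
toy_y = 0.05 * toy_x["TV"] + 0.01 * toy_x["Radio"] + rng.normal(0, 0.5, 200)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(toy_x, toy_y)

# Attach the column names to the importances and sort them.
importances = pd.Series(
    model.feature_importances_, index=toy_x.columns
).sort_values(ascending=False)
print(importances)
```

The importances always add up to 1, so each value can be read as a share of the model's total reliance on the features.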

To predict values we use the pattern ‘trained algorithm + .predict([[values]])’. In this case, the code will be:
extratree.predict([[tv_value, radio_value, newspaper_value]])
Let’s play with the investment numbers. We have 100% to split among TV, Radio and Newspaper; let’s see what we can do.

# values are passed as numbers, in the column order TV, Radio, Newspaper
print(extratree.predict([[80, 10, 10]]))
print(extratree.predict([[50, 10, 40]]))
print(extratree.predict([[50, 25, 25]]))
print(extratree.predict([[20, 30, 50]]))
print(extratree.predict([[40, 30, 30]]))
print(extratree.predict([[10, 80, 10]]))

With this data, we can conclude that to increase its revenue, the company should give more weight to TV in its investments, and sales should increase.

In this case, we didn’t look at statistical tests, and I’m sorry for that, but remember, I’m just a student trying to learn some things about data science. I hope this article has been helpful to you; I still have a lot to learn. If you have suggestions on how we can improve the analysis of this case, feel free to talk with me.

Github
https://github.com/saulotp/project_how-can-we-increase-revenue
