Say we are working at a company that is deciding on a new store location. Our first idea might be to use sales data to predict where we should locate our new store. The only problem is that we are not certain we have quality data.
The purpose of this project is to assess the accuracy of our data, identify outliers, and treat them appropriately (i.e., keep them or remove them from the data).
import sqlite3
import pandas as pd
from scipy.stats import iqr
import numpy as np
import matplotlib.pyplot as plt
df_monthly_sales = pd.read_csv('p2-2010-pawdacity-monthly-sales-p2-2010-pawdacity-monthly-sales.csv')
df_web_scrape = pd.read_csv('p2-partially-parsed-wy-web-scrape.csv', dtype = {'2010_Census': str}) #census values still contain commas and tags, so read them as strings and clean them below
df_naics_data = pd.read_csv('p2-wy-453910-naics-data.csv')
df_demographic_data = pd.read_csv('p2-wy-demographic-data.csv')
tables = [df_demographic_data, df_monthly_sales, df_web_scrape, df_naics_data]
for table in tables:
    table.columns = [x.replace(" ", "_").lower() for x in table.columns]
#clean web scrape df
for column in df_web_scrape.columns[1:]:
    #strip the HTML tags from each cell
    df_web_scrape[column] = df_web_scrape[column].apply(lambda x: x.replace('<td>', '').split('<')[0])
df_webscrape_split = df_web_scrape['city|county'].str.split("|", expand = True)
df_web_scrape = pd.concat([df_web_scrape, df_webscrape_split], axis = 1).drop('city|county', axis = 1)
df_web_scrape.rename(columns = {0: 'city', 1:'county'}, inplace = True)
df_web_scrape.city = df_web_scrape.city.str.replace('?', '', regex = False).str.strip()
#clean naics df
df_naics_data = pd.concat([df_naics_data, df_naics_data.cass_lastline.str.split(" ", expand = True).drop(3, axis = 1)], axis = 1)
df_naics_data.drop('cass_lastline', axis = 1, inplace = True)
df_naics_data.rename(columns = {0: 'city', 1: 'state', 2: 'zip'}, inplace = True)
df_naics_data.city = df_naics_data.city.str.strip(',')
df_naics_data.head()
df_web_scrape.rename({'2010_census': 'twenty_ten_census'}, axis = 1, inplace = True)
df_web_scrape.head()
df_demographic_data[df_demographic_data.city == 'Rock Springs']
#load database with tables for each csv
conn = sqlite3.connect('pawdacity.db')
df_monthly_sales.to_sql('sales', conn, if_exists = 'replace')
df_demographic_data.to_sql('demographic', conn, if_exists = 'replace')
df_naics_data.to_sql('competitorSales', conn, if_exists = 'replace')
df_web_scrape.to_sql('population', conn, if_exists = 'replace')
q = """WITH salesdemos AS (SELECT * FROM sales s JOIN population p ON s.city = p.city)
SELECT d.city, d.households_with_under_18, d.land_area, d.population_density, d.total_families, s.twenty_ten_census,
SUM(s.january)+SUM(s.february) + SUM(s.march)+ SUM(s.april) + SUM(s.may) + SUM(s.june)+ SUM(s.july) +
SUM(s.august) + SUM(s.september)+ SUM(s.october) + SUM(s.november) + SUM(s.december) AS total_sales
FROM demographic d
JOIN salesdemos s ON d.city = s.city
GROUP BY s.city
"""
joined_db = pd.read_sql_query(q, conn)
joined_db.twenty_ten_census = joined_db.twenty_ten_census.apply(lambda x: int(x.replace(",", "")))
joined_db.rename({'twenty_ten_census':'2010_census'}, axis = 1, inplace = True)
joined_db
#average of each numeric column
for column in list(joined_db.columns)[1:]:
    print(column+':', float(joined_db[column].mean()))

#sum of each numeric column
for column in list(joined_db.columns)[1:]:
    print(column+':', int(round(joined_db[column].sum())))
#histograms of each feature
f, axes = plt.subplots(3, 2, figsize = (10, 15))
i, j = 0, 0
for column in list(joined_db.columns[1:]):
    axes[i, j].hist(list(joined_db[column]))
    axes[i, j].set_title(column)
    j += 1
    if j == 2:
        i += 1
        j = 0
#boxplots of each feature to highlight outliers
f, axes = plt.subplots(3, 2, figsize = (10, 15))
i, j = 0, 0
for column in list(joined_db.columns[1:]):
    axes[i, j].boxplot(list(joined_db[column]))
    axes[i, j].set_title(column)
    j += 1
    if j == 2:
        i += 1
        j = 0
From the above histograms and boxplots we can see that there are several potentially problematic outliers in our data. Let's drill down to see which cities have features with outliers so we can get an idea of how to handle them.
#expected data should fall in the range: median +/- 1.5*IQR
outlier_range = {}
for column in list(joined_db.columns)[1:]:
    col_series = joined_db[column]
    med_value = np.median(col_series)
    print(column+':', int(round(col_series.sum())))
    data_range = (med_value - 1.5*iqr(col_series), med_value + 1.5*iqr(col_series))
    outlier_range[column] = data_range
    print('data range: ', data_range)
    #flag values that fall outside the expected range
    df_outliers = col_series.apply(lambda x: x if x < data_range[0] or x > data_range[1] else False)
    print(df_outliers[df_outliers != False], '\n')
From here we see that Cheyenne (index 2) has the greatest number of outliers. A quick Google search finds that Cheyenne is the largest city in WY, and each value roughly matches the figures summarized on Wikipedia and Data USA. Next, we analyze households with under 18 for Casper (index 1). Similarly, we find that this data seems accurate and that the population in Casper tends to be older. Next, we see that Gillette (index 6) has a high number of sales. This seems reasonable because it is the 3rd largest city in WY. Finally, we see that the land area for Rock Springs (index 9) is 6,620.2, which is unreasonably large (larger than Casper, the largest city in WY by area). Therefore we should remove Rock Springs from our dataset.
https://en.wikipedia.org/wiki/Wyoming#Cities_and_towns
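As a side note, rather than hard-coding row index 9 in the next cell, we could look up the Rock Springs row by city name and drop it by label. A minimal sketch of that alternative (the joined_db_checked name is just for illustration):
#look up the index label(s) for Rock Springs instead of hard-coding 9
rock_springs_idx = joined_db[joined_db.city == 'Rock Springs'].index
#equivalent to the drop below, but robust to changes in row order
joined_db_checked = joined_db.drop(rock_springs_idx, axis = 0)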
joined_db.drop(9, axis =0, inplace = True)
joined_db
In another case we might have tried to impute the erroneous value or account for it in some other way, but here we have few data points and imputing it could significantly skew our data. If we had a way to find the accurate figure, that would be preferable, but assuming we do not, we should remove the row from our data set.
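For illustration only, imputing the suspect value instead of dropping the row might have looked like the sketch below, which replaces Rock Springs' land area with the median of the other cities. It assumes a copy of the merged table taken before the drop above (here called joined_db_pre_drop, a hypothetical name), and median imputation is just one of several reasonable choices:
#hypothetical alternative: impute the suspect land_area instead of dropping the row
#assumes joined_db_pre_drop is a copy of the merged table made before the drop above
joined_db_imputed = joined_db_pre_drop.copy()
#median land area of the other cities (unaffected by the outlier itself)
median_land_area = joined_db_imputed.loc[joined_db_imputed.city != 'Rock Springs', 'land_area'].median()
#overwrite only the flagged value; all other Rock Springs columns are kept
joined_db_imputed.loc[joined_db_imputed.city == 'Rock Springs', 'land_area'] = median_land_area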