Say we are working at a company that is deciding on a new store location. Our first idea might be to use sales data to predict where we should locate our new store. The only problem is that we are not certain we have quality data.
The purpose of this project is to assess the accuracy of our data, identify outliers, and treat them appropriately (i.e., keep them or remove them from the data).
import sqlite3
import pandas as pd
from scipy.stats import iqr
import numpy as np
import matplotlib.pyplot as plt
df_monthly_sales = pd.read_csv('p2-2010-pawdacity-monthly-sales-p2-2010-pawdacity-monthly-sales.csv')
df_web_scrape = pd.read_csv('p2-partially-parsed-wy-web-scrape.csv', dtype = {'2010_Census': str}) #census values still contain commas and tags, so read them as strings and clean them below
df_naics_data = pd.read_csv('p2-wy-453910-naics-data.csv')
df_demographic_data = pd.read_csv('p2-wy-demographic-data.csv')
tables = [df_demographic_data, df_monthly_sales, df_web_scrape, df_naics_data]
for table in tables:
    table.columns = [x.replace(" ", "_").lower() for x in table.columns]
#clean web scrape df
for column in df_web_scrape.columns[1:]:
    #strip the HTML tags from each cell
    df_web_scrape[column] = df_web_scrape[column].apply(lambda x: x.replace('<td>', '').split('<')[0])
df_webscrape_split = df_web_scrape['city|county'].str.split("|", expand = True)
df_web_scrape = pd.concat([df_web_scrape, df_webscrape_split], axis = 1).drop('city|county', axis = 1)
df_web_scrape.rename(columns = {0: 'city', 1:'county'}, inplace = True)
df_web_scrape.city = df_web_scrape.city.str.replace('?', '', regex = False).str.strip()
#clean naics df
df_naics_data = pd.concat([df_naics_data, df_naics_data.cass_lastline.str.split(" ", expand = True).drop(3, axis = 1)], axis = 1)
df_naics_data.drop('cass_lastline', axis = 1, inplace = True)
df_naics_data.rename(columns = {0: 'city', 1: 'state', 2: 'zip'}, inplace = True)
df_naics_data.city = df_naics_data.city.str.strip(',')
df_naics_data.head()
df_web_scrape.rename({'2010_census': 'twenty_ten_census'}, axis = 1, inplace = True)
df_web_scrape.head()
df_demographic_data[df_demographic_data.city == 'Rock Springs']
#load database with tables for each csv
conn = sqlite3.connect('pawdacity.db')
df_monthly_sales.to_sql('sales', conn, if_exists = 'replace')
df_demographic_data.to_sql('demographic', conn, if_exists = 'replace')
df_naics_data.to_sql('competitorSales', conn, if_exists = 'replace')
df_web_scrape.to_sql('population', conn, if_exists = 'replace')
q = """WITH salesdemos AS (SELECT * FROM sales s JOIN population p ON s.city = p.city)
SELECT d.city, d.households_with_under_18, d.land_area, d.population_density, d.total_families, s.twenty_ten_census,
SUM(s.january)+SUM(s.february) + SUM(s.march)+ SUM(s.april) + SUM(s.may) + SUM(s.june)+ SUM(s.july) +
SUM(s.august) + SUM(s.september)+ SUM(s.october) + SUM(s.november) + SUM(s.december) AS total_sales
FROM demographic d
JOIN salesdemos s ON d.city = s.city
GROUP BY s.city
"""
joined_db = pd.read_sql_query(q, conn)
joined_db.twenty_ten_census = joined_db.twenty_ten_census.apply(lambda x: int(x.replace(",", "")))
joined_db.rename({'twenty_ten_census':'2010_census'}, axis = 1, inplace = True)
joined_db
#average of each numeric column
for column in list(joined_db.columns)[1:]:
    print(column+':', float(joined_db[column].mean()))

#sum of each numeric column
for column in list(joined_db.columns)[1:]:
    print(column+':', int(round(joined_db[column].sum())))
#histograms of each feature
f, axes = plt.subplots(3, 2, figsize = (10, 15))
i, j = 0, 0
for column in list(joined_db.columns[1:]):
    axes[i, j].hist(list(joined_db[column]))
    axes[i, j].set_title(column)
    j += 1
    if j == 2:
        i += 1
        j = 0
#boxplots of each feature to highlight outliers
f, axes = plt.subplots(3, 2, figsize = (10, 15))
i, j = 0, 0
for column in list(joined_db.columns[1:]):
    axes[i, j].boxplot(list(joined_db[column]))
    axes[i, j].set_title(column)
    j += 1
    if j == 2:
        i += 1
        j = 0
From the above histograms and boxplots we can see that there are several potentially problematic outliers in our data. Let's drill down to see which cities have features with outliers so we can get an idea of how to handle them.
#expected data should fall in the range: median +/- 1.5*IQR
outlier_range = {}
for column in list(joined_db.columns)[1:]:
    col_series = joined_db[column]
    med_value = np.median(col_series)
    print(column+':', int(round(col_series.sum())))
    data_range = (med_value - 1.5*iqr(col_series), med_value + 1.5*iqr(col_series))
    outlier_range[column] = data_range
    print('data range: ', data_range)
    #flag values that fall outside the expected range
    df_outliers = col_series.apply(lambda x: x if x < data_range[0] or x > data_range[1] else False)
    print(df_outliers[df_outliers != False], '\n')
From here we see that Cheyenne (index 2) has the greatest number of outliers. A quick Google search finds that Cheyenne is the largest city in WY, and each value roughly matches the figures summarized on Wikipedia and Data USA. Next, we analyze households with under 18 for Casper (index 1). Similarly, we find that this data seems accurate and that the population in Casper tends to be older. Next, we see that Gillette (index 6) has a high number of sales. This seems reasonable because it is the 3rd largest city in WY. Finally, we see that the land area for Rock Springs (index 9) is 6,620.2, which is unreasonably large (larger than Casper, the largest city in WY by area). Therefore we should remove Rock Springs from our dataset.
https://en.wikipedia.org/wiki/Wyoming#Cities_and_towns
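As a side note, rather than hard-coding row index 9 in the next cell, we could look up the Rock Springs row by city name and drop it by label. A minimal sketch of that alternative (the joined_db_checked name is just for illustration):
#look up the index label(s) for Rock Springs instead of hard-coding 9
rock_springs_idx = joined_db[joined_db.city == 'Rock Springs'].index
#equivalent to the drop below, but robust to changes in row order
joined_db_checked = joined_db.drop(rock_springs_idx, axis = 0)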
joined_db.drop(9, axis =0, inplace = True)
joined_db
In another case we might have tried to impute the erroneous value or account for it in some other way, but here we have few data points and imputing it could significantly skew our data. If we had a way to find the accurate figure, that would be preferable, but assuming we do not, we should remove the row from our data set.
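For illustration only, imputing the suspect value instead of dropping the row might have looked like the sketch below, which replaces Rock Springs' land area with the median of the other cities. It assumes a copy of the merged table taken before the drop above (here called joined_db_pre_drop, a hypothetical name), and median imputation is just one of several reasonable choices:
#hypothetical alternative: impute the suspect land_area instead of dropping the row
#assumes joined_db_pre_drop is a copy of the merged table made before the drop above
joined_db_imputed = joined_db_pre_drop.copy()
#median land area of the other cities (unaffected by the outlier itself)
median_land_area = joined_db_imputed.loc[joined_db_imputed.city != 'Rock Springs', 'land_area'].median()
#overwrite only the flagged value; all other Rock Springs columns are kept
joined_db_imputed.loc[joined_db_imputed.city == 'Rock Springs', 'land_area'] = median_land_area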