Finding a New Pet Store Location

Handling Outliers

BUSINESS AND DATA UNDERSTANDING

Say we are working at a company that is deciding where to open a new store. Our first idea might be to use sales data to predict where we should locate the new store. The only problem is that we are not certain we have quality data.

The purpose of this project is to assess the accuracy of our data, identify outliers, and treat them appropriately (i.e., keep them or remove them from the data).

In [26]:
import sqlite3 
import pandas as pd 
from scipy.stats import iqr
import numpy as np
import matplotlib.pyplot as plt
In [27]:
df_monthly_sales = pd.read_csv('p2-2010-pawdacity-monthly-sales-p2-2010-pawdacity-monthly-sales.csv')
df_web_scrape = pd.read_csv('p2-partially-parsed-wy-web-scrape.csv', dtype = {'2010_Census': int})
df_naics_data = pd.read_csv('p2-wy-453910-naics-data.csv')
df_demographic_data = pd.read_csv('p2-wy-demographic-data.csv')

tables = [df_demographic_data, df_monthly_sales, df_web_scrape, df_naics_data]

for table in tables:
    table.columns = [x.replace(" ", "_").lower() for x in table.columns ]
In [28]:
#clean webscrape df

for column in df_web_scrape.columns[1:]:
    #clean up tags 
    df_web_scrape[column] = df_web_scrape[column].apply(lambda x: x.replace('<td>', '').split('<')[0])

df_webscrape_split = df_web_scrape['city|county'].str.split("|", expand = True)
df_web_scrape = pd.concat([df_web_scrape, df_webscrape_split], axis = 1).drop('city|county', axis = 1)
df_web_scrape.rename(columns = {0: 'city', 1:'county'}, inplace = True)
df_web_scrape.city = df_web_scrape.city.str.replace('?', '', regex = False).str.strip()
In [29]:
#clean naics df

df_naics_data = pd.concat([df_naics_data,
                           df_naics_data.cass_lastline.str.split(" ", expand = True).drop(3, axis = 1)],
                          axis = 1)
df_naics_data.drop('cass_lastline', axis = 1, inplace = True)
df_naics_data.rename(columns = {0: 'city', 1: 'state', 2: 'zip'}, inplace = True)
df_naics_data.city = df_naics_data.city.str.strip(',')
In [30]:
df_naics_data.head()
Out[30]:
business_name physical_city_name sales_volume city state zip
0 Mile High Mobile Pet LLC Cheyenne 300000 Cheyenne WY 82007-3528
1 Pets City Inc Cheyenne 640000 Cheyenne WY 82009-4851
2 Petco Animal Sups Stores Inc Cheyenne 0 Cheyenne WY 82009-4945
3 Pet-A-Care Cheyenne 81000 Cheyenne WY 82009-1009
4 Muddy Paws Pet Salon Laramie 76000 Laramie WY 82070-8979
In [31]:
df_web_scrape.rename({'2010_census': 'twenty_ten_census'}, axis = 1, inplace = True)
df_web_scrape.head()
Out[31]:
2014_estimate twenty_ten_census 2000_census city county
0 1,968 1,911 1,818 Afton Lincoln
1 185 181 120 Albin Laramie
2 845 828 550 Alpine Lincoln
3 439 440 348 Baggs Carbon
4 107 106 97 Bairoil Sweetwater
In [32]:
df_demographic_data[df_demographic_data.city == 'Rock Springs']
Out[32]:
city county land_area households_with_under_18 population_density total_families
87 Rock Springs Sweetwater 6620.201916 4022 2.78 7572.18
In [33]:
#load database with tables for each csv 

conn = sqlite3.connect('pawdacity.db')

df_monthly_sales.to_sql('sales', conn, if_exists = 'replace')
df_demographic_data.to_sql('demographic', conn, if_exists = 'replace')
df_naics_data.to_sql('competitorSales', conn, if_exists = 'replace')
df_web_scrape.to_sql('population', conn, if_exists = 'replace')
In [34]:
q = """WITH salesdemos AS (SELECT * FROM sales s JOIN population p ON s.city = p.city)
        

        SELECT d.city, d.households_with_under_18, d.land_area, d.population_density, d.total_families, s.twenty_ten_census,
        SUM(s.january)+SUM(s.february) + SUM(s.march)+ SUM(s.april) + SUM(s.may) + SUM(s.june)+ SUM(s.july) +
        SUM(s.august) + SUM(s.september)+ SUM(s.october) + SUM(s.november) + SUM(s.december) AS total_sales
        FROM demographic d 
        JOIN salesdemos s ON d.city = s.city
        GROUP BY s.city
        """
In [35]:
joined_db = pd.read_sql_query(q, conn)
joined_db.twenty_ten_census = joined_db.twenty_ten_census.apply(lambda x: int(x.replace(",", "")))
joined_db.rename({'twenty_ten_census':'2010_census'}, axis = 1, inplace = True)
joined_db
Out[35]:
city households_with_under_18 land_area population_density total_families 2010_census total_sales
0 Buffalo 746 3115.507500 1.55 1819.50 4585 185328
1 Casper 7788 3894.309100 11.16 8756.32 35316 317736
2 Cheyenne 7158 1500.178400 20.34 14612.64 59466 917892
3 Cody 1403 2998.956960 1.82 3515.62 9520 218376
4 Douglas 832 1829.465100 1.46 1744.08 6120 208008
5 Evanston 1486 999.497100 4.95 2712.64 12359 283824
6 Gillette 4052 2748.852900 5.80 7189.43 29087 543132
7 Powell 1251 2673.574550 1.62 3134.18 6314 233928
8 Riverton 2680 4796.859815 2.34 5556.49 10615 303264
9 Rock Springs 4022 6620.201916 2.78 7572.18 23036 253584
10 Sheridan 2646 1893.977048 8.98 6039.71 17444 308232

AVERAGES

In [36]:
for column in list(joined_db.columns)[1:]:
    print(column+':', float((joined_db[column].mean())))
households_with_under_18: 3096.7272727272725
land_area: 3006.489126272727
population_density: 5.70909090909091
total_families: 5695.708181818181
2010_census: 19442.0
total_sales: 343027.63636363635
In [37]:
for column in list(joined_db.columns)[1:]:
    print(column+':', int(round(joined_db[column].sum())))
households_with_under_18: 34064
land_area: 33071
population_density: 63
total_families: 62653
2010_census: 213862
total_sales: 3773304
In [43]:
f, axes = plt.subplots(3, 2, figsize = (10, 15))

i, j = 0, 0 
for column in list(joined_db.columns[1:]):
    axes[i, j].hist(list(joined_db[column]))
    axes[i, j].set_title(column)
    j += 1
    if j == 2:
        i +=1 
        j = 0

OUTLIERS

In [44]:
f, axes = plt.subplots(3, 2, figsize = (10, 15))

i, j = 0, 0 
for column in list(joined_db.columns[1:]):
    axes[i, j].boxplot(list(joined_db[column]))
    axes[i, j].set_title(column)
    j += 1
    if j == 2:
        i +=1 
        j = 0
    

From the histograms and boxplots above we can see that there are several potentially problematic outliers in our data. Let's drill down to see which cities have features with outliers so we can get an idea of how to handle them.

In [48]:
#flag values that fall outside the range: median +/- 1.5 * IQR

outlier_range = {}

for column in list(joined_db.columns)[1:]:
    col_series = joined_db[column]
    med_value = np.median(col_series)
    print(column + ':', int(round(col_series.sum())))

    data_range = (med_value - 1.5 * iqr(col_series), med_value + 1.5 * iqr(col_series))
    outlier_range[column] = data_range
    print('data range: ', data_range)

    df_outliers = col_series.apply(lambda x: x if x < data_range[0] or x > data_range[1] else False)
    print(df_outliers[df_outliers != False], '\n')
    
households_with_under_18: 30042
data range:  (-1564.0, 5696.0)
1    7788
2    7158
Name: households_with_under_18, dtype: object 

land_area: 26451
data range:  (850.0485579999995, 4572.378892)
8    4796.86
Name: land_area, dtype: object 

population_density: 60
data range:  (-6.127500000000001, 13.4175)
2    20.34
Name: population_density, dtype: object 

total_families: 55081
data range:  (-1589.9074999999993, 10662.0175)
2    14612.6
Name: total_families, dtype: object 

2010_census: 190826
data range:  (-17104.125, 40078.125)
2    59466
Name: 2010_census, dtype: object 

total_sales: 3519720
data range:  (153900.0, 433188.0)
2    917892
6    543132
Name: total_sales, dtype: object 
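
Note that the range above is centered on the median. A more common convention is the Tukey fence, which is built from the quartiles (Q1 - 1.5*IQR and Q3 + 1.5*IQR). A minimal sketch of that variant, shown for comparison only and assuming the same joined_db DataFrame, would look like:

# comparison only: Tukey-style fences built from the quartiles rather than the median
for column in list(joined_db.columns)[1:]:
    q1, q3 = np.percentile(joined_db[column], [25, 75])
    fence = (q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))
    mask = (joined_db[column] < fence[0]) | (joined_db[column] > fence[1])
    print(column, fence, list(joined_db.loc[mask, 'city']))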

How to Handle the Outliers

From here we see that Cheyenne (2) has the greatest number of outliers. A quick Google search confirms that Cheyenne is the largest city in WY, and each of its values roughly matches the figures summarized on Wikipedia and Data USA. Next, we look at the households_with_under_18 value for Casper (1); this figure also appears accurate, as the population in Casper simply skews older. Gillette (6) shows high total sales, which seems reasonable because it is the 3rd largest city in WY. Finally, the land area listed for Rock Springs (9) is 6620.2, which is implausibly large (larger than the land area recorded for any other city in our data, including Casper). Therefore we should remove Rock Springs from our dataset.

https://en.wikipedia.org/wiki/Wyoming#Cities_and_towns

https://en.wikipedia.org/wiki/Cheyenne,_Wyoming

https://datausa.io/profile/geo/cheyenne-wy/
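
An equivalent and slightly more robust alternative to the positional drop(9) call in the next cell (sketched here only, not run) is to filter by city name, so the operation does not depend on the row's index label:

# alternative sketch: drop Rock Springs by name rather than by index label
joined_db = joined_db[joined_db.city != 'Rock Springs'].copy()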

In [46]:
joined_db.drop(9, axis =0, inplace = True)
In [47]:
joined_db
Out[47]:
city households_with_under_18 land_area population_density total_families 2010_census total_sales
0 Buffalo 746 3115.507500 1.55 1819.50 4585 185328
1 Casper 7788 3894.309100 11.16 8756.32 35316 317736
2 Cheyenne 7158 1500.178400 20.34 14612.64 59466 917892
3 Cody 1403 2998.956960 1.82 3515.62 9520 218376
4 Douglas 832 1829.465100 1.46 1744.08 6120 208008
5 Evanston 1486 999.497100 4.95 2712.64 12359 283824
6 Gillette 4052 2748.852900 5.80 7189.43 29087 543132
7 Powell 1251 2673.574550 1.62 3134.18 6314 233928
8 Riverton 2680 4796.859815 2.34 5556.49 10615 303264
10 Sheridan 2646 1893.977048 8.98 6039.71 17444 308232

While in another case we might have tried to impute a corrected value or account for the bad data point in some other way, here we have very few data points and imputing could significantly skew our results. If we could find the accurate land area for Rock Springs some other way, that would be preferable, but assuming we cannot, removing it from the data set is the safer choice.
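
For reference, had we chosen to keep Rock Springs instead, a minimal sketch of median imputation for the suspect land_area value (an illustration of the alternative discussed above, not what was done here) could look like:

# hypothetical sketch: impute the suspect land_area instead of dropping the row
suspect = joined_db.city == 'Rock Springs'
median_land_area = joined_db.loc[~suspect, 'land_area'].median()   # median of the other cities
joined_db.loc[suspect, 'land_area'] = median_land_area             # overwrite the implausible value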