Okay, I have decided to try to work through the book “Solve Any Data Analysis Problem” by David Asboth (Manning), subtitled Eight projects that show you how. My goal/hope is to actually work through all eight projects and document what I do. Hopefully without any violation of the author’s copyright.
The author wants the reader to attempt the requested analysis before reading his approach and code. Which would have been my approach anyway.
So, let’s start with the first one.
The idea is that you are an analyst for a UK company and you have been asked to provide a comparison of customer spending for London-based customers versus the rest of the UK. Management wants to determine which UK cities are underserved and whether or not their customers are mainly located in London. You are provided with a data file containing the address of each customer and the total amount each customer has spent to date.
Plan of Attack
Okay, for now I am going to try to build a data structure containing the information in the supplied data file. But, hopefully, in a way that will better facilitate the analysis. I plan to use a pandas dataframe.
What I want to do is read the file into a dataframe, then add a column for the city/town the business is in, that information being obtained from the address. I thought about manually editing the file to add the extra field, but when I saw that there are 100,000 entries in the file I decided against that approach.
Get Data into a Dataframe
The file itself looks like a comma-separated values (CSV) file with three fields: the company ID, the address and the total spending to date. The address field is a quoted string, itself a comma-separated list of address parts. Unfortunately, the version of the file I received had a lot of unwanted new-lines. At least, unwanted by me.
company_id,address,total_spend
1,"APARTMENT 2,
52 BEDFORD ROAD,
LONDON,
ENGLAND,
SW4 7HJ",5700
2,"107 SHERINGHAM AVENUE,
LONDON,
N14 4UJ",4700
3,"43 SUNNINGDALE,
YATE,
BRISTOL,
ENGLAND,
BS37 4HZ",5900
So, in VS Code I used a regular expression to remove the unwanted new-lines: I substituted r',\n' with r','. That seemed to solve that. So, let’s load it into a dataframe and let pandas tell us something about the data.
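For repeatability, that same substitution could be done in Python before pandas ever sees the file. A minimal sketch, assuming the cleaned copy goes to a new file (the name addresses_clean.csv is my own, not from the book):
import re
from pathlib import Path
# read the raw file, collapse the new-lines that follow commas,
# and write the result out for pandas to load
raw = Path("addresses.csv").read_text(encoding="utf-8")
cleaned = re.sub(r",\n", ",", raw)
Path("addresses_clean.csv").write_text(cleaned, encoding="utf-8")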
import pandas as pd
from pathlib import Path

# Start by reading in data file and seeing what I can find in the way of cities
data_fl = "addresses.csv"
a_path = Path(data_fl)
if not a_path.exists():
    # bail out rather than letting read_csv blow up on a missing file
    raise SystemExit(f"File, {data_fl}, not found in current directory")
addrs = pd.read_csv(a_path, sep=",", header=0)

# let's have a look at some basic info regarding the dataframe contents
print()
addrs.info()
(fin-3.11) PS R:\learn\data_analysis\chap_2> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 99032 non-null object
2 total_spend 100000 non-null int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB
Right away we can see that there are a number of empty address fields in the dataframe. Let’s have a look at a few.
# let's get a count and have a look at some of those empty address rows
print(f"nbr missing entries: {addrs.isna().sum()}")
df_na = addrs[addrs.isna().any(axis=1)]
print("\n", df_na.head())
company_id address total_spend
135 136 NaN 7300
149 150 NaN 5400
168 169 NaN 4600
201 202 NaN 5800
257 258 NaN 3900
I decided I didn’t want to throw those rows away, as it might be important to let management know how many entries were missing an address. So, I filled the empty address fields with the string No-Address.
# let's put something informative in the empty address locations
addrs.fillna("No-Address", inplace=True)
# another look at the basic dataframe info
print()
addrs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 100000 non-null object
2 total_spend 100000 non-null int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB
A couple of things resolved.
New CSV File
I am doing the coding a bit at a time in an iterative fashion. As such, I don’t really want to go through the above steps every time I load the data file.
So I am going to save the dataframe to a new CSV file and load that rather than the initial version. This new CSV will be updated whenever I add to or massage the data in order to complete the requested analysis. Or, perhaps, another file will be saved and used for future steps in the analysis.
# save modified dataframe to new CSV file
nwa_pth = Path("updated_addresses.csv")
addrs.to_csv(nwa_pth, index=False)
Now to refactor my code to use the new file if it exists.
# r:\learn\data_analysis\chap_2\geo.py
# code for solving data analysis problem in Chap 2 of 'Solve Any Data Analysis Problem'
# rek, 2024.01.12
import pandas as pd
from pathlib import Path


def init_df(fl_nm):
    # Start by reading in data file and seeing what I can find in the way of cities
    a_path = Path(fl_nm)
    if not a_path.exists():
        return None
    addrs = pd.read_csv(a_path, sep=",", header=0)
    # let's put something informative in the empty address locations
    addrs.fillna("No-Address", inplace=True)
    # save modified dataframe to new CSV file
    nwa_pth = Path("updated_addresses.csv")
    addrs.to_csv(nwa_pth, index=False)
    return addrs


def df_info(df):
    # let's have a look at some basic info regarding the dataframe contents
    print()
    df.info()
    # let's get a count and have a look at some of those empty address rows
    print(f"\nnbr missing entries: {df.isna().sum()}")
    df_na = df[df.isna().any(axis=1)]
    print("\n", df_na.head())


# the following is a bit of overkill as I know the file exists
upa_pth = Path("updated_addresses.csv")
if upa_pth.exists():
    print("\n\tloading updated addresses CSV file")
    addrs = pd.read_csv(upa_pth, sep=",", header=0)
else:
    print("\n\tloading initial addresses CSV file")
    addrs = init_df("addresses.csv")
if addrs is None:
    raise SystemExit("addresses.csv not found in current directory")

df_info(addrs)
And, running it the first time.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
loading initial addresses CSV file
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 100000 non-null object
2 total_spend 100000 non-null int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB
nbr missing entries: company_id 0
address 0
total_spend 0
dtype: int64
Empty DataFrame
Columns: [company_id, address, total_spend]
Index: []
Note, the function call replaced all the missing addresses, so the count of missing entries is zero. Then I ran the module a second time.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
loading updated addresses CSV file
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 100000 non-null object
2 total_spend 100000 non-null int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB
nbr missing entries: company_id 0
address 0
total_spend 0
dtype: int64
Empty DataFrame
Columns: [company_id, address, total_spend]
Index: []
Going to remove the debug print statements and, for now, the call to df_info(). Not really needed any longer.
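After that trim, the top of the module reduces to roughly this (my own condensation of the code above, not code from the book):
# load the saved CSV if it exists, otherwise build it from the original file
upa_pth = Path("updated_addresses.csv")
if upa_pth.exists():
    addrs = pd.read_csv(upa_pth, sep=",", header=0)
else:
    addrs = init_df("addresses.csv")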
Getting the City/Town
Okay, now the important bit: extracting the city/town from the address field, if it is present.
I started by looking very carefully at a number of address fields. Here are a few of them.
# let's look at various address values
rnd_smpl = addrs.sample(n=30)
rnd_smpl.info()
pd.set_option('display.max_colwidth', None)
print(rnd_smpl['address'].to_string(index=False))
UNIT 7 THE BARN GLEBE FARM BUSINESS PARK,WESTERHAM ROAD,KESTON,KENT,BR2 6AX
GEORGE STREET,COUPAR ANGUS,BLAIRGOWRIE,PERTHSHIRE,PH13 9LU
2 The Campkins Station Road,Melbourn,Royston,Cambridgeshire,England,Sg8 6Dx
82-86 SHEEN ROAD,RICHMOND,TW9 1UF
8 PANT-Y-CELYN STREET,YSTRAD MYNACH,HENGOED,WALES,CF82 7BL
P.O. BOX 3340,ROAD TOWN,TORTOLA,VIRGIN ISLANDS, BRITISH <--
3 Bevan Hill,Chesham,Hp5 2Qs
27 NORFOLK STREET,SUNNISIDE,SUNDERLAND,TYNE AND WEAR,UNITED KINGDOM,SR1 1EE
327 CLIFTON DRIVE SOUTH,LYTHAM ST ANNES,LANCS,FY8 1HN
No-Address
DAMER HOUSE,MEADOW WAY,WICKFORD,ESSEX,SS12 9HA
3RD FLOOR WEST QUAY HOUSE,SOUTH QUAY,DOUGLAS,ISLE OF MAN,IM1 5AR
71-75 SHELTON STREET,COVENT GARDEN,LONDON,UNITED KINGDOM,WC2H 9JQ
32 Kinburn Street,London,England,Se16 6Dw
Jubilee House,East Beach,Lytham St.Annes,Lancashire,England,Fy8 5Ft
63 MILLHOUSE DRIVE,GLASGOW,SCOTLAND,G20 0UF
Virgin Islands? I don’t expect management wants that in the analysis. What about other countries and/or dependencies? I found addresses for the Channel Islands, Germany, France and Canada. I didn’t search for more.
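I didn’t keep the throwaway code for those searches, but a quick probe along these lines (the marker strings are my own guesses, not a definitive list) would do it:
# count addresses containing some obviously non-UK markers
for marker in ("VIRGIN ISLANDS", "GUERNSEY", "JERSEY", "GERMANY", "FRANCE", "CANADA"):
    hits = addrs["address"].str.upper().str.contains(marker).sum()
    print(f"{marker}: {hits}")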
Looking at the above, and after a bit of trial code, it became clear to me that there was no easy way to extract the city/town from the addresses in the format provided. And there were the non-UK locations to account for as well.
But it did look like all of the UK-based addresses included the UK PostCode. Could I use that to determine the city/town for each entry?
Extract PostCode
Okay, still nowhere near sorting the city/town for each valid UK address. But based on the above observation, I thought I’d look at extracting any text from the address column that resembled a UK PostCode. I say resembled, because I am not trying to determine whether it is in fact a valid code. I will sort that out as I continue with the analysis. Simple things first.
And it was fairly simple. Do note the capture group; that’s what gets extracted. The comma I have before the capture group does not. I am using Python’s raw string syntax (r'') to avoid the need to escape the backslashes, as would be the case when submitting a plain Python string to the regex engine.
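To sanity-check the pattern before running it over 100,000 rows, here’s a throwaway test of my own against one of the sample addresses from earlier:
import re

PCODE_RE = re.compile(r",([a-z]{1,2}\d[a-z\d]?\s*\d[a-z]{2})", re.IGNORECASE)

# the capture group should pull out just the PostCode
sample = "43 SUNNINGDALE,YATE,BRISTOL,ENGLAND,BS37 4HZ"
m = PCODE_RE.search(sample)
print(m.group(1) if m else "no match")  # BS37 4HZ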
import re

# let's extract UK style PostCode from address column into new column
addrs['pcode'] = addrs['address'].str.extract(r',([a-z]{1,2}\d[a-z\d]?\s*\d[a-z]{2})', flags=re.IGNORECASE)
# for future use, convert all pcode values to upper case
addrs['pcode'] = addrs['pcode'].str.upper()
print()
addrs.info()
print("\n", addrs.head(), "\n", addrs.tail())
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 100000 non-null object
2 total_spend 100000 non-null int64
3 pcode 98332 non-null object
dtypes: int64(2), object(2)
memory usage: 3.1+ MB
company_id address total_spend pcode
0 1 APARTMENT 2,52 BEDFORD ROAD,LONDON,ENGLAND,SW4... 5700 SW4 7HJ
1 2 107 SHERINGHAM AVENUE,LONDON,N14 4UJ 4700 N14 4UJ
2 3 43 SUNNINGDALE,YATE,BRISTOL,ENGLAND,BS37 4HZ 5900 BS37 4HZ
3 4 HAWESWATER HOUSE,LINGLEY MERE BUSINESS PARK,LI... 7200 WA5 3LP
4 5 AMBERFIELD BARN HOUSE AMBER LANE,CHART SUTTON,... 4600 ME17 3SF
company_id address total_spend pcode
99995 99996 KNAPP HOUSE,SELSLEY WEST,STROUD,GLOUCESTERSHIR... 4600 GL5 5LJ
99996 99997 20 The Broadway, Hampton Court,Way, Thames Dit... 4100 KT7 0LX
99997 99998 MARLAND HOUSE,13 HUDDERSFIELD ROAD,BARNSLEY,SO... 5600 S70 2LW
99998 99999 4 MOUNT SCAR VIEW,SCHOLES,HOLMFIRTH HUDDERSFIE... 4600 HD9 1XH
99999 100000 Manningham Mills Community Center, Silk Wareho... 5100 BD9 5BE
Looks like roughly 98% of the rows in the data file have something that looks like a UK PostCode in the address field. I think this will give us a reasonable data set on which to perform the analysis. Now, how many of those belong to political entities outside the four member UK countries remains to be seen.
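For the record, that share is easy to confirm from the new column (a one-off check of my own):
# fraction of rows where something PostCode-shaped was found
print(f"{addrs['pcode'].notna().mean():.1%}")  # 98.3%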
But I think I will save the modified dataframe to the updated addresses CSV. Then refactor the code to prevent the above column addition from occurring again.
# let's save the changes
nwa_pth = Path("updated_addresses.csv")
addrs.to_csv(nwa_pth, index=False)
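The refactor itself isn’t shown here; a sketch of the guard I have in mind (not final code) is simply to test for the column after loading the updated CSV:
# only do the extraction if the loaded CSV doesn't already have the column
if 'pcode' not in addrs.columns:
    addrs['pcode'] = addrs['address'].str.extract(
        r',([a-z]{1,2}\d[a-z\d]?\s*\d[a-z]{2})', flags=re.IGNORECASE)
    addrs['pcode'] = addrs['pcode'].str.upper()
    addrs.to_csv(Path("updated_addresses.csv"), index=False)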
Outward Code Only
After sleeping on it, I realized that I only needed to extract the outward portion of the PostCode (the part before the space). As it stands, I would need to split each PostCode to get the outward code when determining the correct city/town for that code. So, going to rework the regex to reflect that decision and regenerate the column accordingly. For now, I won’t bother changing the column name.
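Same sanity check as before, this time confirming only the outward code gets captured (again, a throwaway test of my own):
import re

OUTWARD_RE = re.compile(r",([a-z]{1,2}\d[a-z\d]?)\s*\d[a-z]{2}", re.IGNORECASE)

# the inward portion still has to match, but sits outside the capture group
for s in ("107 SHERINGHAM AVENUE,LONDON,N14 4UJ",
          "43 SUNNINGDALE,YATE,BRISTOL,ENGLAND,BS37 4HZ"):
    m = OUTWARD_RE.search(s)
    print(m.group(1).upper() if m else "no match")
# N14
# BS37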
addrs['pcode'] = addrs['address'].str.extract(r',([a-z]{1,2}\d[a-z\d]?)\s*\d[a-z]{2}', flags=re.IGNORECASE)
# for future use, convert all pcode values to upper case
addrs['pcode'] = addrs['pcode'].str.upper()
# let's save the changes
nwa_pth = Path("updated_addresses.csv")
addrs.to_csv(nwa_pth, index=False)
And that seems to have worked as desired.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 100000 non-null object
2 total_spend 100000 non-null int64
3 pcode 98332 non-null object
dtypes: int64(2), object(2)
memory usage: 3.1+ MB
0 APARTMENT 2,52 BEDFORD ROAD,LONDON,ENGLAND,SW4 7HJ SW4
1 107 SHERINGHAM AVENUE,LONDON,N14 4UJ N14
2 43 SUNNINGDALE,YATE,BRISTOL,ENGLAND,BS37 4HZ BS37
3 HAWESWATER HOUSE,LINGLEY MERE BUSINESS PARK,LINGLEY GREEN AVENUE,GREAT SANKEY, WARRINGTON,WA5 3LP WA5
4 AMBERFIELD BARN HOUSE AMBER LANE,CHART SUTTON,MAIDSTONE,ENGLAND,ME17 3SF ME17
address pcode
99995 KNAPP HOUSE,SELSLEY WEST,STROUD,GLOUCESTERSHIRE,GL5 5LJ GL5
99996 20 The Broadway, Hampton Court,Way, Thames Ditton,Surrey,Kt7 0Lx KT7
99997 MARLAND HOUSE,13 HUDDERSFIELD ROAD,BARNSLEY,SOUTH YORKSHIRE,ENGLAND,S70 2LW S70
99998 4 MOUNT SCAR VIEW,SCHOLES,HOLMFIRTH HUDDERSFIELD,WEST YORKSHIRE,HD9 1XH HD9
99999 Manningham Mills Community Center, Silk Warehouse,Lilycroft Road,Bradford,United Kingdom,Bd9 5Be BD9
Done
I think that’s it for this post. I seem to be making some progress, and it feels like a good time to take a break.