Okay, last time we got the data file supplied by management into a dataframe. Fixed the empty/missing values. Well, fixed might be a misnomer. Then we extracted any outward portion of UK-style postcodes found in the address field into a new column, pcode. Let's see if we can use that new column to determine a city/town for any row with a non-empty pcode field.
Outer Code to Town
UK postcodes consist of an outward code followed by an inward code. The former consists of a postcode area followed by a postcode district. For our purposes, we can use the outward code to determine the destination city/town.
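For instance, a quick illustrative split (the regex here is a simplification I'm using just for the example; real postcode validation has more edge cases):
# illustrative only: split a full UK postcode into outward and inward parts
# (assumes a well-formed postcode; real-world validation is messier)
import re

m = re.match(r"^([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$", "SW1A 1AA")
if m:
    outward, inward = m.groups()
    print(outward, inward)  # SW1A 1AA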
A web search found this page: United Kingdom ZIP Codes: Full List Of UK Postal Codes by States. A little mislabelled, m'thinks. But it provided a list of Post Town against outward codes, which seemed to be exactly what I was looking for. (I later found List of postcode districts in the United Kingdom, but I had already used the data in the earlier page. Didn't feel like reworking things. If I were a paid data analyst, that opinion might have changed.)
Now, unfortunately, both sources were tables on a web page. I couldn't find a CSV version of either. So some file processing ahead. I copied the table into a file, uk_towns.txt. But because I copied a table from a web page, there were some newlines I really didn't want. For example:
Birmingham B B1, B2, B3, B4, B5, B6, B7, B8, B9,
B10, B11, B12, B13, B14, B15, B16, B17, B18, B19, B20, B21, B23, B24, B25, B26, B27, B28, B29, B30, B31, B32, B33, B34, B35, B36, B37, B38, B40, B42, B43, B44, B45, B46, B47, B48,
B99 (West Midlands)
So within the editor (VS Code) I replaced r',\n' with r', ' to match the formatting elsewhere in the file. Now the file has four fields delimited by a space plus a tab: Post Town, Area, Postcode Districts, Former County. Looking at the sample of the file data above, what I need to do is generate a dictionary (or dataframe) that links each of the outward codes in the Postcode Districts field to the Post Town.
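Incidentally, the same cleanup could be done in Python rather than the editor. A minimal sketch, assuming every unwanted newline immediately follows a comma, as in the sample above:
# alternative to the editor find/replace: patch the copied table in place
# (assumption: every stray newline immediately follows a comma)
import re

with open("uk_towns.txt", encoding="utf-8") as fin:
    raw = fin.read()
with open("uk_towns.txt", "w", encoding="utf-8") as fout:
    fout.write(re.sub(r",\n", ", ", raw))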
I will traverse the file and generate said dictionary. I will then load that into a dataframe and save it as a CSV file. The latter will allow me to reload the outward code to city/town data without the need to regenerate it every time.
... ...
def make_pc_2_town(f_nm):
    """Build a dict mapping each outward code to its post town."""
    pc_2_t = {}
    with open(f_nm, encoding='utf-8') as p2t:
        # skip header row
        next(p2t)
        for line in p2t:
            # fields are delimited by a space followed by a tab
            data = line.strip().split(' \t')
            # print(data)
            # third field holds the comma-separated outward codes
            pc_outs = data[2]
            outers = pc_outs.split(", ")
            for pc_out in outers:
                pc_2_t[pc_out] = data[0].strip().capitalize()
            # break
    return pc_2_t
... ...
# okay, let's convert the town to postcode-districts data into something more useful for the analysis
o2t = make_pc_2_town("uk_towns.txt")
out_2_town = pd.DataFrame.from_dict(o2t, orient="index", columns=["town"])
out_2_town.index.name = "outer"
# keep the index: it holds the outward codes
out_2_town.to_csv("out_2_town.csv")
# let's have a look
print()
out_2_town.info()
pd.set_option('display.max_colwidth', None)
print("\n", out_2_town.head(), "\n", out_2_town.tail())
And the output from the above is as follows.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
Index: 2990 entries, AB10 to ZE3
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 town 2990 non-null object
dtypes: object(1)
memory usage: 46.7+ KB
town
outer
AB10 Aberdeen
AB11 Aberdeen
AB12 Aberdeen
AB15 Aberdeen
AB16 Aberdeen
town
outer
YO22 Whitby
YO25 Driffield
ZE1 Shetland
ZE2 Shetland
ZE3 Shetland
I am going to refactor my code to remove the conversion and load the data from the CSV file that was generated.
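The reload itself should be a one-liner. A minimal sketch, assuming the CSV was written with the outward codes included (hence keeping the index above):
# reload the lookup table, restoring the outward codes as the index
out_2_town = pd.read_csv("out_2_town.csv", index_col="outer")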
Bit of an issue: I should have used str.title() instead of str.capitalize() in that function. The second word in town names like the following should also be capitalized.
B64,Cradley heath
B65,Rowley regis
Rather than recreate everything, I will just try changing it in the dataframe loaded from the CSV file and then save it again.
# Bit of an issue, should have used `str.title()` instead of `str.capitalize()`
# Don't want to rerun function, so alter dataframe and resave to CSV
out_2_town.town = out_2_town.town.str.title()
out_2_town.to_csv("out_2_town.csv")
And that seemed to work just fine.
B64,Cradley Heath
B65,Rowley Regis
Add Town to Data Dataframe and CSV
Okay, will add an empty column to the dataframe. Then the idea is to iterate over the data rows in the dataframe checking for a few different situations. Namely:
- if the address field is No-Address, set town to No-Town
- if there is no outward code:
  - if the address contains a non-UK political entity, set town to Not-UK
  - otherwise set town to No-PCode
- if there is an outward code, get the town from the out_2_town dataframe:
  - if the outward code is not found in out_2_town, set town to Bad-Outward and record the outward code in a set
I started by coding the above with a limited number of non-UK entities. Then I went through all the rows marked as “No-PCode” to find any non-UK entities I was not yet checking for. Took quite some time, and the list ended up huge.
I was originally using an if/elif block to check for non-UK countries. But with all the new ones I was finding, I put the names in a set and wrote a small function to check an address for any of the entities, returning an appropriate value. That removed a lot of lines of code. Here’s more or less what I ended up with. Do note, it takes some time to run.
While working on this code, I was writing the new dataframe to a different CSV file. I will leave that as is and when I refactor the code I will load the new CSV file instead of the older one.
... ...
non_uk = {
    ",australia", ",austria",
    ",bahamas", ",barbados", ",belgium", ",belize", ",bermuda",
    ",canada,", ",cayman islands", ",channel isl", "china", ",curacao", ",cyprus", ",czech republic",
    ",denmark",
    ",finland", ",france",
    ",germany", ",gibraltar", ",greece", "grenadines",
    ",hong kong",
    ",india", ",ireland", ",israel", "istanbul", ",italy",
    ",japan",
    ",kazakhstan",
    ",liberia", ",liechtenstein", ",lithuania", ",luxembourg",
    ",malawi", ",malaysia", ",malta", ",marshall islands", ",mauritius", ",monaco",
    "netherlands", ",new zealand", ",nigeria", "norway",
    ",pakistan", ",panama", ",poland", ",portugal", ",qatar",
    ",romania", ",russia",
    ",saint kitts and nevis", ",samoa", ",serbia", ",seychelles", ",sierra leone",
    ",singapore", ",slovenia", ",south korea", ",spain", ",sri lanka", ",sweden", ",switzerland",
    ",turkey", "turks and caicos",
    ",united arab emirates", ",united states", ",uruguay", ",usa", ",virgin island",
    ",zimbabwe",
}
def chk_4_not_uk(s_addr):
    # return "Not-UK" if the address mentions a non-UK entity, else "No-PCode"
    for s_ctry in non_uk:
        if s_ctry in s_addr.lower():
            return "Not-UK"
    return "No-PCode"
... ...
missing = set()
addrs['town'] = None
for ndx in addrs.index:
    if addrs.at[ndx, "address"] == "No-Address":
        addrs.at[ndx, "town"] = "No-Town"
    elif pd.isnull(addrs.at[ndx, "pcode"]):
        addrs.at[ndx, "town"] = chk_4_not_uk(addrs.at[ndx, "address"])
    else:
        try:
            addrs.at[ndx, "town"] = out_2_town.at[addrs.at[ndx, "pcode"], "town"]
        except KeyError:
            # outward code not in the lookup table
            addrs.at[ndx, "town"] = "Bad-Outward"
            missing.add(addrs.at[ndx, "pcode"])
addrs.to_csv("updt_addr_2.csv", index=False)
print()
addrs.info()
pd.set_option('display.max_colwidth', None)
print("\n", addrs[["pcode", "town"]].head(), "\n", addrs[["pcode", "town"]].tail())
print()
print(f"missing: {missing}")
Using .at as recommended in the pandas documentation; for single-value access like this it is apparently faster than .loc.
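If you're curious, here's a rough micro-benchmark sketch of that claim on a throwaway dataframe (absolute timings will vary by machine and pandas version; the relative ordering is the point):
# compare scalar access via .loc and .at
import timeit

import pandas as pd

df = pd.DataFrame({"x": range(100_000)})
print("loc:", timeit.timeit(lambda: df.loc[500, "x"], number=100_000))
print("at: ", timeit.timeit(lambda: df.at[500, "x"], number=100_000))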
The output from the above is as follows.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 100000 non-null object
2 total_spend 100000 non-null int64
3 pcode 98332 non-null object
4 town 100000 non-null object
dtypes: int64(2), object(3)
memory usage: 3.8+ MB
pcode town
0 SW4 London
1 N14 London
2 BS37 Bristol
3 WA5 Warrington
4 ME17 Maidstone
pcode town
99995 GL5 Stroud
99996 KT7 Thames Ditton
99997 S70 Barnsley
99998 HD9 Holmfirth
99999 BD9 Bradford
missing: {'CF38', 'X9', 'W1R', 'EC2', 'BS12', 'RG11', 'W1Y', 'EC1', 'EC29', 'W1', 'S38', 'WD17', 'W1N', 'CF1', 'D0', 'BN55', 'RG3', 'WC2', 'YO4', 'GZ11', 'EC8H', 'BS18', 'EC21', 'N01', 'YO6', 'NP9', 'NP1', 'IG18', 'EC28', 'WC1', 'W1X', 'W1V', 'EC3', 'CF8', 'PE18', 'CF6', 'W15', 'W1M'}
Note that every row in the dataframe has a value for town.
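A quick programmatic check of that, rather than eyeballing the info() output:
# sanity check: the loop above should have filled in a town for every row
assert addrs["town"].notna().all()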
Some example rows selected to show the various situations.
66182,"560 LONDON ROAD,BRANDSHILL,SLOUGH,BERKSHIRE, SL3 8QF",6500,,No-PCode
66183,"14 David Mews,London,England,W1U 6Eq",3600,W1U,London
66184,"57 WELLINGTON STREET,ABERDEEN,SCOTLAND,AB11 5BX",4200,AB11,Aberdeen
66201,No-Address,6500,,No-Town
66669,"PRINSENHOF BUILDING,KONINGIN WILHELMINAPLEIN 30,AMSTERDAM,1062KR,NETHERLANDS",2600,,Not-UK
96286,"6 6 GEDNEY WAY,HUNTINGDONSHIRE,CAMBRIDGESHIRE,ENGLAND",4000,,No-PCode
96544,"TOP FLOOR MILL COURT,LA CHARROTERIE,ST PETER PORT,GUERNSEY,GY1 1EJ",5200,GY1,Not-UK
96666,"Lee Valley Technopark,Ashley Road,Tottenham,London N179Ln",7000,,No-PCode
Obviously my regex failed in a number of cases. But I don’t think it would be necessary to account for all those cases the way I did for the non-UK countries.
Bug
Bit of a bug that needed fixing. In the outward code to town CSV file there are duplicate entries for some of the outward codes for Guernsey and Jersey. This caused some strange things to be written to the town field in the updated data file. Fixed that.
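I won't reproduce the whole fix here, but a minimal de-duplication sketch, assuming keeping the first town seen for each duplicated outward code is acceptable:
# drop duplicate outward codes, keeping the first town seen for each
out_2_town = out_2_town[~out_2_town.index.duplicated(keep="first")]
out_2_town.to_csv("out_2_town.csv")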
Total Company Spend by Town
Okay, on to the next step. Let’s get what might initially be considered the goal of this whole analysis: the total of company spending by city/town. Seems like the perfect use for groupby and sum.
# okay let's get a spend total for each city/town
spend_by_town = addrs.groupby('town')['total_spend'].sum()
spend_by_town.to_csv("spend_by_town.csv")
# let's see what we have
print()
spend_by_town.info()
pd.set_option('display.max_colwidth', None)
print("\n", spend_by_town.head(), "\n\n", spend_by_town.tail())
And, the terminal output is as follows.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.series.Series'>
Index: 1331 entries, Abbots Langley to Ystrad Meurig
Series name: total_spend
Non-Null Count Dtype
-------------- -----
1331 non-null int64
dtypes: int64(1)
memory usage: 20.8+ KB
town
Abbots Langley 43000
Aberaeron 8300
Aberdare 108500
Aberdeen 1322800
Aberdyfi 8300
Name: total_spend, dtype: int64
town
Yateley 60700
Yelverton 29700
Yeovil 239000
York 1760400
Ystrad Meurig 8500
Name: total_spend, dtype: int64
Let’s sort the series on the total spend and have another look.
# let's sort on total_spend
spend_sorted = spend_by_town.sort_values(ascending=False)
print()
print(spend_sorted.head(10))
print()
print(spend_sorted.tail())
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
town
London 102377000
Birmingham 9264400
Manchester 8948200
Glasgow 6514300
Bristol 5843900
Leeds 5172400
Edinburgh 4937600
No-Town 4792100
Leicester 4585300
Nottingham 4277900
Name: total_spend, dtype: int64
town
Isle Of Barra 2800
Lochearnhead 2700
Boncath 2100
Badminton 1700
Galston 500
Name: total_spend, dtype: int64
Now What
I was going to look at the London versus not-London totals, as well as some number of the cities/towns with the lowest total spend. But looking at the preceding output, there are 1331 entries in the series, though some are not relevant for our purposes, e.g. Not-UK.
And some of those towns have such small populations that expending energy to get more spend there is likely a wasted endeavour. Now, since I am trying to work through this project without reading any more of the book’s chapter, and I don’t have any management to ask for more details about their desires, I am going to make some decisions on their behalf.
I need some way to sort out which towns are big enough to be worth the effort, and/or to combine towns into regions comparable to the larger cities. Will think about that, do some research, and tackle it in the next post.
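As a teaser, one possible first cut, using an entirely arbitrary placeholder threshold:
# filter out towns below a (made-up) spend threshold
big_enough = spend_sorted[spend_sorted >= 100_000]
print(f"{len(big_enough)} of {len(spend_sorted)} towns at or above the threshold")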
I guess that means this post is done. You know, I am quite enjoying this endeavour. Whether I ever do all the book’s projects is up in the air, but this one does have my interest. And I am looking forward to what the author suggests/does once I read the rest of the chapter.
Hope you are enjoying whatever you are currently coding.
Resources
- Group by: split-apply-combine
- pandas.DataFrame.at
- pandas.DataFrame.sum
- pandas.Series.sort_values