Okay, last time we got the data file supplied by management into a dataframe. Fixed the empty/missing values. Well, fixed might be a misnomer. Then we extracted any outward portion of UK-style postcodes found in the address field into a new column, pcode. Let's see if we can use that new column to determine a city/town for any row with a non-empty pcode field.
Outer Code to Town
UK postcodes consist of an outward code followed by an inward code. The former consists of a postcode area followed by a postcode district. For our purposes, we can use the outward code to determine the destination city/town.
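For instance, a quick illustrative split (the regex here is a simplification I'm using just for the example; real postcode validation has more edge cases):
# illustrative only: split a full UK postcode into outward and inward parts
# (assumes a well-formed postcode; real-world validation is messier)
import re

m = re.match(r"^([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$", "SW1A 1AA")
if m:
    outward, inward = m.groups()
    print(outward, inward)  # SW1A 1AA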
A web search found this page: United Kingdom ZIP Codes: Full List Of UK Postal Codes by States. A little mislabelled, m'thinks. But it provided a list of Post Town against outward codes, which seemed to be exactly what I was looking for. (I later found List of postcode districts in the United Kingdom, but I had already used the data in the earlier page. Didn't feel like reworking things. If I were a paid data analyst, that opinion might have changed.)
Now, unfortunately, both sources were tables on a web page. I couldn't find a CSV version of either. So some file processing ahead. I copied the table into a file, uk_towns.txt. But because I copied a table from a web page, there were some newlines I really didn't want. For example:
Birmingham B B1, B2, B3, B4, B5, B6, B7, B8, B9,
B10, B11, B12, B13, B14, B15, B16, B17, B18, B19, B20, B21, B23, B24, B25, B26, B27, B28, B29, B30, B31, B32, B33, B34, B35, B36, B37, B38, B40, B42, B43, B44, B45, B46, B47, B48,
B99 (West Midlands)
So within the editor (VS Code) I replaced r',\n' with r', ' to match the formatting elsewhere in the file. Now the file has four fields delimited by a space plus a tab: Post Town, Area, Postcode Districts, Former County. Looking at the sample of the file data above, what I need to do is generate a dictionary (or dataframe) that links each of the outward codes in the Postcode Districts field to the Post Town.
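Incidentally, the same cleanup could be done in Python rather than the editor. A minimal sketch, assuming every unwanted newline immediately follows a comma, as in the sample above:
# alternative to the editor find/replace: patch the copied table in place
# (assumption: every stray newline immediately follows a comma)
import re

with open("uk_towns.txt", encoding="utf-8") as fin:
    raw = fin.read()
with open("uk_towns.txt", "w", encoding="utf-8") as fout:
    fout.write(re.sub(r",\n", ", ", raw))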
I will traverse the file and generate said dictionary. I will then load that into a dataframe and save it as a CSV file. The latter will allow me to reload the outward code to city/town data without the need to regenerate it every time.
... ...
def make_pc_2_town(f_nm):
    """Build a dict mapping each outward code to its post town."""
    pc_2_t = {}
    with open(f_nm, encoding='utf-8') as p2t:
        # skip header row
        next(p2t)
        for line in p2t:
            # fields are delimited by a space followed by a tab
            data = line.strip().split(' \t')
            # print(data)
            # third field holds the comma-separated outward codes
            pc_outs = data[2]
            outers = pc_outs.split(", ")
            for pc_out in outers:
                pc_2_t[pc_out] = data[0].strip().capitalize()
            # break
    return pc_2_t
... ...
# okay, let's convert the town to postcode-districts data into something more useful for the analysis
o2t = make_pc_2_town("uk_towns.txt")
out_2_town = pd.DataFrame.from_dict(o2t, orient="index", columns=["town"])
out_2_town.index.name = "outer"
# keep the index: it holds the outward codes
out_2_town.to_csv("out_2_town.csv")
# let's have a look
print()
out_2_town.info()
pd.set_option('display.max_colwidth', None)
print("\n", out_2_town.head(), "\n", out_2_town.tail())
And the output from the above is as follows.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
Index: 2990 entries, AB10 to ZE3
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 town 2990 non-null object
dtypes: object(1)
memory usage: 46.7+ KB
town
outer
AB10 Aberdeen
AB11 Aberdeen
AB12 Aberdeen
AB15 Aberdeen
AB16 Aberdeen
town
outer
YO22 Whitby
YO25 Driffield
ZE1 Shetland
ZE2 Shetland
ZE3 Shetland
I am going to refactor my code to remove the conversion and load the data from the CSV file that was generated.
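The reload itself should be a one-liner. A minimal sketch, assuming the CSV was written with the outward codes included (hence keeping the index above):
# reload the lookup table, restoring the outward codes as the index
out_2_town = pd.read_csv("out_2_town.csv", index_col="outer")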
Bit of an issue: I should have used str.title() instead of str.capitalize() in that function. The second word in town names like the following should also be capitalized.
B64,Cradley heath
B65,Rowley regis
Rather than recreate everything, I will just try changing it in the dataframe loaded from the CSV file and then save it again.
# Bit of an issue, should have used `str.title()` instead of `str.capitalize()`
# Don't want to rerun function, so alter dataframe and resave to CSV
out_2_town.town = out_2_town.town.str.title()
out_2_town.to_csv("out_2_town.csv")
And that seemed to work just fine.
B64,Cradley Heath
B65,Rowley Regis
Add Town to Data Dataframe and CSV
Okay, will add an empty column to the dataframe. Then the idea is to iterate over the data rows in the dataframe checking for a few different situations. Namely:
- if the address field is No-Address, set town to No-Town
- if there is no outward code:
  - if the address contains a non-UK political entity, set town to Not-UK
  - otherwise set town to No-PCode
- if there is an outward code, get the town from the out_2_town dataframe:
  - if the outward code is not found in out_2_town, set town to Bad-Outward and record the outward code in a set
I started by coding the above with a limited number of non-UK entities. Then I went through all the rows marked as “No-PCode” to find any non-UK entities I was not yet checking for. Took quite some time, and the list ended up huge.
I was originally using an if/elif block to check for non-UK countries. But with all the new ones I was finding, I put the names in a set and wrote a small function to check an address for any of the entities, returning an appropriate value. That removed a lot of lines of code. Here’s more or less what I ended up with. Do note, it takes some time to run.
While working on this code, I was writing the new dataframe to a different CSV file. I will leave that as is and when I refactor the code I will load the new CSV file instead of the older one.
... ...
non_uk = {
    ",australia", ",austria",
    ",bahamas", ",barbados", ",belgium", ",belize", ",bermuda",
    ",canada,", ",cayman islands", ",channel isl", "china", ",curacao", ",cyprus", ",czech republic",
    ",denmark",
    ",finland", ",france",
    ",germany", ",gibraltar", ",greece", "grenadines",
    ",hong kong",
    ",india", ",ireland", ",israel", "istanbul", ",italy",
    ",japan",
    ",kazakhstan",
    ",liberia", ",liechtenstein", ",lithuania", ",luxembourg",
    ",malawi", ",malaysia", ",malta", ",marshall islands", ",mauritius", ",monaco",
    "netherlands", ",new zealand", ",nigeria", "norway",
    ",pakistan", ",panama", ",poland", ",portugal", ",qatar",
    ",romania", ",russia",
    ",saint kitts and nevis", ",samoa", ",serbia", ",seychelles", ",sierra leone",
    ",singapore", ",slovenia", ",south korea", ",spain", ",sri lanka", ",sweden", ",switzerland",
    ",turkey", "turks and caicos",
    ",united arab emirates", ",united states", ",uruguay", ",usa", ",virgin island",
    ",zimbabwe",
}
def chk_4_not_uk(s_addr):
    # return "Not-UK" if the address mentions a non-UK entity, else "No-PCode"
    for s_ctry in non_uk:
        if s_ctry in s_addr.lower():
            return "Not-UK"
    return "No-PCode"
... ...
missing = set()
addrs['town'] = None
for ndx in addrs.index:
    if addrs.at[ndx, "address"] == "No-Address":
        addrs.at[ndx, "town"] = "No-Town"
    elif pd.isnull(addrs.at[ndx, "pcode"]):
        addrs.at[ndx, "town"] = chk_4_not_uk(addrs.at[ndx, "address"])
    else:
        try:
            addrs.at[ndx, "town"] = out_2_town.at[addrs.at[ndx, "pcode"], "town"]
        except KeyError:
            # outward code not in the lookup table
            addrs.at[ndx, "town"] = "Bad-Outward"
            missing.add(addrs.at[ndx, "pcode"])
addrs.to_csv("updt_addr_2.csv", index=False)
print()
addrs.info()
pd.set_option('display.max_colwidth', None)
print("\n", addrs[["pcode", "town"]].head(), "\n", addrs[["pcode", "town"]].tail())
print()
print(f"missing: {missing}")
Using .at as recommended in the pandas documentation; for single-value access like this it is apparently faster than .loc.
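If you're curious, here's a rough micro-benchmark sketch of that claim on a throwaway dataframe (absolute timings will vary by machine and pandas version; the relative ordering is the point):
# compare scalar access via .loc and .at
import timeit

import pandas as pd

df = pd.DataFrame({"x": range(100_000)})
print("loc:", timeit.timeit(lambda: df.loc[500, "x"], number=100_000))
print("at: ", timeit.timeit(lambda: df.at[500, "x"], number=100_000))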
The output from the above is as follows.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_id 100000 non-null int64
1 address 100000 non-null object
2 total_spend 100000 non-null int64
3 pcode 98332 non-null object
4 town 100000 non-null object
dtypes: int64(2), object(3)
memory usage: 3.8+ MB
pcode town
0 SW4 London
1 N14 London
2 BS37 Bristol
3 WA5 Warrington
4 ME17 Maidstone
pcode town
99995 GL5 Stroud
99996 KT7 Thames Ditton
99997 S70 Barnsley
99998 HD9 Holmfirth
99999 BD9 Bradford
missing: {'CF38', 'X9', 'W1R', 'EC2', 'BS12', 'RG11', 'W1Y', 'EC1', 'EC29', 'W1', 'S38', 'WD17', 'W1N', 'CF1', 'D0', 'BN55', 'RG3', 'WC2', 'YO4', 'GZ11', 'EC8H', 'BS18', 'EC21', 'N01', 'YO6', 'NP9', 'NP1', 'IG18', 'EC28', 'WC1', 'W1X', 'W1V', 'EC3', 'CF8', 'PE18', 'CF6', 'W15', 'W1M'}
Note that every row in the dataframe has a value for town.
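A quick programmatic check of that, rather than eyeballing the info() output:
# sanity check: the loop above should have filled in a town for every row
assert addrs["town"].notna().all()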
Some example rows selected to show the various situations.
66182,"560 LONDON ROAD,BRANDSHILL,SLOUGH,BERKSHIRE, SL3 8QF",6500,,No-PCode
66183,"14 David Mews,London,England,W1U 6Eq",3600,W1U,London
66184,"57 WELLINGTON STREET,ABERDEEN,SCOTLAND,AB11 5BX",4200,AB11,Aberdeen
66201,No-Address,6500,,No-Town
66669,"PRINSENHOF BUILDING,KONINGIN WILHELMINAPLEIN 30,AMSTERDAM,1062KR,NETHERLANDS",2600,,Not-UK
96286,"6 6 GEDNEY WAY,HUNTINGDONSHIRE,CAMBRIDGESHIRE,ENGLAND",4000,,No-PCode
96544,"TOP FLOOR MILL COURT,LA CHARROTERIE,ST PETER PORT,GUERNSEY,GY1 1EJ",5200,GY1,Not-UK
96666,"Lee Valley Technopark,Ashley Road,Tottenham,London N179Ln",7000,,No-PCode
Obviously my regex failed in a number of cases. But I don’t think it would be necessary to account for all those cases the way I did for the non-UK countries.
Bug
Bit of a bug that needed fixing. In the outward code to town CSV file there are duplicate entries for some of the outward codes for Guernsey and Jersey. This caused some strange things to be written to the town field in the updated data file. Fixed that.
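I won't reproduce the whole fix here, but a minimal de-duplication sketch, assuming keeping the first town seen for each duplicated outward code is acceptable:
# drop duplicate outward codes, keeping the first town seen for each
out_2_town = out_2_town[~out_2_town.index.duplicated(keep="first")]
out_2_town.to_csv("out_2_town.csv")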
Total Company Spend by Town
Okay, on to the next step. Let’s get what might initially be considered the goal of this whole analysis: the total of company spending by city/town. Seems like the perfect use for groupby and sum.
# okay let's get a spend total for each city/town
spend_by_town = addrs.groupby('town')['total_spend'].sum()
spend_by_town.to_csv("spend_by_town.csv")
# let's see what we have
print()
spend_by_town.info()
pd.set_option('display.max_colwidth', None)
print("\n", spend_by_town.head(), "\n\n", spend_by_town.tail())
And, the terminal output is as follows.
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.series.Series'>
Index: 1331 entries, Abbots Langley to Ystrad Meurig
Series name: total_spend
Non-Null Count Dtype
-------------- -----
1331 non-null int64
dtypes: int64(1)
memory usage: 20.8+ KB
town
Abbots Langley 43000
Aberaeron 8300
Aberdare 108500
Aberdeen 1322800
Aberdyfi 8300
Name: total_spend, dtype: int64
town
Yateley 60700
Yelverton 29700
Yeovil 239000
York 1760400
Ystrad Meurig 8500
Name: total_spend, dtype: int64
Let’s sort the series on the total spend and have another look.
# let's sort on total_spend
spend_sorted = spend_by_town.sort_values(ascending=False)
print()
print(spend_sorted.head(10))
print()
print(spend_sorted.tail())
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
town
London 102377000
Birmingham 9264400
Manchester 8948200
Glasgow 6514300
Bristol 5843900
Leeds 5172400
Edinburgh 4937600
No-Town 4792100
Leicester 4585300
Nottingham 4277900
Name: total_spend, dtype: int64
town
Isle Of Barra 2800
Lochearnhead 2700
Boncath 2100
Badminton 1700
Galston 500
Name: total_spend, dtype: int64
Now What
I was going to look at the London versus not-London totals, as well as some number of the cities/towns with the lowest total spend. But looking at the preceding output, there are 1331 entries in the series, though some are not relevant for our purposes, e.g. Not-UK.
And some of those towns have such small populations that expending energy to get more spend there is likely a wasted endeavour. Now, since I am trying to work through this project without reading any more of the book’s chapter, and I don’t have any management to ask for more details about their desires, I am going to make some decisions on their behalf.
I need some way to sort out which towns are big enough to be worth the effort, and/or to combine towns into regions comparable to the larger cities. Will think about that, do some research, and tackle it in the next post.
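As a teaser, one possible first cut, using an entirely arbitrary placeholder threshold:
# filter out towns below a (made-up) spend threshold
big_enough = spend_sorted[spend_sorted >= 100_000]
print(f"{len(big_enough)} of {len(spend_sorted)} towns at or above the threshold")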
I guess that means this post is done. You know, I am quite enjoying this endeavour. Whether I ever do all the book’s projects is up in the air, but this one does have my interest. And I am looking forward to what the author suggests/does once I read the rest of the chapter.
Hope you are enjoying whatever you are currently coding.
Resources
- Group by: split-apply-combine
- pandas.DataFrame.at
- pandas.DataFrame.sum
- pandas.Series.sort_values