Last post I added the Post town for each row in the dataset. Well, at least for those rows with a UK Postcode from which I could extract the outward code. The outward code was then used to identify the Post town from a table I downloaded from the internet.
But that was a lot of individual towns. Over 1300 of them. And a lot of those are going to be ones with relatively small populations. So it made me wonder about the validity of comparing total spend across all those towns, especially comparing the small ones against the towns with significant populations. With no way to ask management what they are really after, I decided to look at two options.
- only include towns with a significant population in the analysis
- include larger towns and regional areas in the analysis
For the latter I was planning to use the Postcode area contained in the Postcode outward code.
Postcode area
The postcode area is part of the outward code. The postcode area is either one or two characters long and is alphabetical, with there being 121 such areas. Examples of postcode areas are “L” for Liverpool, “RH” for Redhill and “EH” for Edinburgh. A postal area may cover a wide area, for example “RH” covers various settlements in eastern Surrey and north eastern West Sussex, and “BT” (Belfast) covers the whole of Northern Ireland.
Postcodes in the United Kingdom on Wikipedia
121 towns/areas sounds like a better choice than 1300 towns.
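To make that concrete, here is a tiny sketch of pulling the area out of a few sample outward codes, the same idea I use against the real data further down. The sample codes are just illustrative.

import re

# postcode area = the leading one or two letters of the outward code
for outward in ["L1", "RH10", "EH3", "BT9"]:
    m = re.match(r"^([A-Za-z]{1,2})\d", outward)
    print(outward, "->", m.group(1) if m else "no match")
# prints: L1 -> L, RH10 -> RH, EH3 -> EH, BT9 -> BT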
For the former I looked at a table of population by UK city found on the World Population Review.
United Kingdom has 1 city with more than a million people, 95 cities with between 100,000 and 1 million people, and 951 cities with between 10,000 and 100,000 people.
Population of Cities in United Kingdom 2024
If we limit the analysis to towns with a population of 100,000 or more, we would only have 96 entities in our analysis, which also strikes me as better than 1300-plus.
I expect either choice will provide entities that are more truly comparable. I favour the Postcode areas approach as it covers the whole of the UK. But until we look at the data it is hard to say for sure. So, I plan to add the data needed for both approaches to the dataset CSV file and then see if one approach looks better than the other.
But looking at the above quotes, the population for London (which I expect means Greater London) probably covers a much larger area than the outward-code version of London I have used so far. But…
I am also going to add the appropriate country to the dataset. It may make some sense to look at the countries separately. Don’t know. But since I am already iterating through the dataset, it won’t cost much to add that info. And the country was in the Postcode area table I put in the CSV file.
Add New Data
For the population data, I gave my e-mail address to the site in order to get a CSV of the population data. The table on the site only showed 400 towns. Well, I didn’t gain much: the CSV file I received also only had 400 towns, the last one being Winsford with a population of 30,259.
For the postcode areas I copied the data tables from the web page into a file, then manually edited it (with regexes) to produce a suitable CSV file I can load into dictionaries or dataframes (a sketch of that kind of clean-up follows below). I also had to make an assumption when the table listed two countries for one of the areas: I went by the largest town in the area. There are also some areas with “/” separated names; for now I am leaving them as is, though I may, in future, go with the shortest name shown.
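For illustration, a minimal sketch of that kind of regex clean-up. The sample input lines here are made up, and the real pasted table was messier and needed several passes.

import re

# Made-up sample of lines as pasted from the postcode-area tables.
raw = """AB Aberdeen Scotland
EH Edinburgh Scotland
RH Redhill England"""

rows = ["PC area,aname,country"]
for line in raw.splitlines():
    # area code, area name (possibly multi-word), country (single word here;
    # "Northern Ireland" and the two-country rows needed hand editing)
    m = re.match(r"^([A-Z]{1,2})\s+(.+?)\s+(\w+)$", line)
    if m:
        rows.append(",".join(m.groups()))

print("\n".join(rows))
# PC area,aname,country
# AB,Aberdeen,Scotland
# EH,Edinburgh,Scotland
# RH,Redhill,England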
Okay, I load all the applicable CSV files (dataset and reference tables).
from pathlib import Path
import re

import pandas as pd

upa_pth = Path("updt_addr_2.csv")
if upa_pth.exists():
    # print("\n\tloading updated addresses CSV file")
    addrs = pd.read_csv(upa_pth, sep=",", header=0)
else:
    # print("\n\tloading initial addresses CSV file")
    # init_df() is my helper from the earlier posts in this series
    addrs = init_df("addresses.csv")
# load population and postcode area look up tables
acodes = pd.read_csv("uk_area_country.csv", sep=",", header=0, index_col='PC area')
t_pops = pd.read_csv("pop_by_town.csv", sep=",", header=0, index_col='Town')
# quick check of dataframes
print()
acodes.info()
print()
t_pops.info()
<class 'pandas.core.frame.DataFrame'>
Index: 122 entries, AB to ZE
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   aname    122 non-null    object
 1   country  122 non-null    object
dtypes: object(2)
<class 'pandas.core.frame.DataFrame'>
Index: 400 entries, London to Winsford
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   pop2024  400 non-null    int64
dtypes: int64(1)
memory usage: 6.2+ KB
Add Postcode Area Code Column
Then I started the data additions by extracting the area code from the outward code (already in the CSV file) into a new column. Need that to access the Postcode area data.
# extract area code from outward code (if outward code exists);
# rows without a valid outward code end up as NaN
addrs['acode'] = addrs['pcode'].str.extract(r'(^[a-z]{1,2})\d[a-z\d]?$', flags=re.IGNORECASE)
print()
addrs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
 3   pcode        98332 non-null   object
 4   town         100000 non-null  object
 5   acode        98332 non-null   object
dtypes: int64(2), object(4)
memory usage: 4.6+ MB
Add Three New Columns
Okay, on to the real data: area name, country and town population, taking care to deal with keys that don’t exist in one or other of the lookup tables. I am using get() to access the lookup tables as it provides a default value if a matching entry is not found in the dataframe/series.
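A quick toy illustration of that default-value behaviour first; this is made-up data, not the real lookup table.

import pandas as pd

countries = pd.Series({"AB": "Scotland", "EH": "Scotland"})
print(countries.get("EH", "Not-UK"))  # Scotland
print(countries.get("ZZ", "Not-UK"))  # Not-UK, no KeyError raised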
# okay, let's traverse the dataset dataframe and add the area name, country and population,
# taking care to deal with keys that don't exist in one or other of the lookup tables
addrs["aname"] = None
addrs["country"] = None
addrs["t_pop"] = 0
for ndx in addrs.index:
    addrs.at[ndx, "aname"] = acodes.aname.get(addrs.at[ndx, "acode"], "No-Area")
    addrs.at[ndx, "country"] = acodes.country.get(addrs.at[ndx, "acode"], "Not-UK")
    addrs.at[ndx, "t_pop"] = t_pops.pop2024.get(addrs.at[ndx, "town"], None)
print()
addrs.info()
addrs.to_csv("updt_addr_2.csv", index=False)
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
 3   pcode        98332 non-null   object
 4   town         100000 non-null  object
 5   acode        98332 non-null   object
 6   aname        100000 non-null  object
 7   country      100000 non-null  object
 8   t_pop        71133 non-null   float64
dtypes: float64(1), int64(2), object(6)
memory usage: 6.9+ MB
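As an aside, the row-by-row loop could probably be replaced with vectorized map() calls. A sketch follows, assuming the same dataframes and column names as above; I have not benchmarked it or verified it against the loop’s output.

# vectorized alternative to the per-row loop (untested sketch)
addrs["aname"] = addrs["acode"].map(acodes["aname"]).fillna("No-Area")
addrs["country"] = addrs["acode"].map(acodes["country"]).fillna("Not-UK")
# towns missing from the population table end up as NaN, as in the loop
addrs["t_pop"] = addrs["town"].map(t_pops["pop2024"])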
One thing I noticed when looking at the updated CSV file is that some of the Bad-Outward rows, which have no town assigned, did get an area assigned. So a slight gain in useable data rows.
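If I wanted to quantify that gain, something like the following should do it; I am assuming here that “Bad-Outward” is the sentinel value in the town column, per the earlier post.

# rows with no usable town but a usable postcode area
# ("Bad-Outward" sentinel is an assumption; adjust to your own marker)
recovered = addrs[(addrs["town"] == "Bad-Outward") & (addrs["aname"] != "No-Area")]
print(f"rows gaining an area despite a bad town: {len(recovered)}")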
Let’s Do Some Simple Comparisons of the Three Approaches
I am not going to say much here; the code and its output should tell you what I did.
# spend by town (the ~1300 town approach)
spend_by_town = addrs.groupby('town')['total_spend'].sum()
all_town_spend = spend_by_town.sum(skipna=True)
london_town_sp = spend_by_town.loc['London']
not_london_town = all_town_spend - london_town_sp
spend_town_sorted = spend_by_town.sort_values(ascending=False)

# spend by city with a population of 100,000 or more
pop_gt_100k = addrs[addrs["t_pop"] >= 100000]
spend_city_pop = pop_gt_100k.groupby('town')['total_spend'].sum()
spend_city_pop.to_csv("spend_city_pop.csv")
all_pop_spend = spend_city_pop.sum(skipna=True)
london_pop_sp = spend_city_pop.loc['London']
not_london_pop = all_pop_spend - london_pop_sp
spend_pop_sorted = spend_city_pop.sort_values(ascending=False)

# spend by postcode area
have_area = addrs[addrs["aname"] != "No-Area"]
spend_area = have_area.groupby('aname')['total_spend'].sum()
spend_area.to_csv("spend_area.csv")
all_area_spend = spend_area.sum(skipna=True)
london_area_sp = spend_area["London E":"London WC"].sum()
not_london_area = all_area_spend - london_area_sp
spend_area_sorted = spend_area.sort_values(ascending=False)

print(f"\n{'':>15} {'All 1300 Towns':>15} {'By PCode Area':>15} {'City over 100K':>15}")
print(f"{'All locations:':>15} {int(all_town_spend):>15,} {int(all_area_spend):>15,} {int(all_pop_spend):>15,}")
print(f"{'London:':>15} {int(london_town_sp):>15,} {int(london_area_sp):>15,} {int(london_pop_sp):>15,}")
print(f"{'Not London:':>15} {int(not_london_town):>15,} {int(not_london_area):>15,} {int(not_london_pop):>15,}")

# Now look at the top 6 in each case.
# Need to make an adjustment for the area approach, as London areas are 8 of the top 10.
# top 5 excluding London
t5_towns = spend_town_sorted.index[1:6]
t5t_sp = pd.Series({ndx: spend_town_sorted[ndx] for ndx in t5_towns})
# hand-picked slices skip the London areas in the sorted area series
# (a less brittle way to skip them is sketched after the output below)
t5_areas = spend_area_sorted.index[3:4]
t5_area_tmp = spend_area_sorted.index[6:7]
t5_areas = t5_areas.union(t5_area_tmp)
t5_area_tmp = spend_area_sorted.index[10:13]
t5_areas = t5_areas.union(t5_area_tmp)
t5a_sp = pd.Series({ndx: spend_area_sorted[ndx] for ndx in t5_areas})
t5_pop = spend_pop_sorted.index[1:6]
t5p_sp = pd.Series({ndx: spend_pop_sorted[ndx] for ndx in t5_pop})
t5_comb = t5_towns.union(t5_areas)
t5_comb = t5_comb.union(t5_pop)
print("\nTop 6 Locations\n")
print(f"\n{'':>10} {'All 1300 Towns':>15} {'By PCode Area':>15} {'City over 100K':>15}")
print(f"{'London':>10} {london_town_sp:>15,} {london_area_sp:>15,} {london_pop_sp:>15,}")
for ndx in t5_comb:
    print(f"{ndx:>10} {int(t5t_sp.get(ndx, 0)):>15,} {int(t5a_sp.get(ndx, 0)):>15,} {int(t5p_sp.get(ndx, 0)):>15,}")

# Now the bottom 6 in each case
b6_towns = spend_town_sorted.index[-6:]
b6t_sp = pd.Series({ndx: spend_town_sorted[ndx] for ndx in b6_towns})
b6_areas = spend_area_sorted.index[-6:]
b6a_sp = pd.Series({ndx: spend_area_sorted[ndx] for ndx in b6_areas})
b6_pop = spend_pop_sorted.index[-6:]
b6p_sp = pd.Series({ndx: spend_pop_sorted[ndx] for ndx in b6_pop})
b6_comb = b6_towns.union(b6_areas)
b6_comb = b6_comb.union(b6_pop)
print("\nBottom 6 Locations\n")
print(f"\n{'':>21} {'All 1300 Towns':>15} {'By PCode Area':>15} {'City over 100K':>15}")
for ndx in b6_comb:
    print(f"{ndx:>21} {int(b6t_sp.get(ndx, 0)):>15,} {int(b6a_sp.get(ndx, 0)):>15,} {int(b6p_sp.get(ndx, 0)):>15,}")
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
                 All 1300 Towns   By PCode Area  City over 100K
 All locations:     495,166,200     485,898,300     255,598,300
        London:     102,377,000     102,550,700     102,377,000
    Not London:     392,789,200     383,347,600     153,221,300
Top 6 Locations
            All 1300 Towns   By PCode Area  City over 100K
    London     102,377,000     102,550,700     102,377,000
   Belfast               0       7,462,500               0
Birmingham       9,264,400      15,456,800       9,264,400
   Bristol       5,843,900               0       5,843,900
   Glasgow       6,514,300               0       6,514,300
    Harrow               0       7,785,400               0
     Leeds       5,172,400               0       5,172,400
 Leicester               0       7,066,600               0
Manchester       8,948,200      11,061,700       8,948,200
Bottom 6 Locations
                       All 1300 Towns   By PCode Area  City over 100K
            Badminton           1,700               0               0
               Bexley               0               0         214,000
           Birkenhead               0               0         312,000
              Boncath           2,100               0               0
              Burnley               0               0         452,800
Dumfries and Galloway               0         403,900               0
              Galston             500               0               0
           Gillingham               0               0         564,000
             Hastings               0               0         265,000
        Isle Of Barra           2,800               0               0
          Isle of Man               0         342,800               0
             Kirkwall               0          97,000               0
    Lerwick / Zetland               0         118,300               0
    Llandrindod Wells               0         228,300               0
         Lochearnhead           2,700               0               0
             Munlochy           2,800               0               0
       Outer Hebrides               0         108,100               0
        West Bromwich               0               0         378,000
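As an aside, the hard-coded index slices I used above to skip the London areas are brittle; they break the moment the rankings shift. A more robust version might filter on the area name instead. A sketch, assuming all eight London areas start with “London”, as they do in my area table:

# top 5 non-London areas by spend, without hard-coded slice positions
non_london = spend_area_sorted[~spend_area_sorted.index.str.startswith("London")]
t5a_alt = non_london.head(5)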
Done
I think this post is long enough and has taken me more than enough time to produce (coding and documenting).
I still have things I’d like to look at, e.g. city spend volumes by country rather than for the whole UK.
Perhaps also a look at how to present the data to management, i.e. tables, charts, graphics or … Not really sure how to deal with that one.
May your coding journeys be fruitful and fun.
Resources
- pandas.Index.union
- pandas.Series.get
- How do I select a subset of a DataFrame?
- A Guide to Formatting with f-strings in Python (PDF)