Last post I added the Post town for each row in the dataset. Well, at least for those rows with a UK Postcode from which I could extract the outward code. The outward code was then used to identify the Post town from a table I downloaded from the internet.
But that was a lot of individual towns. Over 1300 of them. And a lot of those are going to be ones with relatively small populations. So it made me wonder about the validity of comparing total spend across all those towns, especially comparing the small ones against the towns with significant populations. With no way to ask management what they are really after, I decided to look at two options.
- only include towns with a significant population in the analysis
- include larger towns and regional areas in the analysis
For the latter I was planning to use the Postcode area contained in the Postcode outward code.
Postcode area
The postcode area is part of the outward code. The postcode area is either one or two characters long and is alphabetical, with there being 121 such areas. Examples of postcode areas are “L” for Liverpool, “RH” for Redhill and “EH” for Edinburgh. A postal area may cover a wide area, for example “RH” covers various settlements in eastern Surrey and north eastern West Sussex, and “BT” (Belfast) covers the whole of Northern Ireland.
Postcodes in the United Kingdom on Wikipedia
121 towns/areas sounds like a better choice than 1300 towns.
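To make that concrete, here is a tiny sketch of pulling the area out of a few sample outward codes, the same idea I use against the real data further down. The sample codes are just illustrative.

import re

# postcode area = the leading one or two letters of the outward code
for outward in ["L1", "RH10", "EH3", "BT9"]:
    m = re.match(r"^([A-Za-z]{1,2})\d", outward)
    print(outward, "->", m.group(1) if m else "no match")
# prints: L1 -> L, RH10 -> RH, EH3 -> EH, BT9 -> BT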
For the former I looked at a table of population by UK city found on the World Population Review.
United Kingdom has 1 city with more than a million people, 95 cities with between 100,000 and 1 million people, and 951 cities with between 10,000 and 100,000 people.
Population of Cities in United Kingdom 2024
If we limit the analysis to towns with a population of 100,000 or more, we would only have 96 entities in our analysis, which also strikes me as better than 1300-plus.
I expect either choice will provide entities that are more truly comparable. I favour the Postcode areas approach as it covers the whole of the UK. But until we look at the data it is hard to say for sure. So, I plan to add the data needed for both approaches to the dataset CSV file and then see if one approach looks better than the other.
But looking at the above quotes, the population for London (which I expect means Greater London) probably covers a much larger area than the outward-code version of London I have used so far. But…
I am also going to add the appropriate country to the dataset. It may make some sense to look at the countries separately. Don’t know. But since I am already iterating through the dataset, it won’t cost much to add that info. And the country was in the Postcode area table I put in the CSV file.
Add New Data
For the population data, I gave my e-mail address to the site in order to get a CSV of the population data. The table on the site only showed 400 towns. Well, I didn’t gain much: the CSV file I received also only had 400 towns, the last one being Winsford with a population of 30,259.
For the postcode areas I copied the data tables from the web page into a file, then manually edited it (with regexes) to produce a suitable CSV file I can load into dictionaries or dataframes (a sketch of that kind of clean-up follows below). I also had to make an assumption when the table listed two countries for one of the areas: I went by the largest town in the area. There are also some areas with “/” separated names; for now I am leaving them as is, though I may, in future, go with the shortest name shown.
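For illustration, a minimal sketch of that kind of regex clean-up. The sample input lines here are made up, and the real pasted table was messier and needed several passes.

import re

# Made-up sample of lines as pasted from the postcode-area tables.
raw = """AB Aberdeen Scotland
EH Edinburgh Scotland
RH Redhill England"""

rows = ["PC area,aname,country"]
for line in raw.splitlines():
    # area code, area name (possibly multi-word), country (single word here;
    # "Northern Ireland" and the two-country rows needed hand editing)
    m = re.match(r"^([A-Z]{1,2})\s+(.+?)\s+(\w+)$", line)
    if m:
        rows.append(",".join(m.groups()))

print("\n".join(rows))
# PC area,aname,country
# AB,Aberdeen,Scotland
# EH,Edinburgh,Scotland
# RH,Redhill,England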
Okay, I load all the applicable CSV files (dataset and reference tables).
from pathlib import Path
import re

import pandas as pd

upa_pth = Path("updt_addr_2.csv")
if upa_pth.exists():
    # print("\n\tloading updated addresses CSV file")
    addrs = pd.read_csv(upa_pth, sep=",", header=0)
else:
    # print("\n\tloading initial addresses CSV file")
    # init_df() is my helper from the earlier posts in this series
    addrs = init_df("addresses.csv")
# load population and postcode area look up tables
acodes = pd.read_csv("uk_area_country.csv", sep=",", header=0, index_col='PC area')
t_pops = pd.read_csv("pop_by_town.csv", sep=",", header=0, index_col='Town')
# quick check of dataframes
print()
acodes.info()
print()
t_pops.info()
<class 'pandas.core.frame.DataFrame'>
Index: 122 entries, AB to ZE
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   aname    122 non-null    object
 1   country  122 non-null    object
dtypes: object(2)
<class 'pandas.core.frame.DataFrame'>
Index: 400 entries, London to Winsford
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   pop2024  400 non-null    int64
dtypes: int64(1)
memory usage: 6.2+ KB
Add Postcode Area Code Column
Then I started the data additions by extracting the area code from the outward code (already in the CSV file) into a new column. Need that to access the Postcode area data.
# extract area code from outward code (if outward code exists);
# rows without a valid outward code end up as NaN
addrs['acode'] = addrs['pcode'].str.extract(r'(^[a-z]{1,2})\d[a-z\d]?$', flags=re.IGNORECASE)
print()
addrs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
 3   pcode        98332 non-null   object
 4   town         100000 non-null  object
 5   acode        98332 non-null   object
dtypes: int64(2), object(4)
memory usage: 4.6+ MB
Add Three New Columns
Okay, on to the real data: area name, country and town population, taking care to deal with keys that don’t exist in one or other of the lookup tables. I am using get() to access the lookup tables as it provides a default value if a matching entry is not found in the dataframe/series.
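A quick toy illustration of that default-value behaviour first; this is made-up data, not the real lookup table.

import pandas as pd

countries = pd.Series({"AB": "Scotland", "EH": "Scotland"})
print(countries.get("EH", "Not-UK"))  # Scotland
print(countries.get("ZZ", "Not-UK"))  # Not-UK, no KeyError raised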
# okay, let's traverse the dataset dataframe and add the area name, country and population,
# taking care to deal with keys that don't exist in one or other of the lookup tables
addrs["aname"] = None
addrs["country"] = None
addrs["t_pop"] = 0
for ndx in addrs.index:
    addrs.at[ndx, "aname"] = acodes.aname.get(addrs.at[ndx, "acode"], "No-Area")
    addrs.at[ndx, "country"] = acodes.country.get(addrs.at[ndx, "acode"], "Not-UK")
    addrs.at[ndx, "t_pop"] = t_pops.pop2024.get(addrs.at[ndx, "town"], None)
print()
addrs.info()
addrs.to_csv("updt_addr_2.csv", index=False)
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
 3   pcode        98332 non-null   object
 4   town         100000 non-null  object
 5   acode        98332 non-null   object
 6   aname        100000 non-null  object
 7   country      100000 non-null  object
 8   t_pop        71133 non-null   float64
dtypes: float64(1), int64(2), object(6)
memory usage: 6.9+ MB
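As an aside, the row-by-row loop could probably be replaced with vectorized map() calls. A sketch follows, assuming the same dataframes and column names as above; I have not benchmarked it or verified it against the loop’s output.

# vectorized alternative to the per-row loop (untested sketch)
addrs["aname"] = addrs["acode"].map(acodes["aname"]).fillna("No-Area")
addrs["country"] = addrs["acode"].map(acodes["country"]).fillna("Not-UK")
# towns missing from the population table end up as NaN, as in the loop
addrs["t_pop"] = addrs["town"].map(t_pops["pop2024"])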
One thing I noticed when looking at the updated CSV file is that some of the Bad-Outward rows, which have no town assigned, did get an area assigned. So a slight gain in useable data rows.
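If I wanted to quantify that gain, something like the following should do it; I am assuming here that “Bad-Outward” is the sentinel value in the town column, per the earlier post.

# rows with no usable town but a usable postcode area
# ("Bad-Outward" sentinel is an assumption; adjust to your own marker)
recovered = addrs[(addrs["town"] == "Bad-Outward") & (addrs["aname"] != "No-Area")]
print(f"rows gaining an area despite a bad town: {len(recovered)}")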
Let’s Do Some Simple Comparisons of the Three Approaches
I am not going to say much here; the code and its output should tell you what I did.
# spend by town (the ~1300 town approach)
spend_by_town = addrs.groupby('town')['total_spend'].sum()
all_town_spend = spend_by_town.sum(skipna=True)
london_town_sp = spend_by_town.loc['London']
not_london_town = all_town_spend - london_town_sp
spend_town_sorted = spend_by_town.sort_values(ascending=False)

# spend by city with a population of 100,000 or more
pop_gt_100k = addrs[addrs["t_pop"] >= 100000]
spend_city_pop = pop_gt_100k.groupby('town')['total_spend'].sum()
spend_city_pop.to_csv("spend_city_pop.csv")
all_pop_spend = spend_city_pop.sum(skipna=True)
london_pop_sp = spend_city_pop.loc['London']
not_london_pop = all_pop_spend - london_pop_sp
spend_pop_sorted = spend_city_pop.sort_values(ascending=False)

# spend by postcode area
have_area = addrs[addrs["aname"] != "No-Area"]
spend_area = have_area.groupby('aname')['total_spend'].sum()
spend_area.to_csv("spend_area.csv")
all_area_spend = spend_area.sum(skipna=True)
london_area_sp = spend_area["London E":"London WC"].sum()
not_london_area = all_area_spend - london_area_sp
spend_area_sorted = spend_area.sort_values(ascending=False)

print(f"\n{'':>15} {'All 1300 Towns':>15} {'By PCode Area':>15} {'City over 100K':>15}")
print(f"{'All locations:':>15} {int(all_town_spend):>15,} {int(all_area_spend):>15,} {int(all_pop_spend):>15,}")
print(f"{'London:':>15} {int(london_town_sp):>15,} {int(london_area_sp):>15,} {int(london_pop_sp):>15,}")
print(f"{'Not London:':>15} {int(not_london_town):>15,} {int(not_london_area):>15,} {int(not_london_pop):>15,}")

# Now look at the top 6 in each case.
# Need to make an adjustment for the area approach, as London areas are 8 of the top 10.
# top 5 excluding London
t5_towns = spend_town_sorted.index[1:6]
t5t_sp = pd.Series({ndx: spend_town_sorted[ndx] for ndx in t5_towns})
# hand-picked slices skip the London areas in the sorted area series
# (a less brittle way to skip them is sketched after the output below)
t5_areas = spend_area_sorted.index[3:4]
t5_area_tmp = spend_area_sorted.index[6:7]
t5_areas = t5_areas.union(t5_area_tmp)
t5_area_tmp = spend_area_sorted.index[10:13]
t5_areas = t5_areas.union(t5_area_tmp)
t5a_sp = pd.Series({ndx: spend_area_sorted[ndx] for ndx in t5_areas})
t5_pop = spend_pop_sorted.index[1:6]
t5p_sp = pd.Series({ndx: spend_pop_sorted[ndx] for ndx in t5_pop})
t5_comb = t5_towns.union(t5_areas)
t5_comb = t5_comb.union(t5_pop)
print("\nTop 6 Locations\n")
print(f"\n{'':>10} {'All 1300 Towns':>15} {'By PCode Area':>15} {'City over 100K':>15}")
print(f"{'London':>10} {london_town_sp:>15,} {london_area_sp:>15,} {london_pop_sp:>15,}")
for ndx in t5_comb:
    print(f"{ndx:>10} {int(t5t_sp.get(ndx, 0)):>15,} {int(t5a_sp.get(ndx, 0)):>15,} {int(t5p_sp.get(ndx, 0)):>15,}")

# Now the bottom 6 in each case
b6_towns = spend_town_sorted.index[-6:]
b6t_sp = pd.Series({ndx: spend_town_sorted[ndx] for ndx in b6_towns})
b6_areas = spend_area_sorted.index[-6:]
b6a_sp = pd.Series({ndx: spend_area_sorted[ndx] for ndx in b6_areas})
b6_pop = spend_pop_sorted.index[-6:]
b6p_sp = pd.Series({ndx: spend_pop_sorted[ndx] for ndx in b6_pop})
b6_comb = b6_towns.union(b6_areas)
b6_comb = b6_comb.union(b6_pop)
print("\nBottom 6 Locations\n")
print(f"\n{'':>21} {'All 1300 Towns':>15} {'By PCode Area':>15} {'City over 100K':>15}")
for ndx in b6_comb:
    print(f"{ndx:>21} {int(b6t_sp.get(ndx, 0)):>15,} {int(b6a_sp.get(ndx, 0)):>15,} {int(b6p_sp.get(ndx, 0)):>15,}")
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py
                 All 1300 Towns   By PCode Area  City over 100K
 All locations:     495,166,200     485,898,300     255,598,300
        London:     102,377,000     102,550,700     102,377,000
    Not London:     392,789,200     383,347,600     153,221,300
Top 6 Locations
            All 1300 Towns   By PCode Area  City over 100K
    London     102,377,000     102,550,700     102,377,000
   Belfast               0       7,462,500               0
Birmingham       9,264,400      15,456,800       9,264,400
   Bristol       5,843,900               0       5,843,900
   Glasgow       6,514,300               0       6,514,300
    Harrow               0       7,785,400               0
     Leeds       5,172,400               0       5,172,400
 Leicester               0       7,066,600               0
Manchester       8,948,200      11,061,700       8,948,200
Bottom 6 Locations
                       All 1300 Towns   By PCode Area  City over 100K
            Badminton           1,700               0               0
               Bexley               0               0         214,000
           Birkenhead               0               0         312,000
              Boncath           2,100               0               0
              Burnley               0               0         452,800
Dumfries and Galloway               0         403,900               0
              Galston             500               0               0
           Gillingham               0               0         564,000
             Hastings               0               0         265,000
        Isle Of Barra           2,800               0               0
          Isle of Man               0         342,800               0
             Kirkwall               0          97,000               0
    Lerwick / Zetland               0         118,300               0
    Llandrindod Wells               0         228,300               0
         Lochearnhead           2,700               0               0
             Munlochy           2,800               0               0
       Outer Hebrides               0         108,100               0
        West Bromwich               0               0         378,000
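As an aside, the hard-coded index slices I used above to skip the London areas are brittle; they break the moment the rankings shift. A more robust version might filter on the area name instead. A sketch, assuming all eight London areas start with “London”, as they do in my area table:

# top 5 non-London areas by spend, without hard-coded slice positions
non_london = spend_area_sorted[~spend_area_sorted.index.str.startswith("London")]
t5a_alt = non_london.head(5)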
Done
I think this post is long enough and has taken me more than enough time to produce (coding and documenting).
I still have things I’d like to look at, e.g. city spend volumes by country rather than for the whole UK.
Perhaps also a look at how to present the data to management, i.e. tables, charts, graphics or … Not really sure how to deal with that one.
May your coding journeys be fruitful and fun.
Resources
- pandas.Index.union
- pandas.Series.get
- How do I select a subset of a DataFrame?
- A Guide to Formatting with f-strings in Python (PDF)