Okay, I have decided to try to work through the book “Solve Any Data Analysis Problem” by David Asboth (Manning), subtitled Eight projects that show you how. My goal is to work through all eight projects and document what I do, hopefully without any violation of the author’s copyright.

The author wants the reader to attempt the requested analysis before reading his approach and code. That would have been my approach anyway.

So, let’s start with the first one.

The idea is that you are an analyst for a UK company and you have been asked to provide a comparison of customer spending for London-based customers versus the rest of the UK. Management wants to determine whether their customers are mainly located in London and which UK cities are underserved. You are provided with a data file containing each customer’s address and the total amount each customer has spent to date.

Plan of Attack

Okay, for now I am going to try to build a data structure containing the information in the supplied data file, hopefully in a way that will better facilitate the analysis. I plan to use a pandas dataframe.

What I want to do is read the file into a dataframe, then add a column for the city/town the business is in, that information being obtained from the address. I thought about manually editing the file to add the extra field, but when I saw that there are 100,000 entries in the file I decided against that approach.

Get Data into a Dataframe

The file itself looks like a comma-separated values (CSV) file with three fields: the company ID, the address and the total spending to date. The address field is a quoted string, which is itself a comma-separated list of address components. Unfortunately, the version of the file I received had a lot of unwanted new-lines. At least, unwanted by me.

company_id,address,total_spend
1,"APARTMENT 2,
52 BEDFORD ROAD,
LONDON,
ENGLAND,
SW4 7HJ",5700
2,"107 SHERINGHAM AVENUE,
LONDON,
N14 4UJ",4700
3,"43 SUNNINGDALE,
YATE,
BRISTOL,
ENGLAND,
BS37 4HZ",5900

So, in VS Code I used a regular expression to remove the unwanted new-lines, substituting r',\n' with r','. That seemed to solve the problem. So, let’s load the file into a dataframe and let pandas tell us something about the data.
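Incidentally, if you’d rather not do the cleanup by hand in an editor, the same substitution can be scripted. This is a sketch against a miniature stand-in string rather than the real file; for the actual file you’d read the text in, substitute, and write it back out:

```python
import re

# Miniature stand-in for the raw file, with the same broken lines
raw = (
    'company_id,address,total_spend\n'
    '1,"APARTMENT 2,\n'
    '52 BEDFORD ROAD,\n'
    'LONDON,\n'
    'ENGLAND,\n'
    'SW4 7HJ",5700\n'
)

# Same substitution as the VS Code find/replace: join any line break
# that immediately follows a comma
flattened = re.sub(r',\n', ',', raw)
print(flattened)
```

It is also worth noting that pandas’ CSV parser can usually cope with quoted fields containing embedded new-lines, so flattening may not strictly be necessary; I just find the one-record-per-line layout easier to eyeball.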

import pandas as pd
from pathlib import Path

# Start by reading in data file and seeing what I can find in the way of cities
data_fl = "addresses.csv"
a_path = Path(data_fl)
if not a_path.exists():
  print(f"File, {data_fl}, not found in current directory")
  raise SystemExit(1)
addrs = pd.read_csv(a_path, sep=",", header=0)

# let's have a look at some basic info regarding the dataframe contents
print()
addrs.info()
(fin-3.11) PS R:\learn\data_analysis\chap_2> python ch2_geo.py

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      99032 non-null   object
 2   total_spend  100000 non-null  int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB

Right away we can see that there are a number of empty address fields in the dataframe. Let’s have a look at a few.

# let's get a count and have a look at some of those empty address rows
print(f"nbr missing entries: {addrs.isna().sum()}")
df_na = addrs[addrs.isna().any(axis=1)]
print("\n", df_na.head())
      company_id address  total_spend
135         136     NaN         7300
149         150     NaN         5400
168         169     NaN         4600
201         202     NaN         5800
257         258     NaN         3900

I decided I didn’t want to throw those rows away, as it might be important to let management know how many entries were missing an address. So, I filled the empty address fields with the string No-Address.

# let's put something informative in the empty address locations
addrs.fillna("No-Address", inplace=True)
# another look at the basic dataframe info
print()
addrs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB

A couple of things resolved.
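As an aside, since the address column is the only one with gaps, the fill can be scoped to that single column instead of the whole dataframe. A minimal sketch with a couple of made-up rows:

```python
import pandas as pd

# Two made-up rows, one with a missing address
addrs = pd.DataFrame({
    "company_id": [1, 2],
    "address": ["107 SHERINGHAM AVENUE,LONDON,N14 4UJ", None],
    "total_spend": [4700, 5400],
})

# Fill only the address column; the int64 columns are never touched
addrs["address"] = addrs["address"].fillna("No-Address")
print(addrs["address"].tolist())
# → ['107 SHERINGHAM AVENUE,LONDON,N14 4UJ', 'No-Address']
```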

New CSV File

I am doing the coding a bit at a time in an iterative fashion. As such, I don’t really want to go through the above steps every time I load the data file.

So I am going to save it to a new CSV file and load that into the dataframe rather than the initial version. This new CSV will be updated whenever I add or massage the data in order to complete the requested analysis. Or, perhaps another file will be saved and used for future steps in the analysis.

# save modified dataframe to new CSV file
nwa_pth = Path("updated_addresses.csv")
addrs.to_csv(nwa_pth, index=False)

Now to refactor my code to use the new file if it exists.

# r:\learn\data_analysis\chap_2\geo.py
# code for solving data analysis problem in Chap 2 of 'Solve Any Data Analysis Problem'
# rek, 2024.01.12

import pandas as pd
from pathlib import Path


def init_df(fl_nm):
  # Start by reading in data file and seeing what I can find in the way of cities
  a_path = Path(fl_nm)
  if not a_path.exists():
    return None
  addrs = pd.read_csv(a_path, sep=",", header=0)
  # let's put something informative in the empty address locations
  addrs.fillna("No-Address", inplace=True)
  # save modified dataframe to new CSV file
  nwa_pth = Path("updated_addresses.csv")
  addrs.to_csv(nwa_pth, index=False)

  return addrs


def df_info(df):
  # let's have a look at some basic info regarding the dataframe contents
  print()
  df.info()
  # let's get a count and have a look at some of those empty address rows
  print(f"\nnbr missing entries: {df.isna().sum()}")
  df_na = df[df.isna().any(axis=1)]
  print("\n", df_na.head())


# the following is a bit of overkill as I know the file exists
upa_pth = Path("updated_addresses.csv")
if upa_pth.exists():
  print("\n\tloading updated addresses CSV file")
  addrs = pd.read_csv(upa_pth, sep=",", header=0)
else:
  print("\n\tloading initial addresses CSV file")
  addrs = init_df("addresses.csv")

df_info(addrs)

And, running it the first time.

(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py

        loading initial addresses CSV file

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB

nbr missing entries: company_id     0
address        0
total_spend    0
dtype: int64

 Empty DataFrame
Columns: [company_id, address, total_spend]
Index: []

Note that the function call replaced all the missing addresses, so the count of missing entries is zero. Then I ran the module a second time.

(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py

        loading updated addresses CSV file

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
dtypes: int64(2), object(1)
memory usage: 2.3+ MB

nbr missing entries: company_id     0
address        0
total_spend    0
dtype: int64

 Empty DataFrame
Columns: [company_id, address, total_spend]
Index: []

Going to remove the debug print statements, and for now the call to df_info(). Not really needed any longer.

Getting the City/Town

Okay, now the important bit. Extracting the city/town from the address field, if it is present.

I started by looking very carefully at a number of address fields. Here are a few of them.

# let's look at various address values
rnd_smpl = addrs.sample(n=30)
rnd_smpl.info()
pd.set_option('display.max_colwidth', None)
print(rnd_smpl['address'].to_string(index=False))
UNIT 7 THE BARN GLEBE FARM BUSINESS PARK,WESTERHAM ROAD,KESTON,KENT,BR2 6AX
                 GEORGE STREET,COUPAR ANGUS,BLAIRGOWRIE,PERTHSHIRE,PH13 9LU
2 The Campkins Station Road,Melbourn,Royston,Cambridgeshire,England,Sg8 6Dx
                                          82-86 SHEEN ROAD,RICHMOND,TW9 1UF
                 8 PANT-Y-CELYN STREET,YSTRAD MYNACH,HENGOED,WALES,CF82 7BL
                    P.O. BOX 3340,ROAD TOWN,TORTOLA,VIRGIN ISLANDS, BRITISH  <--
                                               3 Bevan Hill,Chesham,Hp5 2Qs
27 NORFOLK STREET,SUNNISIDE,SUNDERLAND,TYNE AND WEAR,UNITED KINGDOM,SR1 1EE
                      327 CLIFTON DRIVE SOUTH,LYTHAM ST ANNES,LANCS,FY8 1HN
                                                                 No-Address
                             DAMER HOUSE,MEADOW WAY,WICKFORD,ESSEX,SS12 9HA
           3RD FLOOR WEST QUAY HOUSE,SOUTH QUAY,DOUGLAS,ISLE OF MAN,IM1 5AR
          71-75 SHELTON STREET,COVENT GARDEN,LONDON,UNITED KINGDOM,WC2H 9JQ
                                  32 Kinburn Street,London,England,Se16 6Dw
        Jubilee House,East Beach,Lytham St.Annes,Lancashire,England,Fy8 5Ft
                                63 MILLHOUSE DRIVE,GLASGOW,SCOTLAND,G20 0UF

Virgin Islands? I don’t expect management wants that in the analysis. What about other countries and/or dependencies? I found addresses for the Channel Islands, Germany, France and Canada. I didn’t search for more.

Looking at the above, and after a bit of trial code, it became clear to me that there was no easy way to extract the city/town from the addresses in the format provided. And there were the non-UK locations to account for as well.

But it did look like all of the UK-based addresses included the UK PostCode. Could I use that to determine the city/town for each entry?
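To make that idea concrete, the lookup I have in mind would be something like the sketch below. The mapping here is made up for illustration; a real one would need to come from a reference dataset (the ONS publishes postcode data, for example):

```python
import pandas as pd

# Hypothetical outward-code -> town mapping (illustration only)
OUTWARD_TO_TOWN = {
    "SW4": "LONDON",
    "N14": "LONDON",
    "BS37": "BRISTOL",
}

# A few outward codes, including one the mapping doesn't know about
pcodes = pd.Series(["SW4", "BS37", "ZZ9"])
towns = pcodes.map(OUTWARD_TO_TOWN).fillna("Unknown")
print(towns.tolist())
# → ['LONDON', 'BRISTOL', 'Unknown']
```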

Extract PostCode

Okay, I am still nowhere near sorting out the city/town for each valid UK address. But based on the above observation, I thought I’d look at extracting any text from the address column that resembled a UK PostCode. I say resembled because I am not trying to determine whether it is in fact a valid code; I will sort that out as I continue with the analysis. Simple things first.

And it was fairly simple. Do note the capture group; that’s what gets extracted. The comma I have before the capture group does not. I am using Python’s raw string syntax (r'') to avoid having to escape the backslashes, as would be the case when submitting a plain Python string to the regex engine.

import re

# let's extract UK style PostCode from address column into new column
addrs['pcode'] = addrs['address'].str.extract(r',([a-z]{1,2}\d[a-z\d]?\s*\d[a-z]{2})', flags=re.IGNORECASE)
# for future use, convert all pcode values to upper case
addrs['pcode'] = addrs['pcode'].str.upper()
print()
addrs.info()
print("\n", addrs.head(), "\n", addrs.tail())
(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
 3   pcode        98332 non-null   object
dtypes: int64(2), object(2)
memory usage: 3.1+ MB

    company_id                                            address  total_spend     pcode
0           1  APARTMENT 2,52 BEDFORD ROAD,LONDON,ENGLAND,SW4...         5700   SW4 7HJ
1           2               107 SHERINGHAM AVENUE,LONDON,N14 4UJ         4700   N14 4UJ
2           3       43 SUNNINGDALE,YATE,BRISTOL,ENGLAND,BS37 4HZ         5900  BS37 4HZ
3           4  HAWESWATER HOUSE,LINGLEY MERE BUSINESS PARK,LI...         7200   WA5 3LP
4           5  AMBERFIELD BARN HOUSE AMBER LANE,CHART SUTTON,...         4600  ME17 3SF

        company_id                                            address  total_spend    pcode
99995       99996  KNAPP HOUSE,SELSLEY WEST,STROUD,GLOUCESTERSHIR...         4600  GL5 5LJ
99996       99997  20 The Broadway, Hampton Court,Way, Thames Dit...         4100  KT7 0LX
99997       99998  MARLAND HOUSE,13 HUDDERSFIELD ROAD,BARNSLEY,SO...         5600  S70 2LW
99998       99999  4 MOUNT SCAR VIEW,SCHOLES,HOLMFIRTH HUDDERSFIE...         4600  HD9 1XH
99999      100000  Manningham Mills Community Center, Silk Wareho...         5100  BD9 5BE

Looks like roughly 98% of the rows in the data file have something that looks like a UK PostCode in the address field. I think this will give us a reasonable data set on which to perform the analysis. How many of those belong to political entities outside the four UK member countries remains to be seen.
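That 98% comes straight from the non-null counts in the info() output (98,332 of 100,000 rows). The fraction can also be computed directly; a sketch with a tiny stand-in Series:

```python
import pandas as pd

# Stand-in for the real pcode column: 3 matches out of 4 rows
pcode = pd.Series(["SW4 7HJ", "N14 4UJ", None, "BS37 4HZ"])

# notna() gives booleans; their mean is the fraction of matched rows
coverage = pcode.notna().mean()
print(f"{coverage:.1%}")
# → 75.0%
```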

But I think I will save the modified dataframe to the updated addresses CSV, then refactor the code to prevent the above column addition from occurring again.

# let's save the changes
nwa_pth = Path("updated_addresses.csv")
addrs.to_csv(nwa_pth, index=False)

Outward Code Only

After sleeping on it, I realized that I only need to extract the outward portion of the PostCode. As it stands, I would have to split each PostCode to get the outward code when determining the correct city/town for that code. So, I am going to rework the regex to reflect that decision and regenerate the column accordingly. For now I won’t bother changing the column name.

addrs['pcode'] = addrs['address'].str.extract(r',([a-z]{1,2}\d[a-z\d]?)\s*\d[a-z]{2}', flags=re.IGNORECASE)
# for future use, convert all pcode values to upper case
addrs['pcode'] = addrs['pcode'].str.upper()
# let's save the changes
nwa_pth = Path("updated_addresses.csv")
addrs.to_csv(nwa_pth, index=False)

And that seems to have worked as desired.

(fin-3.11) PS R:\learn\data_analysis\chap_2\blog> python ch2_geo.py

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   company_id   100000 non-null  int64
 1   address      100000 non-null  object
 2   total_spend  100000 non-null  int64
 3   pcode        98332 non-null   object
dtypes: int64(2), object(2)
memory usage: 3.1+ MB

0                                                 APARTMENT 2,52 BEDFORD ROAD,LONDON,ENGLAND,SW4 7HJ   SW4
1                                                               107 SHERINGHAM AVENUE,LONDON,N14 4UJ   N14
2                                                       43 SUNNINGDALE,YATE,BRISTOL,ENGLAND,BS37 4HZ  BS37
3  HAWESWATER HOUSE,LINGLEY MERE BUSINESS PARK,LINGLEY GREEN AVENUE,GREAT SANKEY, WARRINGTON,WA5 3LP   WA5
4                           AMBERFIELD BARN HOUSE AMBER LANE,CHART SUTTON,MAIDSTONE,ENGLAND,ME17 3SF  ME17

                                                                                                 address pcode
99995                                           KNAPP HOUSE,SELSLEY WEST,STROUD,GLOUCESTERSHIRE,GL5 5LJ   GL5
99996                                  20 The Broadway, Hampton Court,Way, Thames Ditton,Surrey,Kt7 0Lx   KT7
99997                       MARLAND HOUSE,13 HUDDERSFIELD ROAD,BARNSLEY,SOUTH YORKSHIRE,ENGLAND,S70 2LW   S70
99998                           4 MOUNT SCAR VIEW,SCHOLES,HOLMFIRTH HUDDERSFIELD,WEST YORKSHIRE,HD9 1XH   HD9
99999  Manningham Mills Community Center, Silk Warehouse,Lilycroft Road,Bradford,United Kingdom,Bd9 5Be   BD9
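As a quick standalone sanity check on the reworked pattern, one of the sample addresses can be run through Python’s re module directly:

```python
import re

# The outward-code regex from the pandas extract above
pat = re.compile(r',([a-z]{1,2}\d[a-z\d]?)\s*\d[a-z]{2}', re.IGNORECASE)

addr = "43 SUNNINGDALE,YATE,BRISTOL,ENGLAND,BS37 4HZ"
m = pat.search(addr)
print(m.group(1) if m else "no match")
# → BS37
```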

Done

I think that’s it for this post. I seem to be making some progress, and it feels like a good time to take a break.

Resources