Average World Age
Well, I am going to use the median as the average, not the mean, though I likely could use either one. The plan is to sample 5 countries at random, get their median ages, and take the mean of those. Repeat 10 times. Then see if the mean of those 10 sample means approximates the world average age. I may also look at repeating 20 times to see how that does. But I am sure it will all take a bit of time to complete those runs. I will eventually create a new module for the code to do this work.
I am going to write a new function in database/rc_names.py to generate a list of random names. Default will be 5 names for now. It will use the new country CSV, validating with the list of countries/regions we created way back when, and return the list of valid names. I will use an existing function to get the data for the list of names from the population database CSV. Then another existing function to get the median age for each of the countries. Save that. Repeat 10 times. Plot the resulting 10 data points, get their mean and compare that to the median age for the world. I will continue using 2011 as my year of interest. But, will eventually add a command line parameter allowing that to be changed.
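A rough sketch of that plan, before any of the real plumbing exists. The median ages below are made-up placeholder values, not figures from the population CSV, and the function name sample_means() is hypothetical; the real code will get its medians from the existing database functions.

```python
import random
import statistics

# Placeholder median ages (made-up values, NOT from the population CSV)
median_age = {
    'Country A': 18.5, 'Country B': 24.0, 'Country C': 29.5, 'Country D': 33.0,
    'Country E': 38.5, 'Country F': 41.0, 'Country G': 27.0, 'Country H': 21.5,
}


def sample_means(ages, n_countries=5, n_repeats=10, seed=None):
    """Return one mean-of-medians per repeat, sampling countries without replacement."""
    rng = random.Random(seed)
    names = list(ages)
    means = []
    for _ in range(n_repeats):
        picked = rng.sample(names, n_countries)
        means.append(statistics.mean(ages[nm] for nm in picked))
    return means


results = sample_means(median_age, seed=42)
estimate = statistics.mean(results)
print(f"{len(results)} sample means, overall estimate: {estimate:.2f}")
```

Note that this is an unweighted mean: every country counts equally, whatever its population.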
New Branch
Make sure you merged the previous working branch, then create a new one. Note, I did a test merge using a throwaway branch before the actual merge shown below.
PS R:\learn\py_play> git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
Deleted branch trial-merge (was a98b93d).
PS R:\learn\py_play> git branch -a
* master
py-stats
remotes/origin/master
remotes/origin/py-stats
On branch master
Your branch is up to date with 'origin/master'.
PS R:\learn\py_play> git merge --no-ff py-stats
Merge made by the 'recursive' strategy.
population/chart/descriptive_stats.py | 282 +++++++++++++++++++++++++++++++++-
population/database/rc_names.py | 19 +++
PS R:\learn\py_play> git status
On branch master
(use "git push" to publish your local commits)
PS R:\learn\py_play> git branch -d py-stats
Deleted branch py-stats (was 4d55cc0).
PS R:\learn\py_play> git branch -a
* master
remotes/origin/py-stats
PS R:\learn\py_play> git push
Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
To https://github.com/tooOld2Code/pyPlay.git
98e7307..34b66b5 master -> master
PS R:\learn\py_play> git branch -a
* master
remotes/origin/py-stats
PS R:\learn\py_play> git status
On branch master
PS R:\learn\py_play> git push origin -d py-stats
To https://github.com/tooOld2Code/pyPlay.git
- [deleted] py-stats
* master
remotes/origin/master
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
PS R:\learn\py_play> git status
On branch master
Your branch is up to date with 'origin/master'.
PS R:\learn\py_play> git branch avg-age
PS R:\learn\py_play> git checkout avg-age
Switched to branch 'avg-age'
PS R:\learn\py_play> git status
On branch avg-age
nothing to commit, working tree clean
PS R:\learn\py_play> git push --set-upstream origin avg-age
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
remote:
remote: Create a pull request for 'avg-age' on GitHub by visiting:
remote: https://github.com/tooOld2Code/pyPlay/pull/new/avg-age
remote:
To https://github.com/tooOld2Code/pyPlay.git
* [new branch] avg-age -> avg-age
Branch 'avg-age' set up to track remote branch 'avg-age' from 'origin'.
get_rand_countries():
This is pretty much encapsulating the block of code we had in stats_rand.test.py into a separate function. So, it should be pretty straightforward. I have decided, for better or worse, to put the function and related module-level variables in the rc_names.py module. Seemed like the logical place. And, we already have the chk_name() function in that module. I really wish the original population data CSV had somehow identified countries versus regions. But no such luck. So, we will continue to use multiple CSV files to get the job done. For better or worse.
I am going to start by modifying my test area to accept command line parameters for test numbers and the like. It does not currently do so, and I would like to better control what code/test gets run when I execute the module during development and testing. I am sure you could sort this all yourself, but for completeness I am going to include my current code.
if __name__ == "__main__":
    import argparse

    # make sure 'timing' is always at the end of the list. Likely never run again, but already had the code in the module, so kept it.
    tst_desc = ['n/a', "find_names()", "chk_names()", "get_rand_countries()", 'timing']
    max_tst = len(tst_desc) - 1
    # defaults
    do_tst = 0
    full_nm = ''
    do_nm = ''

    # always check for cr_list.txt, generate if not found
    print()
    if pathlib.Path(D_PATH / LST_NM).exists():
        print(f"'{D_PATH / LST_NM}' already exists. So, not generating country/region name list file.")
    else:
        print(f"'{D_PATH / LST_NM}' does not exist. Generating country/region name list file.")
        gen_file_list()

    # check for test number, any other pertinent parameters
    parser = argparse.ArgumentParser()
    # long name preceded by --, short by single -, get integer if appropriate
    parser.add_argument('--do_test', '-t', type=int, help=f'test number (1-{max_tst})')
    # allow multiple words for a country name
    parser.add_argument('--country', '-c', help='country whose data to plot', nargs='+')
    args = parser.parse_args()

    # ignore test numbers outside the valid range
    if args.do_test and (args.do_test >= 1 and args.do_test <= max_tst):
        do_tst = args.do_test
    if args.country:
        full_nm = ' '.join(args.country)
        do_nm = full_nm
        # chk_nm = full_nm
        # if (chk_nm := chk_name(full_nm)) != '':
        #     do_nm = chk_nm

    if do_tst:
        print(f"\nTest #{do_tst}: {tst_desc[do_tst]}\n")

    if do_tst == 1:
        f_str = 'Ca'
        names = find_names(f_str)
        print(f"\tsearching for '{f_str}', found:\n{names}")
        f_str = 'States'
        names = find_names(f_str, anywhere=True)
        print(f"\tsearching for '{f_str}', found:\n{names}")
        #print(f"do_nm = '{do_nm}'")
        if do_nm:
            names = find_names(do_nm, anywhere=True)
            print(f"\tsearching for '{full_nm}', found:\n{names}")

    if do_tst == 2:
        if do_nm:
            fnd_nm = chk_name(do_nm)
            print(f"chk_name({do_nm}) = {fnd_nm}")
        else:
            print("Test not run, no name provided via command line parameter.")

    if do_tst == 3:
        pass

    if do_tst == max_tst:
        import time
        s_tm = time.time()
        s_ns = time.time_ns()
        nbr_reps = 100
        for _ in range(nbr_reps):
            names = find_names("Ca", anywhere=True)
        e_tm = time.time()
        e_ns = time.time_ns()
        print(f"\n{nbr_reps} execution{'s' if nbr_reps>1 else ''} of find_names('Ca'):")
        t_diff = e_tm - s_tm
        print(f"\ttime diff: {e_tm} - {s_tm} = {t_diff:.7f}, average = {t_diff / nbr_reps * 1000:.5f} msec")
        ns_diff = e_ns - s_ns
        print(f"\ttime_ns diff: {e_ns} - {s_ns} = {ns_diff}, average = {ns_diff / nbr_reps * (10**-9) * 1000:.5f} msec")
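If you want to see how those two arguments behave without running the whole module, here is a small standalone sketch. It passes the arguments to parse_args() as a list rather than reading them from the real command line; the argument definitions mirror the ones above, with the help text shortened.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--do_test', '-t', type=int, help='test number')
parser.add_argument('--country', '-c', nargs='+', help='country name, may be several words')

# Simulates: python rc_names.py -t 2 -c United Kingdom
args = parser.parse_args(['-t', '2', '-c', 'United', 'Kingdom'])
full_nm = ' '.join(args.country)   # nargs='+' gives a list of words
print(args.do_test, repr(full_nm))

# With no arguments at all, both attributes are None
no_args = parser.parse_args([])
print(no_args.do_test, no_args.country)
```

That None default is why the test code checks args.do_test and args.country for truthiness before using them.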
Now onto the meat of the matter. Let’s get a list of random names. Bit by bit. I started defining the basic function. Created a variable for the list of valid countries I will return. Loaded the country CSV into a pandas DataFrame. And printed the first few rows to make sure it loaded, before continuing. Then I added the code for the loop to get the desired number of countries, and selected a random row. Again printing to make sure things work before moving on to validating names against the population data CSV country/region name list.
Of course, I need the function that cleans up the names randomly selected from the country CSV, the one we used in the previous post or two, to prepare them for the validation step. I am including it in this module, but renaming it clean_c_name(). I did modify it somewhat to include some steps that were previously in the random selection loop rather than the function. Once that was working, I did a chk_name() on the tidied country name. If the name was good and not already in the list (I am making my selection of random countries without replacement), I added it to the list and continued until I had the requisite number of names.
# with the other imports at the top of the module
import pandas as pd

# with the other module globals near the top of the file, e.g. CSV_NM
CSV_COUNTRY = 'country-continent-codes.csv'


# after the chk_name() function definition
def clean_c_name(c_name):
    cln_nm = c_name
    chk_for = ['people', 'republic']
    # names containing these labels get cut down to their first word
    for label in chk_for:
        if label.casefold() in c_name.casefold():
            cln_nm = c_name.split(" ")[0]
            break
    if ' ' in cln_nm:
        cln_nm = cln_nm.replace(' ', '_')
    return cln_nm.casefold()


def get_rand_countries(c_nbr=5, fl_nm=CSV_COUNTRY):
    list_rand = []
    # first line is comment, second line is header list
    df = pd.read_csv(D_PATH / fl_nm, comment='#', usecols=['continent', 'country'])
    c_cnt = 0
    while True:
        # pick one random row, tidy its name, then validate it
        rand_r = df.sample()
        c_nm = rand_r['country'].iloc[0]
        t_nm = clean_c_name(c_nm)
        chk_nm = chk_name(t_nm)
        if chk_nm:
            if chk_nm not in list_rand:
                # don't duplicate a name
                c_cnt += 1
                list_rand.append(chk_nm)
        if c_cnt >= c_nbr:
            break
    return list_rand
And, a sample of my test output (have not shown the test code):
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/database/rc_names.py -t 3
'data\cr_list.txt' already exists. So, not generating country/region name list file.
Test #3: get_rand_countries()
randomly selected: ['Iran (Islamic Republic of)', 'Cuba', 'Netherlands', 'Tajikistan', 'Botswana']
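To get a feel for what clean_c_name() hands on to chk_name(), here is a condensed restatement of the function so the snippet runs on its own. The three input names are only illustrations and may not match the spelling used in the country CSV.

```python
def clean_c_name(c_name):
    # condensed version of the function above, same behaviour
    cln_nm = c_name
    for label in ('people', 'republic'):
        if label in c_name.casefold():
            cln_nm = c_name.split(" ")[0]
            break
    return cln_nm.replace(' ', '_').casefold()


for nm in ('Cuba', 'United Kingdom', 'Dominican Republic'):
    print(f"{nm!r} -> {clean_c_name(nm)!r}")
# 'Cuba' -> 'cuba'
# 'United Kingdom' -> 'united_kingdom'
# 'Dominican Republic' -> 'dominican'
```

Any name containing "people" or "republic" is cut down to its first word, so the result depends on chk_name() being able to find the full name from that single word.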
Now, I think we can move on to coding the module that will do the work of getting the median for the selected countries, repeating the step multiple times. I plan to plot the data once we have it to see if there is anything obvious happening. So a new module, py_play/population/est_avg_age.py.
Code, Test, Rethink, Refactor and Repeat
While researching estimating population statistics through sampling, I began to realize that I would need to repeat my sampling a large number of times. Certainly more than the 10 or 20 times I mentioned above. Probably more like a hundred or a thousand or ten thousand times. My initial approach was:
repeat 20 times
    get random sample of 5 country names (multiple passes, at least 5, through name list file)
    get population data for the specified year for each name (1 time through the csv file)
    calculate median age for each country, save mean of the median ages to list
save data to file
calculate mean of 20 results
calculate std dev of 20 results
plot 20 results to histogram
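The last three steps of that outline can be sketched with just the standard library. The 20 values below are made-up placeholders, not real results, and the text histogram is only a stand-in for a proper plot.

```python
import statistics
from collections import Counter

# Placeholder sample means (made-up values standing in for 20 real results)
results = [27.1, 29.4, 31.0, 28.2, 30.5, 26.8, 29.9, 32.3, 28.7, 30.1,
           27.6, 29.0, 31.4, 28.9, 30.8, 29.3, 27.9, 30.2, 28.4, 29.7]

mean_est = statistics.mean(results)
std_dev = statistics.stdev(results)   # sample standard deviation (n - 1 divisor)

# Crude text histogram: bucket each result by its integer part
bins = Counter(int(r) for r in results)
for age in sorted(bins):
    print(f"{age:>3} | {'*' * bins[age]}")
print(f"mean = {mean_est:.2f}, std dev = {std_dev:.2f}")
```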
A little testing said that sampling 8 countries with 60 repeats took approximately 4 minutes. 1000 repeats is going to take rather longer. Something on the order of an hour. Not my idea of fun. Even a few hundred repeats are going to take longer than I would like. So, I decided it was time to rethink my coding strategy.
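The back-of-the-envelope arithmetic behind that estimate, assuming the run time scales linearly with the number of repeats (an assumption, but a reasonable one when each repeat re-reads the same files):

```python
measured_secs = 4 * 60        # roughly 4 minutes for the test run
measured_repeats = 60
secs_per_repeat = measured_secs / measured_repeats   # 4.0 seconds per repeat

for repeats in (100, 1000, 10000):
    est_min = repeats * secs_per_repeat / 60
    print(f"{repeats:>6} repeats: ~{est_min:.0f} minutes")
```

At about 4 seconds per repeat, 1000 repeats works out to roughly 67 minutes.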
I am going to look at putting the country text list into a variable in memory. And, rewrite the validation function accordingly. Then I am going to go through the country-continent list (already in memory) and for each valid name I will get the median age and total population for the year of concern. I will add that information to the numpy array that contains the country and continent data. I may look at dropping any rows that don’t contain, from our perspective, valid names.
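A minimal sketch of what in-memory validation might look like. The names load_names() and chk_name_mem() are hypothetical, the sample lines are made up, and I am assuming one name per line in the list file. It also only does exact, case-insensitive matches, which is simpler than what chk_name() does.

```python
def load_names(lines):
    """Map casefolded name -> original spelling, built once from the name list."""
    return {ln.strip().casefold(): ln.strip() for ln in lines if ln.strip()}


def chk_name_mem(name, names):
    """Return the stored spelling of name, or '' if it is not in the list."""
    return names.get(name.strip().casefold(), '')


# Stand-in for the lines of cr_list.txt (made-up sample, not the real file)
valid = load_names(["Canada\n", "Cuba\n", "United Kingdom\n"])
print(chk_name_mem("cuba", valid))           # Cuba
print(repr(chk_name_mem("Narnia", valid)))   # ''
```

The file gets read once, and every lookup after that is a dictionary access rather than another pass through the file.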
Once that is done, I will start a loop to select rows at random from the updated array. And use that to generate the estimated average age for each selected sample. Saving those to lists. Then I will plot to look for something interesting of which we can take advantage. I will likely do this a few times for various numbers of repeats to see how that affects the final outcome.
Now this will all take some testing, as I don’t know how much memory I will end up using. Expect my computer will be able to handle it but…
So, for now I am going to create a test module, which will end up doing all the work. That will also involve re-coding or copying a goodly number of functions. Going to take some time and very likely a few posts. Always seems to end up taking longer than I expected, doesn’t it?
New Test File: get_avg_age.test.py
In my initial coding, I realized that my in-memory list of validated names was rather small compared to the original. And, it was missing some rather important countries, e.g. United Kingdom. On a bit more looking at things I realized that the population data CSV had a LocID column, which appeared to match the number column in the country-continent CSV. So I decided to build a new list of country/region names and LocID from the population data CSV. Then I will try to merge that with the dataframe I create from the country-continent CSV.
I am going to start slow and easy. First I will create the new CSV file. Then I will read my two lists into variables in memory. Then see what I need to do to merge the two on the id column. The fastest way possible. So, likely a bit of timing code as well. See you back here in an hour or two.
cr_code_list.csv
Okay, from our population database I want to get a list of each country/region in the data file and its LocID. I am going to give these columns the labels: ‘number’ for the id and ‘valid’ for the country/region name. ‘number’ is the name of the column in the country-continent CSV file. Figure things will be easier if they match. And, in my early test code I had labelled the column with the name from the data file as ‘valid’. So…
I have added all my usual front end stuff to the test module. The various path building code to make sure I can load packages and save things in the right place. Along with some imports I have the following.
import csv
import pathlib

# Test for the most likely starting working directories.
# If found set D_PATH accordingly. Using pathlib to cover operating system issues.
# If not one of them print message and exit.
D_PATH = ''
p = pathlib.Path.cwd()
if p.name == 'py_play':
    D_PATH = pathlib.Path('./data')
elif p.name == 'population':
    D_PATH = pathlib.Path('../data')
elif p.name == 'database':
    D_PATH = pathlib.Path('../../data')
else:
    print(f"\nCurrent working directory is an unexpected location, {p}!")
    print("Please run the application from the root directory. Quitting application")
    exit(1)

if __name__ == '__main__':
    # if running module directly, need to set system path appropriately so that Python can find local packages
    import sys
    file = pathlib.Path(__file__).resolve()
    parent, root = file.parent, file.parents[1]
    #print(f"parent: {parent}; root: {root}\n")
    sys.path.append(str(root))
    # Additionally remove the current file's directory from sys.path
    try:
        sys.path.remove(str(parent))
    except ValueError:  # Already removed
        pass

from database import population as pdb
Now let’s build and write that new CSV file.
# module constants
_CSV_NM = 'WPP2019_PopulationByAgeSex_Medium.csv'
_LST_NM = 'cr_list.txt'
_LST_CC_NM = 'cr_code_list.csv'
# do I want to build the new cr_code_list.csv file or use whatever currently exists
_BLD_DB_CC_CSV = True


def make_db_name_id_csv(fl_nm=_LST_CC_NM):
    with open(D_PATH / _CSV_NM, 'r') as csv_fl, open(D_PATH / fl_nm, 'w') as lst_fl:
        lst_fl.write("number,valid\n")
        # skip the header row of the population CSV
        csv_fl.readline()
        r = csv.reader(csv_fl, delimiter=',', quotechar='"')
        curr_rc_nm = ''
        for row in r:
            # write one line each time the country/region name changes
            if curr_rc_nm != row[1]:
                curr_rc_nm = row[1]
                if ',' not in row[1]:
                    lst_fl.write(f"{row[0]},{row[1]}\n")
                else:
                    # quote names that contain a comma
                    lst_fl.write(f"{row[0]},\"{row[1]}\"\n")


if __name__ == '__main__':
    import time

    if _BLD_DB_CC_CSV:
        s_tm = time.time()
        make_db_name_id_csv()
        e_tm = time.time()
        print(f"building {_LST_CC_NM} took {e_tm - s_tm}")
And I get the following.
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/get_avg_age.test.py
building cr_code_list.csv took 3.4818835258483887
Quick look at cr_code_list.csv seems to indicate I got what I wanted. There are 440 entries in the file. I have added the file to my .gitignore. (And committed the change.)
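As an aside, the function above quotes names containing a comma by hand. The csv module's writer will do that for you. A small sketch, writing to an in-memory buffer instead of a file, with made-up rows rather than real LocIDs and names:

```python
import csv
import io

# Made-up rows; the second name contains a comma
rows = [(1, 'Plainland'), (2, 'Island, Republic of')]

buf = io.StringIO()
w = csv.writer(buf, lineterminator='\n')
w.writerow(['number', 'valid'])
w.writerows(rows)
print(buf.getvalue())
# number,valid
# 1,Plainland
# 2,"Island, Republic of"
```

By default the writer only quotes fields that need it, which gives the same output as the manual approach.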
mod_country-continent.csv
And, now onto merging the data in the new CSV file with that of the country-continent file. Keeping only those rows that match based on the country code (e.g. LocID, number, etc.). I started by adding an import and some more variables/constants at the top of the file. And a new if block in the if __name__ == '__main__' block.
...
import csv
import pathlib
import pandas as pd
...
# module constants
_CSV_NM = 'WPP2019_PopulationByAgeSex_Medium.csv'
_LST_NM = 'cr_list.txt'
_LST_CC_NM = 'cr_code_list.csv'
_CSV_COUNTRY = 'country-continent-codes.csv'
_CSV_MOD_CC = 'mod_country-continent.csv'
_BLD_DB_CC_CSV = False
_BLD_MOD_CSV = True
...
    if _BLD_MOD_CSV:
        s_tm = time.time()
Then I load the two CSV files into pandas dataframes, do a bit of tidying and finally merge them into one dataframe. The new dataframe is then written to a new CSV file. (Also added to my .gitignore. No sense wasting space committing stuff we generate, or can generate, via code.)
    if _BLD_MOD_CSV:
        s_tm = time.time()
        # read list of country/region names in population database into memory
        df_valid_names = pd.read_csv(D_PATH / _LST_CC_NM, comment='#')
        # read csv of countries/continents in country-continent csv into memory
        df_country = pd.read_csv(D_PATH / _CSV_COUNTRY, comment='#', usecols=['continent', 'country', 'number'])
        # get rid of duplicate rows
        df_country = df_country.loc[df_country['country'].shift(1) != df_country['country']]
        # merge two dataframes keeping only rows with a match on the 'number' column
        merged = pd.merge(df_country, df_valid_names, on='number', how='inner')
        # write merged data to csv file
        merged.to_csv(D_PATH / _CSV_MOD_CC, index=False)
        e_tm = time.time()
        diff_tm = e_tm - s_tm
        print(f"modifying and saving dataframe took {diff_tm:.2f}")
And,
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/get_avg_age.test.py
building cr_code_list.csv took 3.5469698905944824
199 entries in the file. Seems pretty reasonable to me.
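If the shift() comparison and the inner merge look a bit opaque, here is a toy version with made-up data: one country repeated on consecutive rows, and one id on each side with no partner in the other frame.

```python
import pandas as pd

# Made-up rows: 'Bravo' appears on two consecutive lines, 'Delta' has no match
df_country = pd.DataFrame({
    'continent': ['X', 'Y', 'Y', 'Z'],
    'country': ['Alpha', 'Bravo', 'Bravo', 'Delta'],
    'number': [1, 2, 2, 4],
})
df_valid = pd.DataFrame({'number': [1, 2, 3], 'valid': ['Alpha', 'Bravo', 'Charlie']})

# Keep a row only if its country differs from the row directly above it
df_country = df_country.loc[df_country['country'].shift(1) != df_country['country']]

# Inner merge keeps only the 'number' values present in both frames
merged = pd.merge(df_country, df_valid, on='number', how='inner')
print(merged)
```

Only the Alpha and Bravo rows survive. Note that the shift() approach only removes duplicates sitting on adjacent rows; drop_duplicates(subset='country') would be the alternative if the repeats could be scattered through the file.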
Done for Now
Another lengthy post, with a bunch of coding and testing in the background. So, I think I will call it a day.
But, just so you know, I am now thinking of loading the data for all countries for the specified year into a dictionary in memory. Along with the new continent-country CSV in an in-memory dataframe. So two in-memory variables with all the info needed to generate multiple samples without needing to read a CSV for each sample. Assuming everything will fit in memory. But, that is for next time.
Next time we will start by getting the data into memory for all the countries in the dataframe of continents and countries, with the valid country names for our population data CSV. Until then…
A Bit More Work Before We Are Really Done
I want to make sure I track (Git) the code we just wrote to generate the two new CSV files. I do not track test files. So, I am going to move the function make_db_name_id_csv() into the database/rc_names.py module. I am also going to put the code for generating mod_country-continent.csv into a function in the rc_names module as well. Something like make_mod_country_csv(). Committing my additions as I go along. This may mean that I have to modify or add parameters to the function definitions. I will then import them in this module and use them in the appropriate sections.
For ‘completeness’ here’s my modified code from the two affected files.
# in database/rc_names.py
...
# defaults
CSV_NM = 'WPP2019_PopulationByAgeSex_Medium.csv'
LST_NM = 'cr_list.txt'
LST_CC_NM = 'cr_code_list.csv'
CSV_COUNTRY = 'country-continent-codes.csv'
CSV_MOD_CC = 'mod_country-continent.csv'
...
def make_db_name_id_csv(fl_csv=CSV_NM, fl_lst=LST_CC_NM):
    with open(D_PATH / fl_csv, 'r') as csv_fl, open(D_PATH / fl_lst, 'w') as lst_fl:
        lst_fl.write("number,valid\n")
        # skip the header row of the population CSV
        csv_fl.readline()
        r = csv.reader(csv_fl, delimiter=',', quotechar='"')
        curr_rc_nm = ''
        for row in r:
            # write one line each time the country/region name changes
            if curr_rc_nm != row[1]:
                curr_rc_nm = row[1]
                if ',' not in row[1]:
                    lst_fl.write(f"{row[0]},{row[1]}\n")
                else:
                    lst_fl.write(f"{row[0]},\"{row[1]}\"\n")


def make_mod_country_csv(fl_nid=LST_CC_NM, fl_cc=CSV_COUNTRY, fl_new=CSV_MOD_CC):
    # read list of country/region names, with loc id, in population database into memory
    df_valid_names = pd.read_csv(D_PATH / fl_nid, comment='#')
    # read csv of countries/continents in country-continent csv into memory
    df_country = pd.read_csv(D_PATH / fl_cc, comment='#', usecols=['continent', 'country', 'number'])
    # get rid of duplicate rows
    df_country = df_country.loc[df_country['country'].shift(1) != df_country['country']]
    # merge two dataframes keeping only rows with a match on the 'number' column
    merged = pd.merge(df_country, df_valid_names, on='number', how='inner')
    # write merged data to csv file
    merged.to_csv(D_PATH / fl_new, index=False)
...
# in play/get_avg_age.test.py
...
if __name__ == '__main__':
    import time

    if _BLD_DB_CC_CSV:
        s_tm = time.time()
        rcn.make_db_name_id_csv()
        e_tm = time.time()
        print(f"building {_LST_CC_NM} took {e_tm - s_tm}")

    if _BLD_MOD_CSV:
        s_tm = time.time()
        rcn.make_mod_country_csv()
        e_tm = time.time()
        diff_tm = e_tm - s_tm
        print(f"modifying and saving dataframe took {diff_tm:.2f}")

    if _DO_SAMPLING:
        ...
And, here’s a little teaser for next time.
See you next week. Until then be happy and play with your code!
Resources
- pandas.read_csv
- pandas.DataFrame.to_csv
- pandas.DataFrame.merge
- get list from pandas dataframe column
- Pandas - Delete all consecutive rows except the first one which share same column value
- Pandas: Drop consecutive duplicates
- Drop columns if rows contain a specific value in Pandas
- Pandas: Drop rows from a dataframe with missing values or NaN in columns
- How to add Empty Column to Dataframe in Pandas?
- How to create an empty column in a Pandas dataframe in Python
- Iterating through columns and rows in NumPy and Pandas
- How To Loop Through Pandas Rows?
- Python statistics | mean() function
- stdev() method in Python statistics module