Did You Think We Were Done?

I know that this is likely a relatively useless exercise for many, but it is something that could likely be applied to other problems. And, for a learning exercise it is not horribly complicated. In most cases where I have started a project there really was no end. They all evolved situation by situation, need by need, want by want, … I hope to illustrate that real world — at least for a developing, single developer situation — in the posts related to the current exercise. Hopefully I don’t fail.

What if?

Okay, what if I’d also like to display a bar chart showing the population in a single country for a specific age group or groups for some period of years? Or maybe compare a specific age category for two or more countries for some period of time? Well, I think our function could almost handle some of that, barring the title and axis labels. Unfortunately, we are still manually coding the data we need. Which, given that big file of data I downloaded, is really a shame. Where to go next?

I think it’s time we looked at finding a way to get the data we want out of that big CSV file. That means we are going to have figure out how to open the file, read the file contents and somehow get the data we really want. Python likely makes some of that easy enough. But, we may also have to dig deep into the how to for other aspects of that process.

That said, let’s make that our next iterative step in fleshing out our project.

Sort Out Pulling Data from Rows in File

Development Data

I had been planning to jump in and work on reading the file. But decided I should sort out extracting data from the lines in the file first. To that end I am going to put the lines for two countries spanning a total of three years into a list (array if you prefer) to give me something to work with. Note lists can be manually created by enclosing the items of the list in square brackets, each item separated by a comma. I am going to use the data for Cameroon for 2100 and Canada for 1950 and 1951. This will simulate the file better than a single country and year. And I am going to try and get the data for Canada in 1950 (country and year I was born in) into a dictionary. I am calling the test data sim_file as in simulated file. See below.

sim_file = [
"120,Cameroon,2,Medium,2100,2100.5,0-4,0,5,3513.238,3439.959,6953.197",
"120,Cameroon,2,Medium,2100,2100.5,5-9,5,5,3493.873,3431.061,6924.934",
"120,Cameroon,2,Medium,2100,2100.5,10-14,10,5,3486.994,3432.481,6919.475",
"120,Cameroon,2,Medium,2100,2100.5,15-19,15,5,3445.934,3401.324,6847.258",
"120,Cameroon,2,Medium,2100,2100.5,20-24,20,5,3389.152,3360.185,6749.337",
"120,Cameroon,2,Medium,2100,2100.5,25-29,25,5,3310.664,3298.652,6609.316",
"120,Cameroon,2,Medium,2100,2100.5,30-34,30,5,3208.392,3212.046,6420.438",
"120,Cameroon,2,Medium,2100,2100.5,35-39,35,5,3087.928,3107.085,6195.013",
"120,Cameroon,2,Medium,2100,2100.5,40-44,40,5,2949.061,2983.937,5932.998",
"120,Cameroon,2,Medium,2100,2100.5,45-49,45,5,2788.762,2841.762,5630.524",
"120,Cameroon,2,Medium,2100,2100.5,50-54,50,5,2601.572,2680.215,5281.787",
"120,Cameroon,2,Medium,2100,2100.5,55-59,55,5,2364.338,2476.149,4840.487",
"120,Cameroon,2,Medium,2100,2100.5,60-64,60,5,2082.997,2235.282,4318.279",
"120,Cameroon,2,Medium,2100,2100.5,65-69,65,5,1742.244,1940.014,3682.258",
"120,Cameroon,2,Medium,2100,2100.5,70-74,70,5,1348.414,1582.72,2931.134",
"120,Cameroon,2,Medium,2100,2100.5,75-79,75,5,917.946,1164.055,2082.001",
"120,Cameroon,2,Medium,2100,2100.5,80-84,80,5,508.125,720.473,1228.598",
"120,Cameroon,2,Medium,2100,2100.5,85-89,85,5,198.879,329.143,528.022",
"120,Cameroon,2,Medium,2100,2100.5,90-94,90,5,43.644,89.659,133.303",
"120,Cameroon,2,Medium,2100,2100.5,95-99,95,5,4.296,11.69,15.986",
"120,Cameroon,2,Medium,2100,2100.5,100+,100,-1,0.189,0.644,0.833",
"124,Canada,2,Medium,1950,1950.5,0-4,0,5,835,801,1636",
"124,Canada,2,Medium,1950,1950.5,5-9,5,5,671,645,1316",
"124,Canada,2,Medium,1950,1950.5,10-14,10,5,570.999,554.001,1125",
"124,Canada,2,Medium,1950,1950.5,15-19,15,5,544,534,1078",
"124,Canada,2,Medium,1950,1950.5,20-24,20,5,552,557.999,1109.999",
"124,Canada,2,Medium,1950,1950.5,25-29,25,5,550,564,1114",
"124,Canada,2,Medium,1950,1950.5,30-34,30,5,513,521.001,1034.001",
"124,Canada,2,Medium,1950,1950.5,35-39,35,5,487.998,478.001,965.999",
"124,Canada,2,Medium,1950,1950.5,40-44,40,5,432.002,410.999,843.001",
"124,Canada,2,Medium,1950,1950.5,45-49,45,5,378.001,352.999,731",
"124,Canada,2,Medium,1950,1950.5,50-54,50,5,338.001,318,656.001",
"124,Canada,2,Medium,1950,1950.5,55-59,55,5,295.999,276,571.999",
"124,Canada,2,Medium,1950,1950.5,60-64,60,5,263.001,239,502.001",
"124,Canada,2,Medium,1950,1950.5,65-69,65,5,220.999,196.998,417.997",
"124,Canada,2,Medium,1950,1950.5,70-74,70,5,156,147,303",
"124,Canada,2,Medium,1950,1950.5,75-79,75,5,92.001,92,184.001",
"124,Canada,2,Medium,1950,1950.5,80-84,80,5,42.204,49.354,91.558",
"124,Canada,2,Medium,1950,1950.5,85-89,85,5,18.001,23.001,41.002",
"124,Canada,2,Medium,1950,1950.5,90-94,90,5,4.331,6.5,10.831",
"124,Canada,2,Medium,1950,1950.5,95-99,95,5,0.639,1.164,1.803",
"124,Canada,2,Medium,1950,1950.5,100+,100,-1,0.07,0.135,0.205",
"124,Canada,2,Medium,1951,1951.5,0-4,0,5,867.135,831.867,1699.002",
"124,Canada,2,Medium,1951,1951.5,5-9,5,5,705.681,677.228,1382.909",
"124,Canada,2,Medium,1951,1951.5,10-14,10,5,586.194,567.989,1154.183",
"124,Canada,2,Medium,1951,1951.5,15-19,15,5,547.746,536.79,1084.536",
"124,Canada,2,Medium,1951,1951.5,20-24,20,5,558.138,558.77,1116.908",
"124,Canada,2,Medium,1951,1951.5,25-29,25,5,560.241,572.987,1133.228",
"124,Canada,2,Medium,1951,1951.5,30-34,30,5,526.399,535.477,1061.876",
"124,Canada,2,Medium,1951,1951.5,35-39,35,5,499.334,490.775,990.109",
"124,Canada,2,Medium,1951,1951.5,40-44,40,5,446.194,424.67,870.864",
"124,Canada,2,Medium,1951,1951.5,45-49,45,5,387.503,361.446,748.949",
"124,Canada,2,Medium,1951,1951.5,50-54,50,5,342.321,322.413,664.734",
"124,Canada,2,Medium,1951,1951.5,55-59,55,5,297.763,280.38,578.143",
"124,Canada,2,Medium,1951,1951.5,60-64,60,5,264.217,242.996,507.213",
"124,Canada,2,Medium,1951,1951.5,65-69,65,5,224.159,202.469,426.628",
"124,Canada,2,Medium,1951,1951.5,70-74,70,5,160.586,150.842,311.428",
"124,Canada,2,Medium,1951,1951.5,75-79,75,5,96.649,96.821,193.47",
"124,Canada,2,Medium,1951,1951.5,80-84,80,5,45.011,51.55,96.561",
"124,Canada,2,Medium,1951,1951.5,85-89,85,5,18.739,23.99,42.729",
"124,Canada,2,Medium,1951,1951.5,90-94,90,5,5.197,7.522,12.719",
"124,Canada,2,Medium,1951,1951.5,95-99,95,5,0.749,1.316,2.065",
"124,Canada,2,Medium,1951,1951.5,100+,100,-1,0.064,0.131,0.195"
]

Note that each entry in the list is enclosed in double quotes. Python considers anything enclosed in quotes a string. I.E. textual data. Though there are other ways to generate strings as well. I have done this as I know that when we read from the file, each line we get will be a Python string.

Add the above to your working file. For now also comment out the line calling the plot function. We don’t want that running everytime we test something.

Working With Lists

Python has a built-in len() function. We can access any row in the list by using sim_file[x] where x is the index for that row. The available indices are 0, 1, ... max_index where max_index = len(sim_file) - 1.

So let’s just print the last row. Do something like:

max_index = len(sim_file) - 1
# can only concatenate string with a string, so convert max_index, an integer, to a string
# using Python's built-in str() function
print("max_index = " + str(max_index))
print("last line = " + sim_file[max_index])

It probably would be a good idea for you to run the code with print("max_index = " + max_index). You will of course get an error, but it will likely help in the learning process. You will also note that VS Code did not flag that as an error.

Fix the error, save and run the file. I got the following output in the Terminal tab in the Output Window:

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
max_index = 62
last line = 124,Canada,2,Medium,1951,1951.5,100+,100,-1,0.064,0.131,0.195

If max_index is 62, then there are 63 lines in sim_file. Or, 21 lines per year.

Loops

To build our population dictionary for Canada for 1950 we will need to read every row in sim_file until we find the country name of Canada, and the year of 1950. Once we read a row that doesn’t have a year of 1950 we will know we are done. This calls for something programmers refer to as a loop. We repeatedly run a block of code from some initial state until we are done (a final state). At which point we leave the loop, and continue on with our program.

I propose we create an empty dictionary. Then we start a loop to read each line in the simulated file. Doing nothing until we find Canada and 1950 in a line. Then we process each line, adding the desired information to the dictionary, until we hit one that doesn’t have Canada and 1950 in it. At that point we leave the loop, and we should have our dictionary of population by age group for Canada in 1950.

Every programmer likely has there own way to tackle this process. One that is repeated many times in large programming projects. Because we will likely want to get the data for different countries and years from our filw, we will put our code into a function that we can use repeatedly. For now we will pass in our simulated file (dictionary), along with the country and year we want to get the data for. And, the function, when done, will return the dictionary to the caller.

Start by Defining a New Function

So let’s start by defining the basic function, and an empty dictionary. Oh yes, you can also delete that play code we wrote earlier. Don’t delete sim_file.

def get_pop_data(country, year, src_data):
  age_group_data = {}
  
  return age_group_data

Python for Statement

The most basic way to loop through a list in Python is to use the for statement. We are going to execute a loop that counts through the indices of sim_file. In the block of code within the loop we will use the indices to access the lines in the simulated file. Note that we will have to indent our code for the function definition, then again in the for loop. As stated previously, indentation is Python’s way of keeping track of which code blocks belong together or within some controlling statement. The basics will look like this:

def get_pop_data(country, year, src_data):
  age_group_data = {}
  max_rows = len(src_data)
  for i in range(max_rows):
    curr_line = src_data[i]
    <do stuff if appropriate>

  return age_group_data

Within in the function we can access the data passed in via the src_data parameter. We are using the built-in functions len() to get the end condition for the loop, and range() to get all the indices within our source data list. Eventually that list will in fact be a file.

Note: the range function is an iterator returning an iterable that, as written above, provides the caller with the numbers 1 - (max_rows - 1) in sequence once per call. In our case 1 - 62 will be returned in sequence on each run through the loop. The loop will terminate following the last value being returned by range.

Go ahead and add the initial code for the new function to your file (don’t copy/paste), leaving out the do stuff line. Save the file. Note that VS Code warns you that there is an unused variable — yellow squiggly line under the variable name where it was initially defined, and a yellow 1 to the right of the file name in the Explorer shown in the Side Bar. Now onto the <do stuff if appropriate>.

How to Decide Whether to Process a Line or Not

So how do we figure out which lines to process. Well we know that we want only want to process lines that contain our country name and the year of interest. We can make this determination a number of ways, but the simplest is likely to use the Python in operator. This operator tests to see whether some element exists within some sequence (think list, set, etc). But it can also be used with some specialised sequences (such as str, bytes and bytearray) for subsequence testing. So, if our line contains the country name, "Canada" in curr_line should return True. And False otherwise. And ditto for "1950" in curr_line. And, if both are true simultaneously, we have a line we want to extract data from.

To test that we need another of Python’s operators and — a boolean operator. The following line will equate to True only if both checks are True. Otherwise it will return False.

  "Canada" in curr_line and "1950" in curr_line

Within our loop we can use an if condition to test if we have a line we want to work with or not. Let’s give it a try. And note the additional indentation following the if condition/statement to define the block of code execution is determined by the if statement. Also note that I have replaced the “Canada” and “1950” by the parameter variables in the function definition. We want these to be determined at the time the function is called, not set in stone. Modify your function code to look something like follows. And add a line of code to call (i.e. execute) the function passing the required arguments. Save and run the file.

Note: this is a lot of play around/test code. I am just testing ideas. The final code will be more to the point.

def get_pop_data(country, year, src_data):
  age_group_data = {}
  max_rows = len(src_data)
  # these will be used to track the lines we and are not interested in
  is_in = ""
  is_not = ""

  for i in range(max_rows):
    curr_line = src_data[i]
    if country in curr_line and year in curr_line:
      # if something already in string add a comma and space before adding next line number
      if len(is_in) > 0:
        is_in += ", "
      # remember to convert line number to string
      is_in += str(i + 1)
    else:
      if len(is_not) > 0:
        is_not += ", "
      is_not += str(i + 1)

  print("lines: " + is_in + " contained both " + country + " and " + year)
  print("lines: " + is_not + " did not contain both " + country + " and " + year)

  return age_group_data

get_pop_data("Canada", "1950", sim_file)

I got the following output.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
lines: 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42 contained both Canada and 1950      
lines: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63 did not contain both Canada and 1950

Given we have three country/year combinations of 21 lines each, and the lines for Canada/1950 are the second set in sim_file that looks about right. Try changing the function call to look for Canada/2019 and see what you get. Hopefully no lines contained that combination of country and year.

We can get rid of all the lines we don’t need. There is one more thing to consider with respect to the loop. In our case going through the extra lines in the list is not a big deal. But what if we were dealing with a list of thousands of lines and 10,000 of those lines were after the ones we were interested in. Really don’t want to wait on the program to process all those unnecessary lines.

I reckon if we track when we find the lines we are interested, and once found hit a line that doesn’t contain our country and year of interest, we can just leave the loop. The for statement can be left (terminated) by executing the break statement. We just need to figure out when to issue the break. More logic needs to be added, and at least one more variable to track. Have a look at the following. Then give it a try.

def get_pop_data(country, year, src_data):
  age_group_data = {}
  max_rows = len(src_data)
  # these will be used to track the lines we were interested in, and
  # to track whether or not we found the country and year in the first place
  found_pair = False
  is_in = ""

  for i in range(max_rows):
    curr_line = src_data[i]
    # let's see what lines get processed
    print("processing line " + str(i + 1))
    # if we've previously found a line or lines of interest, but the current line is not, leave the loop
    # note the brackets around the check for country or year, want to make sure the logic is correct
    # brackets ensure that that block of code is calculated together before any other operations
    if found_pair and (country not in curr_line or year not in curr_line):
      break
    if country in curr_line and year in curr_line:
      # record that we found a line with the country and year of interest
      found_pair = True
      if len(is_in) > 0:
        is_in += ", "
      is_in += str(i + 1)

  print("lines: " + is_in + " contained both " + country + " and " + year)

  return age_group_data

get_pop_data("Canada", "1950", sim_file)

Hopefully you got something like:

processing line 9
processing line 10
processing line 11
processing line 12
processing line 13
processing line 14
processing line 15
processing line 16
processing line 17
processing line 18
processing line 19
processing line 20
processing line 21
processing line 22
processing line 23
processing line 24
processing line 25
processing line 26
processing line 27
processing line 28
processing line 29
processing line 30
processing line 31
processing line 32
processing line 33
processing line 34
processing line 35
processing line 36
processing line 37
processing line 38
processing line 39
processing line 40
processing line 41
processing line 42
processing line 43
lines: 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42 contained both Canada and 1950      

I call this progress. Now to the last step. Building our dictionary. Once again you can get rid of all the lines related to our testing. You know the stuff we print to see if things work and any related variables.

The lines in our file respresent a data type referred to as comma separated values, CSV for short. Essentially each item of data (field) is separated from all the others using a comma. There are numerous forms of character separated values. It all depends on what characters one might find in their data. Often the first line of a CSV file has a header row telling you what each field value represents. Fundamentally it is the textual representation of a table of data.

There are packages specifically written to handle CSV data files. But, for now, I think I will just go with what basic Python has available. The first of these is str.split(sep=None, maxsplit=-1). It splits a string into the pieces separated by sep, returning a list containing each field found.

Try running this line of code print(sim_file[22].split(',')). I got:

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
['124', 'Canada', '2', 'Medium', '1950', '1950.5', '5-9', '5', '5', '671', '645', '1316']

The [] indicate that the item being printed is a list (array if you like). You will note that the country name is at index 1, the year at index 4, the age group at index 6 and the total population at index 11. So for each line of interest we need only add a key:value pair comprising curr_fields[6]: curr_fields[11] to our dictionary. Let’s give it a try.

Well first you should know that we can add a key/value pair to a dictionary using the following:

age_group_data[key] = value

So here’s my updated function. Oh, yes forgot to mention that Python let’s you use single quotes as well as double quotes to delimit a string value.

def get_pop_data(country, year, src_data):
  age_group_data = {}
    max_rows = len(src_data)
  # to track whether or not we found the country and year in the first place
  found_pair = False
  # in case things change
  x_grp = 6
  x_pop = 11
  
  for i in range(max_rows):
    curr_line = src_data[i]
    # if we've previously found a line or lines of interest, but the current line is not, leave the loop
    # note the brackets around the check for country or year, want to make sure the logic is correct
    # brackets ensure that that block of code is calculated together before any other operations
    if found_pair and (country not in curr_line or year not in curr_line):
      break
    if country in curr_line and year in curr_line:
      # record that we found a line with the country and year of interest
      found_pair = True
      # split the CSV into a list
      curr_fields = curr_line.split(',')
      # save the stuff we want to our data dictionary
      age_group_data[curr_fields[x_grp]] = curr_fields[x_pop]

  return age_group_data

Now we don’t just want to generate our dictionary and throw it away. So we will assign the output of the function (the stuff from return age_group_data) to a variable. Then for testing purposes print that variable in the terminal.

canada_1950 = get_pop_data("Canada", "1950", sim_file)
print(canada_1950)

Compare the output to the sim_file table and it looks like we got what we want.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
{'0-4': '1636', '5-9': '1316', '10-14': '1125', '15-19': '1078', '20-24': '1109.999', '25-29': '1114', '30-34': '1034.001', '35-39': '965.999', '40-44': '843.001', '45-49': '731', '50-54': '656.001', '55-59': '571.999', '60-64': '502.001', '65-69': '417.997', '70-74': '303', '75-79': '184.001', '80-84': '91.558', '85-89': '41.002', '90-94': '10.831', '95-99': '1.803', '100+': '0.205'}

So, let’s plot it and see what happens.

plot_bar_chart('1950', canada_1950)

Wow! That isn’t what I expected. Do you see what went wrong. Look at the output of the print(canada_1950) statement. Each of the population values has single quotes around it. Python thinks they’re strings. And, pyplot is handling them accordingly. As I mentioned, each of the fields in the CSV file will be strings following the split operation. So, in our function, we need to turn the data for the population field into a number. Earlier we used str() to turn a number into a string when doing some concatenation. There are similar functions in Python to turn strings, well valid strings, into numbers of one sort or another. For example int('200') will return the integer 200. And, float('234.56') will return the decmial number 234.56. int() won’t work with the string `234.56’. But float() will work with the string ‘200’. So, let’s modify the line adding the pairs to our dictionary to convert the population string value to a float.

age_group_data[curr_fields[x_grp]] = float(curr_fields[x_pop])

Save and run the file. Bingo!

I’d say that’s enough progress for today. Don’t forget to commit your work and push it to GitHub.

r:\learn\py_play>git status
r:\learn\py_play>git add population_by_age.py
r:\learn\py_play>git commit -m "added function to get data from csv file: get_pop_data(), not yet finished"
r:\learn\py_play>git push

Resources

Python Docs

Other Sources