Some Clarification Regarding Comment in Previous Post

In the previous article I said the following:

In most cases where I have started a project there really was no end. They all evolved situation by situation, need by need, want by want, … I hope to illustrate that real world — at least for a developing, single developer situation — in the posts related to the current exercise.

Let’s be perfectly clear. No professional developer, or any developer in a business environment, would work in this fashion. There would be a specific project goal, some or a lot of design work, a plan of some sort with a few to numerous milestones, a set of tests that each bit of code would need to pass, and so on.

What I was really commenting on was my misguided approach to learning something about Python without clear guidance from a qualified instructor.

So please take that comment with multiple grains of salt, or your preferred vinegar.

Getting Population Data from the CSV File

I really don’t want to be manually creating dictionaries containing the population data I would like to plot or otherwise analyze/process. Especially given there is this very detailed file sitting on my hard drive. So, I am going to look at getting the data from the file and passing it to the appropriate functions.

I thought about modifying the get_pop_data function to read the file to get the data. Instead of passing it a dictionary, I would pass a file name and let it go from there. In some ways that makes sense, as extracting the desired data is very much related to the file’s structure. But I felt that made the one function do too many things and know about too many things, perhaps not sufficiently separating concerns. So, I have decided to write another function that reads the file and returns the lines of interest as a list. I can then pass that list to the get_pop_data function to get the dictionary that I need to pass to the plotting function. Seems fairly efficient. And, if necessary, it may be possible to modify the functions to deal with different data and/or file structures.

As for the file, the one I downloaded had one lengthy name: WPP2019_PopulationByAgeSex_Medium.csv. I thought about changing it, but decided to leave it alone. Make sure you have a copy in your py_play folder. I don’t really want to commit the file to our version control, as I expect I can always download another if needed. So, I am going to add the file to my .gitignore, then commit that change.

git status
git add .gitignore
git commit -m "Added large population data CSV to .gitignore"
git push

To save typing that lengthy name more than once, let’s add, near the top of the program file, a variable with a shorter name, say csv_nm, equal to the string of the file’s name.

csv_nm = 'WPP2019_PopulationByAgeSex_Medium.csv'

So, let’s create get_file_lines(). Note, when writing code, Python allows you to use pass as a placeholder where a block of code is expected. This lets your code run without errors, in case you need to test things elsewhere before continuing. We will still need to pass in the country and year so that we can get the correct lines from the file. And it would make sense to include a file name or path. For now I am going to assume the CSV file is in the same folder as the program file, so all that will need to be passed in is the file name.

def get_file_lines(country, year, file_nm):
  pass
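
As a quick illustration of what pass buys us, here is a tiny sketch. The function name not_written_yet and the file name are made up for the example. A stubbed function can be defined and even called; it simply does nothing and returns None.

```python
def not_written_yet(country, year, file_nm):
  # placeholder body: does nothing, so the function returns None
  pass

# the file name here is just a dummy string, nothing gets opened
result = not_written_yet('Canada', '1950', 'some_file.csv')
print(result)  # prints None
```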

As one might expect, Python has a built-in function, open(), for opening files for reading, writing or both. When called, the function returns a file object which provides methods we can use to interact with the file and its contents. The following should work to give us access to the CSV file; the ‘r’ indicates read-only mode:

csv_fl = open(csv_nm, 'r')

Let’s add that to our function, then, to test that it works, read and print the first line. Note that we need to close the file once done, or we can end up holding on to file handles and other resources longer than necessary. So we will close our file after reading that first line. For now you might want to comment out any other lines that cause output to be generated (e.g. the plot of the 1950 data). Note that once the CSV file is open, we will be using methods of the file object, which comes from Python’s io module. That module is part of the standard library, and open() hands us the object directly, so we don’t need to import io ourselves.

def get_file_lines(country, year, file_nm):
  csv_fl = open(file_nm, 'r')
  line_1 = csv_fl.readline()
  csv_fl.close()
  print("'" + line_1 + "'")

get_file_lines('Canada', '1950', csv_nm)

I got the following in the terminal:

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
'LocID,Location,VarID,Variant,Time,MidPeriod,AgeGrp,AgeGrpStart,AgeGrpSpan,PopMale,PopFemale,PopTotal
'

Notice how that last single quote is on a line by itself? That’s because readline() returns the whole line, including the newline character at the end. The Python str object has numerous methods, at least one of which can help us out here. We could use strip() or rstrip(). For now I am going with the former. All it does is remove whitespace from the left and right ends of the string to which it is applied. For this purpose, whitespace includes the newline character. So, modify the code for reading the first line as follows and run the file again.

line_1 = csv_fl.readline().strip()

And, that’s more like it.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
'LocID,Location,VarID,Variant,Time,MidPeriod,AgeGrp,AgeGrpStart,AgeGrpSpan,PopMale,PopFemale,PopTotal'
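
For anyone curious about the difference between those two string methods, here is a small sketch. The sample string is made up, with extra spaces added on purpose so the differences show up.

```python
# a made-up line with leading spaces, a trailing space and a newline
raw = "  LocID,Location \n"

both_ends = raw.strip()           # whitespace removed from both ends
right_only = raw.rstrip()         # whitespace removed from the right end only
newline_only = raw.rstrip("\n")   # only trailing newline characters removed

print(repr(both_ends))     # 'LocID,Location'
print(repr(right_only))    # '  LocID,Location'
print(repr(newline_only))  # '  LocID,Location '
```

For our CSV lines, which have no leading or trailing spaces, any of the three would give the same result.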

As with our code looking through sim_file, we want to loop through the lines until we find one containing both the country and the year in which we are interested. We want to save that line, and all subsequent lines that contain the two items, in a list. When we get a line that doesn’t contain the two items, we exit the loop, close the file and return the list to the caller. Sound familiar? Give it a shot. If you have problems, see below. But do try it yourself first. One hint: you will want to use a different version of the for statement.

for line in csv_fl:
  pass

line will contain the full text, including the newline character, of the current line in the file. Each time through the loop it moves on to the next line. If the loop is not exited beforehand, it will terminate when it reaches the end of the file. You might also want to check out the list append() method. Make sure to test as you go along, but certainly once you think you are done.
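
If you would like to see that form of the loop in action before pointing it at the big file, here is a minimal sketch. It uses io.StringIO as a stand-in for an opened file (my assumption, just so it runs without the CSV), and the text and variable names are made up.

```python
import io

# io.StringIO stands in for an opened file, so no real file is needed
fake_fl = io.StringIO("first line\nsecond line\nthird line\n")

kept = []
for line in fake_fl:
  # each line still ends with its newline character at this point
  if 'second' in line:
    kept.append(line.strip())
fake_fl.close()

print(kept)  # ['second line']
```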

def get_file_lines(country, year, file_nm):
  lines = []
  found_pair = False
  csv_fl = open(file_nm, 'r')
  for line in csv_fl:
    found = country in line and year in line
    if found:
      found_pair = True
      lines.append(line.strip())
    if found_pair and not found:
      break
  csv_fl.close()
  return lines

lines = get_file_lines('Canada', '1950', csv_nm)
print(lines)

I got the following in the Output terminal window.

(base-3.8) R:\learn\py_play>E:/appDev/Miniconda3/envs/base-3.8/python.exe r:/learn/py_play/population_by_age.py
['124,Canada,2,Medium,1950,1950.5,0-4,0,5,835,801,1636', '124,Canada,2,Medium,1950,1950.5,5-9,5,5,671,645,1316', '124,Canada,2,Medium,1950,1950.5,10-14,10,5,570.999,554.001,1125', '124,Canada,2,Medium,1950,1950.5,15-19,15,5,544,534,1078', '124,Canada,2,Medium,1950,1950.5,20-24,20,5,552,557.999,1109.999', '124,Canada,2,Medium,1950,1950.5,25-29,25,5,550,564,1114', '124,Canada,2,Medium,1950,1950.5,30-34,30,5,513,521.001,1034.001', '124,Canada,2,Medium,1950,1950.5,35-39,35,5,487.998,478.001,965.999', '124,Canada,2,Medium,1950,1950.5,40-44,40,5,432.002,410.999,843.001', '124,Canada,2,Medium,1950,1950.5,45-49,45,5,378.001,352.999,731', '124,Canada,2,Medium,1950,1950.5,50-54,50,5,338.001,318,656.001', '124,Canada,2,Medium,1950,1950.5,55-59,55,5,295.999,276,571.999', '124,Canada,2,Medium,1950,1950.5,60-64,60,5,263.001,239,502.001', '124,Canada,2,Medium,1950,1950.5,65-69,65,5,220.999,196.998,417.997', '124,Canada,2,Medium,1950,1950.5,70-74,70,5,156,147,303', '124,Canada,2,Medium,1950,1950.5,75-79,75,5,92.001,92,184.001', '124,Canada,2,Medium,1950,1950.5,80-84,80,5,42.204,49.354,91.558', '124,Canada,2,Medium,1950,1950.5,85-89,85,5,18.001,23.001,41.002', '124,Canada,2,Medium,1950,1950.5,90-94,90,5,4.331,6.5,10.831', '124,Canada,2,Medium,1950,1950.5,95-99,95,5,0.639,1.164,1.803', '124,Canada,2,Medium,1950,1950.5,100+,100,-1,0.07,0.135,0.205']

Looks like I got the data I wanted. So, let’s use the data to plot our chart of Canadian population by age group in 1950. Note: if your terminal window is getting full of stuff and needs a lengthy scroll, enter cls at the command line to remove everything in the window.

lines = get_file_lines('Canada', '1950', csv_nm)
canada_1950 = get_pop_data("Canada", "1950", lines)
plot_bar_chart('1950', canada_1950)

I got what I expected. How about you? I am going to make a couple of commits. One for the new function and revised code to get a plot. Then I am going to clean out all the old development code and data (e.g. sim_file) and commit again. It is apparently a good idea to keep your commits granular. Makes it easier to go back to previous code if necessary.
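
One caution about get_file_lines(): country in line is a substring test. So '1950' would also match a line for another year whose population figure happens to contain 1950, and 'Niger' would also match 'Nigeria'. A possible alternative, which is not the code used in this post, is to compare whole fields using the standard library csv module, and to let a with statement close the file for us. The function name get_file_rows and the little sample file are made up for this sketch. The column positions (1 for Location, 4 for Time) come from the header line printed earlier.

```python
import csv
import os
import tempfile

def get_file_rows(country, year, file_nm):
  # hypothetical variant of get_file_lines(): returns lists of fields, not raw strings
  rows = []
  # the with statement closes the file automatically, even if an error occurs
  with open(file_nm, 'r', newline='') as csv_fl:
    for fields in csv.reader(csv_fl):
      # compare whole fields: index 1 is Location, index 4 is Time
      if len(fields) > 4 and fields[1] == country and fields[4] == year:
        rows.append(fields)
      elif rows:
        # wanted rows are contiguous, so stop at the first non-match after them
        break
  return rows

# tiny sample file so the sketch runs without the real CSV;
# the last row is invented, with 1950 planted in the PopTotal column
sample = (
  "LocID,Location,VarID,Variant,Time,MidPeriod,AgeGrp,AgeGrpStart,AgeGrpSpan,PopMale,PopFemale,PopTotal\n"
  "124,Canada,2,Medium,1950,1950.5,0-4,0,5,835,801,1636\n"
  "124,Canada,2,Medium,1950,1950.5,5-9,5,5,671,645,1316\n"
  "124,Canada,2,Medium,1951,1951.5,0-4,0,5,975,975,1950\n"
)
tmp = tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False)
tmp.write(sample)
tmp.close()

rows = get_file_rows('Canada', '1950', tmp.name)
os.remove(tmp.name)
print(len(rows))  # 2, the invented 1951 row is not picked up
```

Since this version returns lists of fields rather than strings, get_pop_data would need a small change to accept it. I am sticking with the simpler version for now.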

My file now looks like this:

import matplotlib.pyplot as plt
import numpy as np

csv_nm = 'WPP2019_PopulationByAgeSex_Medium.csv'

def plot_bar_chart(year, pop_data):
  # define the x-labels for the chart (the age groups)
  x_labels = list(pop_data.keys())
  # get the y-values (bar heights) for each x-label
  y_values = list(pop_data.values())
  # figure out where to put each of the x-labels along the x-axis, nice of numpy to help
  x_pos = np.arange(len(x_labels))

  # because of the x-label sizes, we need a largish display
  plt.figure(figsize=(15,7.5))
  # give matplotlib.pyplot the values it needs to draw the chart
  plt.bar(x_pos, y_values, align='center', alpha=0.5)
  # tell it what the x-labels are and where to put them
  plt.xticks(x_pos, x_labels)
  # add some info regarding the axes and give the chart a title.
  plt.xlabel('Age Group')
  plt.ylabel('Population (1000s)')
  plt.title(year + ' World Population by Age Group')

  # generate the plot
  plt.show()

def get_pop_data(country, year, src_data):
  age_group_data = {}
  max_rows = len(src_data)
  # to track whether or not we found the country and year in the first place
  found_pair = False
  # in case things change
  x_grp = 6
  x_pop = 11

  for i in range(max_rows):
    curr_line = src_data[i]
    # if we've previously found a line or lines of interest, but the current line is not, leave the loop
    # note the brackets around the check for country or year, want to make sure the logic is correct
    # brackets ensure that that block of code is calculated together before any other operations
    if found_pair and (country not in curr_line or year not in curr_line):
      break
    if country in curr_line and year in curr_line:
      # record that we found a line with the country and year of interest
      found_pair = True
      # split the CSV into a list
      curr_fields = curr_line.split(',')
      # save the stuff we want to our data dictionary
      age_group_data[curr_fields[x_grp]] = float(curr_fields[x_pop])

  return age_group_data

def get_file_lines(country, year, file_nm):
  lines = []
  found_pair = False
  csv_fl = open(file_nm, 'r')
  for line in csv_fl:
    found = country in line and year in line
    if found:
      found_pair = True
      lines.append(line.strip())
    if found_pair and not found:
      break
  csv_fl.close()
  return lines

lines = get_file_lines('Canada', '1950', csv_nm)
canada_1950 = get_pop_data("Canada", "1950", lines)
plot_bar_chart('1950', canada_1950)

I think I am happy to call it a day. ‘Til next time.