Users Don’t Do What We Want

In the last post I mentioned we didn’t deal with users typing country names in lowercase (the CSV file uses title case, more or less). Then there was the issue of finding the wrong country or region because only a portion of the name was given by the user. When playing around with things since the last post, I also discovered that there was at least one region name which contained commas.

1200,"African, Caribbean and Pacific (ACP) Group of States",2,Medium,1972,1972.5,55-59,55,5,4018.719,4317.827,8336.546

In the CSV file the name was enclosed in double quotes. This is a common way to tell CSV parsers to ignore any delimiters (i.e. the comma used to separate the fields on each row) in that stretch of text. So, ,"stuff, sttuf, stttf", means the field value is stuff, sttuf, stttf. The parser should not split the information between the quotes on the commas. My simple str.split(',') approach does not deal with this situation properly.
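
To make the difference concrete, here is a minimal sketch comparing the naive split with Python's csv module. The line is a shortened copy of the row above (first four fields only), and the variable names are just for illustration.

```python
import csv

# Shortened version of the row shown above (first four fields only).
line = '1200,"African, Caribbean and Pacific (ACP) Group of States",2,Medium'

# Naive split: the comma inside the quoted name is treated as a delimiter.
naive_fields = line.split(',')

# csv.reader accepts any iterable of strings, so a one-item list works here.
parsed_fields = next(csv.reader([line]))

print(len(naive_fields))   # 5, the name was cut in two
print(len(parsed_fields))  # 4, the name stayed in one piece
print(parsed_fields[1])
```

The naive version also leaves the double quotes stuck to the two halves of the name, which is one more thing we would have to clean up by hand.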

So, I have decided to do another post and fix all that stuff whatever way I can. Then, in the next post, I am going to go off on a wild goose chase. Always fun that.

Handle the Case of the User Input

Okay, let’s perhaps start with the one that seems the easiest. The letter case of the user input. Well, you know it won’t be really easy. I have been going through a variety of options, in my head and using test code. Has seriously held up the writing of this post. Well there were a couple of other issues, of which I am sure we are all aware.

There is a str.capitalize() method in Python, but it only capitalizes the first letter of a string. There is also a str.title() method that capitalizes the first letter of each word in the string. Turns out that is also not quite what we want. See the region name above. Note, and and of are not capitalized. And if you execute '(ACP)'.title() you get (Acp). Not what we want.
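
A quick sketch of what those two methods do to the region name from above, assuming the user typed it entirely in lowercase:

```python
# The region name from the CSV row above, as a user might type it.
name = 'african, caribbean and pacific (acp) group of states'

# capitalize(): only the very first character is upper-cased.
capitalized = name.capitalize()

# title(): every run of letters gets a leading capital, and the rest is lower-cased.
titled = name.title()

print(capitalized)  # African, caribbean and pacific (acp) group of states
print(titled)       # African, Caribbean And Pacific (Acp) Group Of States
```

Neither result matches the name as it is stored in the file.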

I thought about using a simple function to get what we want. Start with title() then do some, or maybe a bunch of, fixing up. And of course I haven’t looked at every name in the file. So, over time modifications might be required. In the end this struck me as a bad situation. I was looking at something like this:

def to_proper_case(region):
  tmp_name = region.title()
  tmp_name = tmp_name.replace('Of','of')
  tmp_name = tmp_name.replace('And','and')
  if '(Acp)' in tmp_name:
    tmp_name = tmp_name.replace('(Acp)','(ACP)')
  return tmp_name

The function would have to somehow account for every occurrence of things like (ACP), (ALADI), etc. I expect that is going to lead to a coding nightmare.

Note, I was going to take the to_proper_case() function approach because I had chosen not to parse every line of CSV as we traverse the file. Then I thought, let’s just lowercase the region name and the line we are searching while executing the search, i.e. the in expression. Seemed like a simple enough solution.

def is_name_in_line(region, ln_txt):
  return region.lower() in ln_txt.lower()
  

But, there is still the issue of getting the data for the country/region the user really wants. For example, when the user enters new zealand or New Zealand, my search finds Australia/New Zealand first. Likely not what the user really wanted.
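
Here is the problem in miniature. The two sample lines are made up and shortened (the leading codes are placeholders, not values from the file), and the sketch assumes the combined region shows up before the country, as described above.

```python
def is_name_in_line(region, ln_txt):
    # Case-insensitive substring test against the whole raw line.
    return region.lower() in ln_txt.lower()

# Made-up, shortened lines; only the name field matters for this sketch.
sample_lines = [
    '1,Australia/New Zealand,2,Medium',
    '2,New Zealand,2,Medium',
]

# The first line that "contains" the name wins, wanted or not.
first_match = next(ln for ln in sample_lines if is_name_in_line('new zealand', ln))
print(first_match)  # 1,Australia/New Zealand,2,Medium
```

A substring test on the whole line simply cannot tell the two apart. We need to compare against the name field itself.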

So, in the end I decided to just give up and parse every line of the CSV file as we searched for the country/region name supplied by the user. I guess we can think of this as “progressive enhancement.”

This may seem a waste of time. Yes, users are often unpredictable. But, for most of these exercises we are the users, so may not need a lot of babysitting.

Parsing Lines of CSV

Rather than build our own parsing code, which is more code than I want to write, we will use the csv module that Python provides for just this purpose. Do note, there are numerous other CSV packages.

We will have to add an import statement to our file: import csv. Then we will modify get_file_lines() to parse each line as it processes the CSV file. We will then check to see if the user-specified country/region name and year match the specific fields in the parsed line. csv.reader() iterates over the lines of the specified file, parsing each line and returning a list (of strings) containing the fields found in the line.
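
As a rough sketch of that idea, the following uses io.StringIO as a stand-in for an open file. The column names and rows are invented for the example and are not the real file's layout.

```python
import csv
import io

# Stand-in for an open CSV file; header and rows are illustrative only.
fake_file = io.StringIO(
    'LocID,Location,Time\n'
    '1,"Region, With Comma",1972\n'
    '2,Brazil,1972\n'
)

# Each row comes back as a list of strings, one per field.
rows = list(csv.reader(fake_file))
print(rows[1])  # ['1', 'Region, With Comma', '1972']

# Compare individual fields: lower-cased name, and year as a string.
matches = [row for row in rows if row[1].lower() == 'brazil' and row[2] == '1972']
print(matches)  # [['2', 'Brazil', '1972']]
```

Note that even the numeric-looking fields are strings, which is why the year comparison uses '1972' rather than 1972.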

When we find lines that match our criteria, we will save the pertinent information to our lines list. But, unlike previously when we saved the whole line to the list, we will save the list returned by csv.reader(). Which means we will need to convert the population data to a float when using it to build the data to plot the chart. This is what I came up with:

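def get_file_lines(country, year, file_nm):
  lines = []
  found_pair = False
  # newline='' is what the csv module documentation recommends for files passed to csv.reader
  with open(file_nm, 'r', newline='') as csv_fl:
    r = csv.reader(csv_fl, delimiter=',', quotechar='"')
    for row in r:
      # don't forget to account for user typing preferences
      # and recall that year is passed in as a string
      # X_NM and X_YR are the module-level indices of the name and year fields
      found = country.lower() == row[X_NM].lower() and year == row[X_YR]
      if found:
        found_pair = True
        lines.append(row)
      # matching rows sit together, so stop at the first non-match after a match
      if found_pair and not found:
        break
  return lines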

A quick test seems to indicate the code works as expected. So, I am going to commit this change before fixing the other functions to account for the use of a list, conversion to floats, etc. You know, atomic, unambiguous commits.

Now to modify get_pop_data() to use the modified data source returned from get_file_lines(). First I have moved the constants identifying the location of the age group and total population fields in the data from get_pop_data() into the global namespace for the file. I also capitalized their names (Python best practice) and added a couple more.

# indices to certain fields in the CSV file data
X_NM = 1
X_YR = 4
X_AGE_GRP = 6
X_TOT_POP = 11

Okay, let’s modify the “search” code to reflect that we are comparing individual data against individual fields, i.e. apples to apples.

    if found_pair and (country not in curr_line or year not in curr_line):
      break
    if country in curr_line and year in curr_line:

Should become something like:

    if found_pair and (country.lower() != curr_line[X_NM].lower() or year != curr_line[X_YR]):
      break
    if country.lower() == curr_line[X_NM].lower() and year == curr_line[X_YR]:

And, we don’t need to split() the line any more so we can remove those lines from the function. And finally modify our line saving the total population to the dictionary keyed on the age group. So, I end up with:

def get_pop_data(country, year, src_data):
  age_group_data = {}
  max_rows = len(src_data)
  # to track whether or not we found the country and year in the first place
  found_pair = False
  
  for i in range(max_rows):
    curr_line = src_data[i]
    # if we've previously found a line or lines of interest, but the current line is not, leave the loop
    # note the brackets around the check for country or year, want to make sure the logic is correct
    # brackets ensure that that block of code is calculated together before any other operations
    if found_pair and (country.lower() != curr_line[X_NM].lower() or year != curr_line[X_YR]):
      break
    if country.lower() == curr_line[X_NM].lower() and year == curr_line[X_YR]:
      # record that we found a line with the country and year of interest
      found_pair = True
      # save the stuff we want to our data dictionary
      age_group_data[curr_line[X_AGE_GRP]] = float(curr_line[X_TOT_POP])

  return age_group_data

Now you likely see the above could be tidied up considerably. We don’t need to assign curr_line since we are no longer splitting the line into pieces based on commas. That’s already been done when we got the lines of interest from the CSV file. Which also means we don’t need to index our loop on the current row number, i. And, if we did things right, we won’t end up with any lines for any country/region we didn’t specify. In fact, the only other thing we will need to consider is that we got no lines at all back from get_file_lines(). Which also means we don’t need all those arguments in the function call. And, we will need to modify the call to this function in process_plot() to pop_data = get_pop_data(lines).

So, the following should be more than sufficient:

def get_pop_data(src_data):
  age_group_data = {}
  
  for row in src_data:
    # save the stuff we want to our data dictionary
    age_group_data[row[X_AGE_GRP]] = float(row[X_TOT_POP])

  return age_group_data
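
A small sketch of the simplified function in action, using the one row quoted near the top of this post as csv.reader() would return it (a list of strings). The function and constants are repeated so the snippet runs on its own.

```python
# Indices of the age group and total population fields, as defined earlier.
X_AGE_GRP = 6
X_TOT_POP = 11

def get_pop_data(src_data):
    age_group_data = {}
    for row in src_data:
        # key on the age group, convert the population string to a float
        age_group_data[row[X_AGE_GRP]] = float(row[X_TOT_POP])
    return age_group_data

# The row quoted at the top of the post, already parsed into fields.
sample_rows = [
    ['1200', 'African, Caribbean and Pacific (ACP) Group of States', '2', 'Medium',
     '1972', '1972.5', '55-59', '55', '5', '4018.719', '4317.827', '8336.546'],
]

pop_data = get_pop_data(sample_rows)
print(pop_data)          # {'55-59': 8336.546}
print(get_pop_data([]))  # {} when no lines were found
```

The empty-list case returns an empty dictionary rather than failing.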

Once again, quick test says things are working more or less as expected. So, let’s commit the current changes before proceeding.

Fix Country/Region Name in Chart Title

Now if you tested your changes using an improperly cased name, e.g. braZil, you will likely have noticed that the title for the chart displays that incorrectly cased country/region name. Let’s sort that before continuing. Once again, do we need to fix this? For ourselves likely not, but if anyone else is going to use this code, then absolutely yes we do. And, after all, we are learning to code, presumably to write applications others will use.

Turns out this is fairly straightforward. If you look at the function we use to produce the chart after getting the user input, process_plot(), you will notice a couple of things. Firstly, the call to actually draw the chart, plot_bar_chart(), takes as one of its parameters the country name. So, if we pass the proper name to the function the chart will display it correctly.

Secondly, before plotting the chart, we call get_file_lines() to obtain all the relevant lines from the data CSV file for the given country and year. The lines are returned as a list of lists, each inner list holding the information for a given age range. And, each such line contains the proper country or region name. We can obtain that name by getting the value at the appropriate index for any of the lines. This happens to be the index we have coded as the constant X_NM.

So, in our code, we need to pass this proper name to the plotting function rather than the name the user input, which may or may not be properly cased.

A couple of gotchas. We need to ensure we did get lines back from the function get_file_lines(), before we attempt to access the country name. Or the program will crash. And we need to make sure the country name variable we are going to pass to plot_bar_chart() exists before the call.
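
To see the first gotcha for yourself, here is a tiny sketch. It assumes nothing more than an empty list, which is what get_file_lines() returns when nothing matches.

```python
X_NM = 1  # index of the name field, as defined earlier

lines = []  # what we get back when no rows match

try:
    proper_c_nm = lines[0][X_NM]
except IndexError as err:
    # Without a guard, this is the error that would crash the program.
    error_text = str(err)

print(error_text)   # list index out of range
print(bool(lines))  # False, so "if lines:" skips the lookup entirely
```

An empty list is falsy in Python, so a plain if lines: check is all the guard we need.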

I added the following lines to process_plot():

  proper_c_nm = country
  if lines:
    proper_c_nm = lines[0][X_NM]

With process_plot now looking like:

def process_plot(country, year, fl_nm):
  lines = get_file_lines(country, year, fl_nm)
  proper_c_nm = country
  if lines:
    proper_c_nm = lines[0][X_NM]
  pop_data = get_pop_data(lines)
  plot_bar_chart(proper_c_nm, year, pop_data)

My simple tests indicate it appears to work as desired. So, I am going to commit the changes I have made before continuing.

No Data Found?

Currently, if no data is found for the specified country/region and year a blank/empty bar chart is displayed. This doesn’t seem like the best choice. I think it would be better to display a message in the console indicating that no data was found for the combination provided.

Given the changes we just made to display the proper country/region name, it should be easy enough to do just that. We can, in process_plot(), move the lines for data manipulation and chart plotting into the if lines: block. Then add an else: block with code to display a suitable message in the console window. That also means we will be able to get rid of the line: proper_c_nm = country as we won’t ever call plot_bar_chart() if proper_c_nm isn’t defined.

I know we should be able to add colour to make the message stand out, but I am not sure that the code would work across different operating systems or even over different consoles in Windows (e.g. cmd.exe or PowerShell). So, for now I am not going to bother trying to do so.

My process_plot() now looks like this:

def process_plot(country, year, fl_nm):
  lines = get_file_lines(country, year, fl_nm)
  if lines:
    proper_c_nm = lines[0][X_NM]
    pop_data = get_pop_data(lines)
    plot_bar_chart(proper_c_nm, year, pop_data)
  else:
    print(f"\n! No data found for '{country}' and '{year}'.\n")

Note the extra \ns in the message text. Wanted the line to stand out a little — given the absence of colour.

Simple test says it works. So time to commit the most recent changes.

Next Steps

I was planning to cover more in this post, but I think it covers more than enough at the moment. So, new plan.

What I was going to look at was providing the user a list of available countries/regions so they didn’t have to guess. Of course the list is huge, so we will have to limit it in some fashion, e.g. by letter of the alphabet, or by multiple letters to make the list even smaller. This is also somewhat connected to another issue.

How do we deal with the case of ‘United States’ not being the country of the ‘United States of America’? Thought about this for a while. Decided I am not going to guess, or ask for more input. Instead, I will show the country/region found first in the title for the chart. The user can then decide how to deal with the situation. Perhaps ask for a list of the names beginning with that letter of the alphabet.

I will look at the above in the next post. That wild goose chase I mentioned at the top of this post will just have to wait for another day. This is certainly going through more hoops and loops than I expected. Fun, eh?