Last post we dealt with some of the things a user might do that would affect our ability to provide them with a nice bar chart of a country’s or region’s population by age group for a given year. But, there is still the issue of how the user can find out exactly what countries and/or regions are available.

Thoughts About How To Do This

We will have to add something to the menu to allow them to list available country/region names based on a set of characters they provide. If they give us a ‘u’ we will display all the names beginning with a ‘u’. If they give us a ‘un’ we will display all the names beginning with ‘un’. Etc.

Seems simple enough. But there are a number of considerations.

Do we read the data CSV file every time to get the appropriate names? Seems to be somewhat repetitive and expensive time wise. The time needed being dependent on where the specified characters fall alphabetically. The CSV file is in alphabetical order. If we do, will the user be happy waiting that extra time whenever they want a new list? And, I do appreciate that the time involved might be small; but, time is relative and different for everyone experiencing it.

If we don’t want to read the CSV file every time, what are our options? The first is we read the data file once, and generate a new, smaller file that only has each name listed once and none of the population related data. One name per line. That should reduce the size of the file considerably. There are currently a number of lines for each year and a number of years for each country/region name. I’ll let you do the arithmetic. We then search through that for each request. Should be somewhat faster than using the data CSV file each time.

Then of course, the first time we need to access the name file, we could read the whole thing into a data structure (e.g. a dictionary) in memory. Then use the in-memory data structure for all future searches. Should be a lot faster than reading a file every time. And, most modern PCs and laptops shouldn’t have an issue memory wise. Low end cell phones I don’t know. But, I don’t expect anyone to be running this app on their cell phones.
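A minimal sketch of the read-once, search-in-memory idea. The helper names load_names() and search_names() are hypothetical, a plain list stands in for whatever data structure we end up choosing, and an in-memory sample stands in for the real name file.

```python
import io

def load_names(lines):
  """Read one name per line into a list, skipping blank lines."""
  return [line.strip() for line in lines if line.strip()]

def search_names(names, prefix):
  """Return the names that start with prefix, ignoring case."""
  to_find = prefix.lower()
  return [nm for nm in names if nm.lower().startswith(to_find)]

# io.StringIO stands in for an open name file (sample data only)
sample_file = io.StringIO("Cambodia\nCameroon\nCanada\nUnited States of America\n")
names = load_names(sample_file)    # done once, on first use
print(search_names(names, 'can'))  # every later search uses the list
print(search_names(names, 'un'))
```

After the first call, no search touches the disk at all.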

This latter idea of course begs the question as to how should the file of names be structured. One name per line is fine, but perhaps something like JSON which would allow the data structure to be predefined in the file would be a better choice.
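To make the JSON idea a little more concrete, here is a rough sketch. Grouping the names by their lower-cased first letter is only one possible layout and is purely my assumption here; nothing has been decided yet.

```python
import json

# a few sample names, not the full list
names = ['Cabo Verde', 'Cambodia', 'Canada', 'Uganda', 'Ukraine']

# group names by lower-cased first letter (an assumed layout)
by_letter = {}
for nm in names:
  by_letter.setdefault(nm[0].lower(), []).append(nm)

text = json.dumps(by_letter, indent=2)  # what would be written to the file
restored = json.loads(text)             # what would be read back in
print(restored['u'])
```

The point is that json.loads() hands back the dictionary ready to use, with no parsing code of our own.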

Let’s Try Things Out

Up until now, all the code for this app has resided in a single Python file. That is not really the best coding practice, I am told. Modules and packages are apparently the name of the game. And, since this search and display facility is really a completely separate unit of functionality, it seems like a good prospect for a module. Don’t know that we need a package. But that may change with time. And, we may eventually convert our current single app file, population_by_age.py, into one or more modules as our knowledge grows.

Any Python (.py) file is a module, and a bunch of modules in a directory is a package.

Dead Simple Python: Project Structure and Imports, Jason C. McDonald

So, let’s create a new Python file, rc_name_db.py in our project directory. May not be the best name. Can always change it down the road. Now let’s play with some of the ideas discussed above.

Text File With One Name per Line

To start let’s write a function to read the complete population CSV file and record each different country or region name in a new file, one name per line. Once done, we’ll run it and start playing with searching the new file. For now I am calling the function gen_file_list(). I am also going to give it a fl_nm parameter with a default of cr_list.txt. Where cr stands for country/region.

We need to open the population CSV file for input, the cr_list.txt file for output, traverse the data file line by line, and write each new country/region name to our name list file. Because at least one of the names in the data file has commas in it, we can’t do a simple split on each line of the data file.
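A quick illustration of why a plain split won't do. The row below is made up (the real file has more columns), but the quoted name with a comma in it is the kind of thing we have to cope with.

```python
import csv
import io

# made-up row; only the quoted name with a comma matters here
line = '1,"African, Caribbean and Pacific (ACP) Group of States",2019\n'

# a simple split cuts the name at its own comma
print(line.strip().split(',')[1])

# csv.reader respects the quotes and keeps the name whole
row = next(csv.reader(io.StringIO(line)))
print(row[1])
```

The split gives back only the first piece of the name, opening quote included, while the CSV reader returns the full name.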

We will once again need to use the CSV module. So, let’s import that at the top of our module file before getting on with the real work. We’ll also add variables/constants to the top of our file with the names for the population data file and the default value for the name list file.

We will, as usual, also use the context manager for opening our files using the with statement. And we will need to track the last name we wrote to the name list file. So far we have something like the following:

import csv

CSV_NM = 'WPP2019_PopulationByAgeSex_Medium.csv'
LST_NM = 'cr_list.txt'

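def gen_file_list(fl_nm=LST_NM):
  # newline='' lets the csv module handle line endings itself
  with open(CSV_NM, 'r', newline='') as csv_fl, open(fl_nm, 'w') as lst_fl:
    r = csv.reader(csv_fl, delimiter=',', quotechar='"')
    next(r, None)  # skip the header row (assumes the file starts with one)
    curr_rc_nm = ''  # last name written to the list file
    for row in r:
      pass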

The rest should be fairly simple. Check if name in current line is different from the previous name. If so, update the previous name, write the new name to the list file and go to the next line if it exists. And we end up with something like:

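def gen_file_list(fl_nm=LST_NM):
  # newline='' lets the csv module handle line endings itself
  with open(CSV_NM, 'r', newline='') as csv_fl, open(fl_nm, 'w') as lst_fl:
    r = csv.reader(csv_fl, delimiter=',', quotechar='"')
    next(r, None)  # skip the header row (assumes the file starts with one)
    curr_rc_nm = ''  # last name written to the list file
    for row in r:
      # rows for a name are grouped together, so compare with the previous name only
      if curr_rc_nm != row[1]:
        curr_rc_nm = row[1]
        lst_fl.write(f"{row[1]}\n")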

Now to test things out.

Testing Within Modules

Because we are writing a module we need to ensure that any code we use for testing is not going to run when the module is loaded by another module or program. Python has us covered.

A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended. Within a module, the module’s name (as a string) is available as the value of the global variable __name__.

The important thing is that the value of __name__ depends on the context within which the module is run. More importantly, if we run a module independently, that is, we don’t import it, __name__ will be equal to '__main__'. You can confirm that by printing the value of __name__ at the bottom of your module, e.g. print(f'__name__ is {__name__}'). Then run the module file. You should get __name__ is __main__ printed out in the console window.

If you now import rc_name_db in your main application file and run it, you should see __name__ is rc_name_db printed in the console. Ok, let’s get rid of that test code and continue.

You should now realize that if we include our test code in an if block that checks the value of __name__, we will be able to control when or if that code gets executed. So let’s add the following to our module.

if __name__ == "__main__":
  print(f'__name__ is {__name__}')

Now if you repeat the tests (solo execution vs import), you should only see __name__ is __main__ printed to the console in the solo execution test. You should not see it when executing the module with import rc_name_db.
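If you’d rather see both cases side by side in a single run, here is a small self-contained sketch. It writes a throwaway module, demo_mod.py (a made-up name), to a temporary directory, then imports it and also runs it as a script using the standard runpy module.

```python
import importlib
import pathlib
import runpy
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
  # demo_mod is a throwaway module that just records its own __name__
  mod_path = pathlib.Path(tmp_dir) / 'demo_mod.py'
  mod_path.write_text('seen_name = __name__\n')

  # case 1: import the module
  sys.path.insert(0, tmp_dir)
  try:
    importlib.invalidate_caches()
    imported = importlib.import_module('demo_mod')
  finally:
    sys.path.remove(tmp_dir)

  # case 2: run the same file as if it were the main script
  as_script = runpy.run_path(str(mod_path), run_name='__main__')

print(imported.seen_name)      # name seen when imported
print(as_script['seen_name'])  # name seen when run directly
```

The first print shows the module’s own name, the second shows __main__.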

For my purposes, I also don’t want to regenerate the list file if it already exists. So in the test code I will first check if the file exists. If it doesn’t I will generate it. If it does, I won’t. I will also print messages advising the decision made so that I know it is working as expected. So, the test code, at the bottom of the module, now looks like the following.

Note, I added import pathlib to the top of the file.

if __name__ == "__main__":
  if pathlib.Path(LST_NM).exists():
    print(f"'{LST_NM}' already exists. So, not generating country/region name list file.")
  else:
    print(f"'{LST_NM}' does not exist. Generating country/region name list file.")
    gen_file_list()

Ok, in my case cr_list.txt is a mere 11 KB compared to the over 116 KB of the population data file. Should make searching a touch faster. So, let’s give that a try.

You may have noticed a new folder in your project directory: __pycache__. This is a default behaviour of Python when working with imported modules and packages.

To speed up loading modules, Python caches the compiled version of each module in the __pycache__ directory under the name module.version.pyc, where the version encodes the format of the compiled file; it generally contains the Python version number. For example, in CPython release 3.3 the compiled version of spam.py would be cached as __pycache__/spam.cpython-33.pyc. This naming convention allows compiled modules from different releases and different versions of Python to coexist.

6.1.3. “Compiled” Python files
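If you are curious what the cached file name would be for your own interpreter, the standard library can tell you. A small sketch; the exact result depends on your Python version and implementation (and on settings such as a custom pycache prefix), so I am not showing any output. Implementations without a cache tag may raise NotImplementedError here.

```python
import importlib.util

# path Python would use for the cached bytecode of spam.py
cached = importlib.util.cache_from_source('spam.py')
print(cached)
```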

Let’s make sure to commit the work we’ve done so far. I am also going to add cr_list.txt and __pycache__/* to .gitignore before committing the new module file. Probably should have committed the module file earlier, but…

Searching the File With Our List of Names

Going to add a new function for this, find_names(). It will take a string parameter with the text characters the user wants to match with the beginning of names in our list. I am also, as I did for gen_file_list(), including a parameter for the file to search, which will default to our LST_NM file. The function itself will read each line of the file, saving any matching names to a list (array). When the search is complete, the function will return the list of names. The function of course assumes there is just one name and nothing else on each line of the file. Note the use of a list comprehension to get our list of matching names.

def find_names(srch4, fl_nm=LST_NM):
  to_find = srch4.lower()
  with open(fl_nm) as lst_fl:
    names = [line.strip()
              for line in lst_fl
              if line.lower().startswith(to_find)
            ]
  return names
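If list comprehensions are new to you, the one above does the same job as an explicit loop. Here is a sketch of the loop version; find_names_loop() is just an illustrative name, and it takes any iterable of lines so it can be tried out with an in-memory sample instead of the real file.

```python
import io

def find_names_loop(srch4, lst_fl):
  """Loop version of the comprehension; lst_fl is any iterable of lines."""
  to_find = srch4.lower()
  names = []
  for line in lst_fl:
    if line.lower().startswith(to_find):
      names.append(line.strip())
  return names

# io.StringIO stands in for the open list file (sample data only)
sample = io.StringIO("Cabo Verde\nCambodia\nChad\n")
print(find_names_loop('Ca', sample))
```

The comprehension just packs the loop, the test and the append into a single expression.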

Now let’s run a simple test or two. First, let’s search for ‘Ca’. Add the following code in our if __name__ block (properly indented of course).

  f_str = 'Ca'
  names = find_names(f_str)
  print(f"\nsearching for '{f_str}', found:\n{names}")

The output I get looks like:

searching for 'Ca', found:
['Cabo Verde', 'Cambodia', 'Cameroon', 'Canada', 'Caribbean', 'Caribbean Community and Common Market (CARICOM)']

That seems to work. Now, what if I want to find names related to the United States and search for States?

  f_str = 'States'
  names = find_names(f_str)
  print(f"\nsearching for '{f_str}', found:\n{names}")

The output I get is:

searching for 'States', found:
[]

Well, that’s not so good. Looks like we need to allow for searching for characters within names, not just at the start of names. I propose to do so within the existing function. So, I am going to add an optional parameter that indicates which of the two search methods to use. Like so:

def find_names(srch4, anywhere=False, fl_nm=LST_NM):
  pass

I have put that parameter in the 2nd position because it is more likely than the fl_nm parameter to be given a non-default value. At least in our current situation. I will add an if block to the function checking which option has been requested, the default being to only check for the search string at the front of the country/region name. And, modify how the search is executed in the enclosed if and else blocks accordingly. Something like:

def find_names(srch4, anywhere=False, fl_nm=LST_NM):
  with open(fl_nm) as lst_fl:
    if anywhere:
      # case-sensitive match anywhere in the name
      names = [line.strip()
                for line in lst_fl
                if srch4 in line
              ]
    else:
      # case-insensitive match at the start of the name
      to_find = srch4.lower()
      names = [line.strip()
                for line in lst_fl
                if line.lower().startswith(to_find)
              ]
  return names

Let’s modify the second test to search anywhere and see what we get.

  f_str = 'States'
  names = find_names(f_str, anywhere=True)
  print(f"\nsearching for '{f_str}', found:\n{names}")

searching for 'States', found:
['African, Caribbean and Pacific (ACP) Group of States', 'Commonwealth of Independent States (CIS)', 'Economic Community of Central African States (ECCAS)', 'Economic Community of West African States (ECOWAS)', 'League of Arab States (LAS, informal name: Arab League)', 'Micronesia (Fed. States of)', 'Organization of American States (OAS)', 'Small Island Developing States (SIDS)', 'UNFPA: Arab States (AS)', 'UNITED NATIONS Regional Groups of Member States', 'United Nations Member States', 'United States Virgin Islands', 'United States of America', 'United States of America (and dependencies)']

You may have noticed that when doing the anywhere search I do not force lower case for the search. If you would like to know why, try searching on ‘ca’. I believe my approach is what users would expect.
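For the impatient, here is a small sketch of the difference, using a handful of names from the outputs above rather than the full list file.

```python
names = ['Cabo Verde', 'Cambodia', 'Canada', 'United States of America',
         'Micronesia (Fed. States of)']

# case-sensitive, as in the anywhere search
exact = [nm for nm in names if 'ca' in nm]
# what we would get if we lower-cased the names first
folded = [nm for nm in names if 'ca' in nm.lower()]

print(exact)
print(folded)
```

With case preserved only 'United States of America' matches (the 'ca' in America). Ignoring case also pulls in every name that happens to start with 'Ca', and on the full list the extra matches would pile up quickly.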

Think that’s it for today. The next post may prove to be short, but I’d like to do some timing of how long a search takes. Then we need to modify our user interface to allow for searching of the country/region names.

Resources

Python documentation on:

List Comprehensions
