Getting Started

Some time back I talked about generating our descriptive statistics for a few randomly selected countries. To do that I will need a list of the countries of the world. I can’t use the list from our population CSV file because it contains a large number of “regions”. I ended up downloading a CSV file from datawookie / data-diaspora, though it looks like the file was produced by JohnSnowLabs on DataHub.io. I used this one because it includes the continent for each country. No idea how accurate it is, but I added it to my py_play/data folder.

My plan is to randomly select an entry from the new CSV file, country-continent-codes.csv, and check to see if the country name exists in our population data CSV file. If so, I will start generating plots, descriptive statistics, etc. for that country and displaying them on screen. From there I will capture the plots/info and add them to this page or to individual pages I will link to from this page.

pandas

I am going to use pandas, specifically a DataFrame, for reading and storing the country CSV data in memory.

The data file is fairly small and having it in memory will make getting a random country somewhat easier. When reading the CSV I hope to use pandas functionality to only get two columns from the file, continent and country, ignoring the other four columns.

I am going to create a new file, stats_rand.test.py, in my play folder to test things out. First of all, we will need to install pandas in our conda environment.

(base) PS R:\learn\py_play> conda activate base-3.8
(base-3.8) PS R:\learn\py_play> conda install pandas
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

 environment location: E:\appDev\Miniconda3\envs\base-3.8

 added / updated specs:
   - pandas

The following packages will be downloaded:

   package                    |            build
   ---------------------------|-----------------
   ca-certificates-2020.10.14 |                0         122 KB
   pandas-1.1.3               |   py38ha925a31_0         7.5 MB
   ------------------------------------------------------------
                                          Total:         7.6 MB

The following NEW packages will be INSTALLED:

 pandas             pkgs/main/win-64::pandas-1.1.3-py38ha925a31_0
 pytz               pkgs/main/noarch::pytz-2020.1-py_0

The following packages will be UPDATED:

 ca-certificates                               2020.7.22-0 --> 2020.10.14-0

Proceed ([y]/n)? y

Downloading and Extracting Packages
pandas-1.1.3         | 7.5 MB    | ############################################################################ | 100%
ca-certificates-2020 | 122 KB    | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Getting Random Countries

Ok, now that we have the basics in place, let’s use pandas to read the two columns of interest from our new CSV file. Then print out the first few lines of the DataFrame we created with the read_csv method. Don’t forget to import pandas. You might also wish to research the usecols parameter of the read_csv method. Here’s my initial attempt.

import pandas as pd

CSV_FL = 'data/country-continent-codes.csv'
df = pd.read_csv(CSV_FL, usecols = ['continent','country'])
print(df.head(5))

And:

(base) PS R:\learn\py_play> conda activate base-3.8
(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/stats_rand.test.py
Traceback (most recent call last):
  File "r:/learn/py_play/population/play/stats_rand.test.py", line 6, in <module>
    df = pd.read_csv(CSV_FL, usecols = ['continent','country'])
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\site-packages\pandas\io\parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\site-packages\pandas\io\parsers.py", line 452, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\site-packages\pandas\io\parsers.py", line 946, in __init__
    self._make_engine(self.engine)
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\site-packages\pandas\io\parsers.py", line 1178, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "E:\appDev\Miniconda3\envs\base-3.8\lib\site-packages\pandas\io\parsers.py", line 2008, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 537, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 833, in pandas._libs.parsers.TextReader._get_header
ValueError: Passed header names mismatches usecols

The first line of the CSV file is a comment indicating the source of the data. I assumed that the read_csv method would automatically skip comment lines. It does not. I suppose I could have removed the line, but the work of others should always be acknowledged. So, I needed to tell the method what a comment line looks like by adding comment='#' to the parameter list. Alternatively, I could have told it to skip the first line by adding skiprows=1 to the parameter list.

import pandas as pd

CSV_FL = 'data/country-continent-codes.csv'
# first line is comment, second line is header list
df = pd.read_csv(CSV_FL, comment='#', usecols = ['continent','country'])
print(df.head(5))

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/stats_rand.test.py
    continent                                       country
0        Asia              Afghanistan, Islamic Republic of
1      Europe                          Albania, Republic of
2  Antarctica  Antarctica (the territory South of 60 deg S)
3      Africa      Algeria, People's Democratic Republic of
4     Oceania                                American Samoa
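
To compare the two options side by side, here is a minimal sketch that reads a small in-memory CSV instead of the real file. The sample rows and the code column are made up for illustration; comment, skiprows and usecols are the actual read_csv parameters discussed above.

```python
import io

import pandas as pd

# Made-up stand-in for the real file: a '#' comment line, then the header row
CSV_TEXT = """# Source: example attribution line
continent,code,country
Asia,AF,Afghanistan
Europe,AL,Albania
"""

# Option 1: tell pandas that '#' starts a comment
df_comment = pd.read_csv(io.StringIO(CSV_TEXT), comment='#',
                         usecols=['continent', 'country'])

# Option 2: skip the first physical line of the file
df_skip = pd.read_csv(io.StringIO(CSV_TEXT), skiprows=1,
                      usecols=['continent', 'country'])

print(df_comment)
print(df_comment.equals(df_skip))
```

One thing to keep in mind: with comment='#' the rest of any line is dropped from the '#' onward, so a data value containing that character would be cut short. skiprows=1 avoids that, but silently breaks if the attribution line is ever removed from the file.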

I was going to import the random package and use it to select countries at random. But, pandas provides a sample() method on DataFrames which does the work for us. It also allows us to specify how many random samples we want from the data. Let’s get 3 rows.

import pandas as pd

CSV_FL = 'data/country-continent-codes.csv'
# first line is comment, second line is header list
df = pd.read_csv(CSV_FL, comment='#', usecols = ['continent','country'])
# without setting the random state I would get a different set of countries each time
# for testing I want the same each time, will remove at later date
# it will also hopefully mean if you use the same state you will get the values I got
rand = df.sample(n=3, random_state=6)
print(rand)

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/stats_rand.test.py
    continent                       country
19     Europe         Belgium, Kingdom of
89     Europe                   Gibraltar
210    Europe  Slovakia (Slovak Republic)
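
A quick way to convince yourself what random_state does is to sample twice with the same value. This is only a sketch on a small made-up DataFrame, not the real country file, and whether the same seed reproduces the exact rows shown above likely also depends on having the same data and library versions.

```python
import pandas as pd

# Small made-up frame standing in for the country data
df = pd.DataFrame({'country': ['Belgium', 'Gibraltar', 'Slovakia',
                               'Canada', 'Chile', 'Kenya']})

# Same seed, same data: the two samples contain the same rows
first = df.sample(n=3, random_state=6)
second = df.sample(n=3, random_state=6)
print(first.equals(second))

# No seed: rows can differ from run to run
unseeded = df.sample(n=3)
print(unseeded)
```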

And, since we are currently only interested in the country name, let’s drop the continent column from our result. Then iterate through the returned object and print only the country name on each line. As you will see, we really don’t need to drop the ‘continent’ column from the returned random sample.

import pandas as pd

CSV_FL = 'data/country-continent-codes.csv'
# first line is comment, second line is header list
df = pd.read_csv(CSV_FL, comment='#', usecols = ['continent','country'])
# without setting the random state I would get a different set of countries each time
# for testing I want the same each time, will remove at later date
rand_r = df.sample(n=3, random_state=6)
# drop returns a copy if inplace=False (the default), I want to change the rand_r object directly
rand_r.drop(columns=['continent'], inplace=True)
# to test the drop
print(rand_r, "\n")
# to test iterating over returned random countries
for c_nm in rand_r['country']:
  print(c_nm)

(base-3.8) PS R:\learn\py_play> python.exe r:/learn/py_play/population/play/stats_rand.test.py
                          country
19          Belgium, Kingdom of
89                    Gibraltar
210  Slovakia (Slovak Republic)

Belgium, Kingdom of
Gibraltar
Slovakia (Slovak Republic)
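
To make the point about not needing the drop concrete, here is a small sketch. The DataFrame is rebuilt by hand from the output above rather than sampled, so it runs without the CSV file.

```python
import pandas as pd

# Hand-built copy of the sampled rows shown above
rand_r = pd.DataFrame(
    {'continent': ['Europe', 'Europe', 'Europe'],
     'country': ['Belgium, Kingdom of', 'Gibraltar',
                 'Slovakia (Slovak Republic)']},
    index=[19, 89, 210])

# Without inplace=True, drop() returns a new DataFrame and leaves rand_r alone
dropped = rand_r.drop(columns=['continent'])
print(list(rand_r.columns), list(dropped.columns))

# Selecting the column directly gives the names without any drop at all
names = rand_r['country'].tolist()
for c_nm in names:
    print(c_nm)
```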

From here on in I will select a single country at a time, without using the random_state parameter, so that I get a different one each time. I also want to make sure the randomly selected country can be found in our population database (CSV). Sorry, I am going to be duplicating a lot of code from the previous exercise, e.g. chk_name().

New Branch

Been looking at things, and it looks like I will need to add functions to the chart/descriptive_stats.py module. Perhaps to others as well. So, I think I am going to create a new branch. Before I do that I will make sure I have all current pending changes to the main branch committed or discarded.

Plan of Attack

Basically, for each country in a series of randomly selected countries, I want:

to create a directory,
save a histogram for the mean and related numbers to that directory,
save a histogram for the median and related numbers to that directory, and
save a file (text, markdown or HTML??) with all the descriptive stats in a table to that directory.
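
The directory part of that plan can be sketched with pathlib. This is only an illustration under assumptions: make_country_dirs() is a hypothetical helper, the rand_ prefix and img folder name are placeholder choices, and a temporary directory stands in for the real blog folder.

```python
import tempfile
from pathlib import Path


def make_country_dirs(base_dir, short_name):
    """Create <base_dir>/rand_<short_name>/img and return both paths (hypothetical helper)."""
    post_dir = Path(base_dir) / f'rand_{short_name}'
    img_dir = post_dir / 'img'
    # parents=True also creates post_dir; exist_ok=True makes a rerun harmless
    img_dir.mkdir(parents=True, exist_ok=True)
    return post_dir, img_dir


with tempfile.TemporaryDirectory() as tmp:
    post_dir, img_dir = make_country_dirs(tmp, 'belgium')
    # placeholder for the file that would hold the descriptive stats table
    (post_dir / 'index.md').write_text('stats table goes here\n', encoding='utf-8')
    # a second call for the same country does not raise
    make_country_dirs(tmp, 'belgium')
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob('*'))
    print(created)
```

A bare mkdir() with no arguments raises FileExistsError if the directory is already there, which matters when the same country is used repeatedly during development.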

I may also build another table showing the descriptive stats for all the selected countries side by each. I will likely try to do this in such a way that each directory will represent a new post/page for my blog. Hopefully saving me some work down the road.

I will while working on the code only use a single country and the same one repeatedly until I have things working to my liking.

So, I will be adding functions to the chart/descriptive_stats.py module to produce and save the two charts, and to generate the complete set of descriptive statistics (as we did in one of our tests earlier). I may add the option to generate the table in various formats (CSV, JSON, HTML, etc.).

I expect this will all take much more than one post. We shall see how that all works out. And whether or not for the better.

chk_name()

I have decided to add this function to the database.rc_names package. Makes more sense than copying it repeatedly. So, a bit of copy and paste. A new test. And, once working, a git commit.

descriptive_stats.py

Okay, I am now going to work, one by one, on adding functions to the descriptive_stats.py module, and test them by using them in stats_rand.test.py. Let’s start with one to generate the histogram for the mean of our current country and year. By the way, I am sticking with 2011 as the year, for better or worse. Though I will add a command line argument to stats_rand.test.py to allow me to change that whenever I run the module.
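
A command line argument for the year could look something like the following. It is a sketch only: the -y/--year flag name and the parse_args() wrapper are assumptions, not what the module actually ends up using. The year is kept as a string because that is how it is used as a key elsewhere in the code.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command line options for stats_rand.test.py."""
    parser = argparse.ArgumentParser(
        description='Descriptive statistics for randomly selected countries')
    parser.add_argument('-y', '--year', default='2011',
                        help='year of population data to use (default: 2011)')
    return parser.parse_args(argv)


# No arguments: falls back to the default year
print(parse_args([]).year)
# Explicit year on the command line
print(parse_args(['--year', '2015']).year)
```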

Default Save Directory

I was getting tired of going through the directory structure each time I used the “Save” button on a plot. So, I have added a couple of lines to allow me to define a default save directory for my plots.

PS R:\learn\py_play> git diff head^^ head
diff --git a/population/chart/descriptive_stats.py b/population/chart/descriptive_stats.py
index 20ea255..fad791c 100644
--- a/population/chart/descriptive_stats.py
+++ b/population/chart/descriptive_stats.py
@@ -16,10 +16,13 @@ Functions:
   ()

 """
+import matplotlib as mpl
 import matplotlib.pyplot as plt
 from scipy import stats
 import pathlib

+mpl.rcParams["savefig.directory"] = "R:/hugo/proj_resources/tooOldCode/images"
+
 if __name__ == '__main__':
   import sys
   from pathlib import Path

You will note I apparently have two imports for matplotlib. The first imports the matplotlib package, under the alias mpl, but it does not import any of matplotlib’s submodules. So, still need to import matplotlib.pyplot.
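
One detail worth spelling out: savefig.directory sets the folder the interactive “Save” dialog opens in. Calling savefig() in code does not consult it, so a full path is still needed there. The sketch below uses a temporary directory as a stand-in for the real images folder and the Agg backend so it runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib as mpl
mpl.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

save_dir = Path(tempfile.mkdtemp())  # stand-in for the real images folder

# Default folder offered by the interactive "Save" dialog
mpl.rcParams['savefig.directory'] = str(save_dir)

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
# savefig() writes exactly where it is told, so build the full path
out_file = save_dir / 'example.png'
fig.savefig(out_file)
plt.close(fig)
print(out_file.exists())
```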

Also, you will note the from pathlib import Path. I am using the pathlib module instead of the os module. If you are wondering why, have a look at “Why you should be using pathlib”.

Function to Plot Histogram Showing the Mean and Such

Let’s work on that first histogram.

Bug Fixes and Annotations

While starting to work on the function to display the mean and such on a suitable histogram, I found some bugs/issues. The bug was that the call to ax.hist() in chart.descriptive_stats.pop_histogram() was using “global” variables rather than defined parameters. It worked during testing because the values were available. But failed when I tried calling the function from another function using local variables. A few small changes, but… Guess I will need to do more thorough testing. Will look at making that a subject for a future series of posts.

I also ran into my usual directory structure issues when executing the module versus importing it. So I had to make a few changes with respect to the chart.chart package import. Here’s a diff of my commits showing the changes.

PS R:\learn\py_play> git diff head^ head
diff --git a/population/chart/descriptive_stats.py b/population/chart/descriptive_stats.py
index fad791c..e1571b1 100644
--- a/population/chart/descriptive_stats.py
+++ b/population/chart/descriptive_stats.py
@@ -38,7 +38,10 @@ if __name__ == '__main__':

 # import here so above code can run when testing module
 # pylint: disable=wrong-import-position
-import chart
+if __name__ == '__main__':
+  import chart
+else:
+  from chart import chart
 from database import population as pdb


@@ -52,7 +55,7 @@ def pop_histogram(fig, ax, cr_nm, p_yr, labels, wts, bins=range(0, 110, 5)):
   plt.ylabel('Population (1000s)')
   plt.title(f'{cr_nm}: {p_yr}')
   #fig.tight_layout()
-  return ax.hist(b_lbls, weights=p_data, bins=bins)
+  return ax.hist(labels, weights=wts, bins=bins)


 def get_low_grp_bound(g_lbl):

And, I decided I’d like to have annotations available if the chart is displayed (debugging and general interest). So, I added a hist_cb_click() function to provide that facility. I added the necessary lines to the test area to enable the click event. I used chart.pick_create() to add a picker to remove an annotation by clicking on it. Pretty much copy and paste with some editing. Also added new parameter, plot_yr.

def hist_cb_click(ax, fig, rects, plot_for, plot_yr, b_lbls, a_grps):
  """Create callback for click event. Annotation to stay visible until it is clicked, so
     need to create new annotation for each click, unlike the mouse hover effect.
  """

  def callback(event):
    annot = ax.annotate(f"", xy=(0,0), xytext=(-5,20), textcoords="offset points", color='white', zorder=99.9,
                        bbox=dict(boxstyle="round", fc="black", ec="b", lw=2, alpha=0.8),
                        arrowprops=dict(arrowstyle="->"), picker=True)
    annot.set_visible(False)
    
    #print(f'event: {event}')
    #print(f"{'double' if event.dblclick else 'single'} click: button={event.button}, x={event.x}, y={event.y}, xdata={event.xdata}, ydata={event.ydata}")
    is_small_bar = False
    is_visible = annot.get_visible()
    
    if event.inaxes == ax:
      j = 0
      for rect in rects:
        cont, _ = rect.contains(event)
        #print(event, cont, rect.get_x())
        if not cont:
          is_small_bar = (event.xdata >= rect.get_x()) and (event.xdata < (rect.get_x() + rect.get_width())) and (event.y <= 100)
          #print(f"{event.xdata} >= {p_bar.get_x()} and {event.xdata} < {p_bar.get_x()+p_bar.get_width()} and {event.y} <= 100 == {is_small_bar}")
        if cont or is_small_bar:
          b_ht = rect.get_height()
          x = rect.get_x() + rect.get_width()/2.
          y = rect.get_y() + b_ht
          annot.xy = (x,y)
          text = f"{plot_for} {plot_yr}: ({a_grps[j]})\n{b_ht:.3f}"
          #if  text += " %"
          if False:
            pass
          else:
            text += " (000s)"
          annot.set_text(text)
          #print(f'annot: {annot}')
          annot.set_visible(True)
          fig.canvas.draw_idle()
          return
        j += 1
  return callback

# and in the test block

  # display the histogram, again regardless of which test is being run, well almost
  b_lbls = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in LABELS]
  fig, ax = plt.subplots(figsize=(10,6))
  n, bins, patches = pop_histogram(fig, ax, cr_nm, p_yr, b_lbls, p_data)
  onclick = hist_cb_click(ax, fig, patches, cr_nm, p_yr, b_lbls, a_grps=LABELS)
  c_id = fig.canvas.mpl_connect('button_press_event', onclick)
  p_id = fig.canvas.mpl_connect('pick_event', chart.pick_create(fig))

Now back to our mean and median information plotting functions.

hist_mean_plus()

I don’t want the caller to know much of anything about how this function works. They should only be passing in the necessary data and “getting back” a suitable plot of the histogram, with the mean and related values plotted on it. I had various ideas about how to do this and how to handle the matplotlib plot and axes variables for the plot. While looking about, I found Python Plotting With Matplotlib (Guide). If you plan on continuing to use matplotlib, I believe it is worth your time to read it. I really wish I had found it sooner.

As a result, I am going to generate my subplot outside the function in my working code, as has been my pattern to date. I will pass in the axes variable as a parameter, or let the function get it using the gca() method (something I saw in another post). The function will also use gcf() to get the currently active figure. Once the plot is generated, I will, in my working code, save the plot to the desired location using the values I got for fig and ax when creating my subplots(). In my case, the desired location would be the img directory in the directory containing my post for the current country. I will also, during development, likely display the plot (plt.show()) to give me an idea how things are going.
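
The “pass in the axes or fall back to the current one” pattern can be sketched as follows. mark_value() is a hypothetical helper for illustration only, and the Agg backend is used so the sketch runs without a display. Writing ax or plt.gca() also works, since an Axes object is truthy; the explicit None check just states the intent more directly.

```python
import matplotlib as mpl
mpl.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt


def mark_value(x_val, label, ax=None):
    """Draw a vertical line at x_val on ax, or on the current axes if ax is None."""
    if ax is None:
        ax = plt.gca()
    line = ax.axvline(x_val, color='r', label=label)
    ax.legend()
    return line


fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
plt.sca(ax_right)  # make the right-hand axes the current one

# Explicit axes: the line goes where the caller says
line_a = mark_value(30, 'explicit axes', ax=ax_left)
# No axes given: the line lands on the current axes (ax_right)
line_b = mark_value(40, 'current axes')

print(line_a.axes is ax_left, line_b.axes is ax_right)
plt.close(fig)
```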

I haven’t quite sorted how I will handle the next plot, the quartiles and outlier marks. A new figure and axes, or try to use the existing one. Though for the latter I will need to remove the mean and standard deviation lines from it before adding the ones for the quartiles and outlier marks. That might be more fun than creating a new plot. We shall see. And, once you’ve given it a shot, have a look at my version.

def hist_mean_plus(cr_nm, p_yr, b_lbls, b_data, ax=None):
  # The calling program shouldn't care about how this function works. It just wants to get a histogram of the data
  # along with the mean and std devs shown saved to a file.
  # cr_nm: country data is for
  # p_yr: year data is for
  # b_lbls: list of bin labels for the x-axis, assumed to be (0-4, 5-9, ..., 95-99, 100+)
  # b_data: y-value/weight for each bin label
  # ax: the axes of a figure or subplot or None

  # if ax is None, get current, active axes
  ax = ax or plt.gca()
  fig = plt.gcf()
  b_mids = get_grp_middle(b_lbls, b_width=5)
  s_mean = get_wt_mean(b_mids, b_data)
  s_sdev = get_binned_sd(b_mids, b_data, s_mean)

  h_lbls = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in b_lbls]
  n, bins, patches = pop_histogram(fig, ax, cr_nm, p_yr, h_lbls, b_data)

  plt.axvline(s_mean, 0, 1, color='r', label=f'Sample Mean: {s_mean:.2f}')
  plt.axvline(s_mean + s_sdev, 0, 1, color='c', label=f'Plus/Minus 1 Std Dev: {s_sdev:.2f}')
  minus1 = s_mean - s_sdev
  if minus1 > 0:
    plt.axvline(minus1, 0, 1, color='c')
  plt.axvline(s_mean + (2 * s_sdev), 0, 1, color='m', label=f'Plus/Minus 2 Std Dev: {2 * s_sdev:.2f}')
  minus2 = s_mean - (2 * s_sdev)
  if minus2 > 0:
    plt.axvline(minus2, 0, 1, color='m')
  plt.axvline(s_mean + (3 * s_sdev), 0, 1, color='k', label=f'Plus/Minus 3 Std Dev: {3 * s_sdev:.2f}')
  minus3 = s_mean - (3 * s_sdev)
  if minus3 > 0:
    plt.axvline(minus3, 0, 1, color='k')

  onclick = hist_cb_click(ax, fig, patches, cr_nm, p_yr, b_lbls, a_grps=b_lbls)
  c_id = fig.canvas.mpl_connect('button_press_event', onclick)
  p_id = fig.canvas.mpl_connect('pick_event', chart.pick_create(fig))

  ax.legend()
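
The helpers get_grp_middle(), get_wt_mean() and get_binned_sd() are not shown in this post. For anyone following along without them, here is a rough sketch of the kind of calculation involved, under stated assumptions: each bin is represented by its midpoint, and the variance divides by the total weight (the population form). The function names below are hypothetical, and the real helpers may well differ in detail.

```python
import math


def bin_midpoints(lows, width=5):
    """Midpoint of each bin, given its lower boundary and a common width."""
    return [low + width / 2 for low in lows]


def weighted_mean(values, weights):
    """Mean of values where each value counts weights[i] times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_sd(values, weights, mean):
    """Standard deviation of binned data (population form: divide by total weight)."""
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / sum(weights)
    return math.sqrt(var)


# Made-up data: four 5-year age groups and their populations (1000s)
lows = [0, 5, 10, 15]
pop = [40.0, 30.0, 20.0, 10.0]

mids = bin_midpoints(lows)
s_mean = weighted_mean(mids, pop)
s_sdev = weighted_sd(mids, pop, s_mean)
print(f'mean: {s_mean:.2f}, std dev: {s_sdev:.2f}')
```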

And in my working code in stats_rand.test.py, I have the following. Though all the code to get a random country is currently missing. I will show you that code somewhere down the road. I will let you sort how you test the function as you see fit.

      # get data
      cy_p_data = pdb.get_1cr_years_all(chk_nm, [use_yr])
      print(cy_p_data)
      
      # plot histogram with mean and std devs
      fig, ax = plt.subplots(figsize=(10,6))
      dstat.hist_mean_plus(chk_nm, use_yr, labels, cy_p_data[use_yr], ax=ax)
      plt.savefig(f'{str(img_dir)}/hist_mean_{clnd}.png')
      plt.show()
      plt.close('all')

hist_median_plus() Cancelled — Refactor Called For

This one would of course be pretty much the same as hist_mean_plus(). Except for calculating the median related values and plotting them on the histogram, the code would be virtually identical. As such, I have decided to refactor how the module works. I am going to look at adding a new parameter to the histogram plotting function. It will either be a function or the default of None. If a function is passed in, it will be called once the histogram is drawn. The function is expected to add the lines for either the mean related values or the median related values to the histogram, which should be on the currently active Axes (and Figure). And, of course, any such functions passed in must all have the same parameter signature.

Rather than rework pop_histogram(), I am going to define a new function, pop_hist_plus(). Then define two new functions, add_mean_plus() and add_median_plus() to add the appropriate lines and legend to the plot. Give it a shot and see you back here when you’re ready. I will add the calls to the annotation functions to pop_hist_plus().

Here’s what the new functions in chart/descriptive_stats.py look like.

def pop_hist_plus (fig, ax, cr_nm, p_yr, labels, wts, bins=range(0, 110, 5), add_lns=None):
  ax = ax or plt.gca()
  fig = fig or plt.gcf()
  plt.xlabel('Age')
  plt.ylabel('Population (1000s)')
  plt.title(f'{cr_nm}: {p_yr}')
  h_lbls = [int(grp.split('-')[0]) if grp[-1] != '+' else int(grp[:-1]) for grp in labels]
  n, bins, patches = plt.hist(h_lbls, weights=wts, bins=bins)

  onclick = hist_cb_click(ax, fig, patches, cr_nm, p_yr, labels, a_grps=labels)
  c_id = fig.canvas.mpl_connect('button_press_event', onclick)
  p_id = fig.canvas.mpl_connect('pick_event', chart.pick_create(fig))

  #fig.tight_layout()
  if add_lns:
    add_lns(labels, wts, ax=ax)


def add_mean_plus(b_lbls, b_data, ax=None):
  ax = ax or plt.gca()

  b_mids = get_grp_middle(b_lbls, b_width=5)
  s_mean = get_wt_mean(b_mids, b_data)
  s_sdev = get_binned_sd(b_mids, b_data, s_mean)

  ax.axvline(s_mean, 0, 1, color='r', label=f'Sample Mean: {s_mean:.2f}')
  ax.axvline(s_mean + s_sdev, 0, 1, color='c', label=f'Plus/Minus 1 Std Dev: {s_sdev:.2f}')
  minus1 = s_mean - s_sdev
  if minus1 > 0:
    ax.axvline(minus1, 0, 1, color='c')
  ax.axvline(s_mean + (2 * s_sdev), 0, 1, color='m', label=f'Plus/Minus 2 Std Dev: {2 * s_sdev:.2f}')
  minus2 = s_mean - (2 * s_sdev)
  if minus2 > 0:
    ax.axvline(minus2, 0, 1, color='m')
  ax.axvline(s_mean + (3 * s_sdev), 0, 1, color='k', label=f'Plus/Minus 3 Std Dev: {3 * s_sdev:.2f}')
  minus3 = s_mean - (3 * s_sdev)
  if minus3 > 0:
    ax.axvline(minus3, 0, 1, color='k')

  ax.legend()


def add_median_plus(b_lbls, b_data, ax=None):
  ax = ax or plt.gca()

  grp_lows = get_grp_low_bdrys(b_lbls)

  s_q2 = get_wt_quartile(grp_lows, b_data)
  s_q1 = get_wt_quartile(grp_lows, b_data, qnbr=1)
  s_q3 = get_wt_quartile(grp_lows, b_data, qnbr=3)
  iqr = s_q3 - s_q1
  low_out = s_q1 - (1.5 * iqr)
  up_out = s_q3 + (1.5 * iqr)

  # draw on the axes that was passed in (or looked up above), as add_mean_plus() does
  ax.axvline(s_q1, 0, 1, color='c', label=f'1st Quartile: {s_q1:.2f}')
  ax.axvline(s_q2, 0, 1, color='r', label=f'Sample Median (2nd Quartile): {s_q2:.2f}')
  ax.axvline(s_q3, 0, 1, color='m', label=f'3rd Quartile: {s_q3:.2f}')
  if up_out < 109.0:
    ax.axvline(up_out, 0, 1, color='k', label=f'Upper Outlier Value: {up_out:.2f}')
  if low_out > 0:
    # label the lower fence with its own value, not the upper one
    ax.axvline(low_out, 0, 1, color='g', label=f'Lower Outlier Value: {low_out:.2f}')

  ax.legend()
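
get_grp_low_bdrys() and get_wt_quartile() come from earlier work and are not shown here. As a rough idea of what a quartile calculation on binned data can look like, here is a sketch that assumes linear interpolation inside the bin containing the target count. grouped_quantile() is a hypothetical stand-in and not necessarily how the real function works. The 1.5 × IQR outlier values follow the same formula as in add_median_plus().

```python
def grouped_quantile(lows, weights, q, width=5):
    """Estimate the q-th quantile (0 < q < 1) of binned data by linear interpolation."""
    target = q * sum(weights)
    cum = 0.0
    for low, wt in zip(lows, weights):
        # the quantile falls in the first non-empty bin whose running total reaches the target
        if wt > 0 and cum + wt >= target:
            return low + (target - cum) / wt * width
        cum += wt
    return lows[-1] + width


# Made-up data: four equally populated 5-year bins
lows = [0, 5, 10, 15]
pop = [10.0, 10.0, 10.0, 10.0]

s_q1 = grouped_quantile(lows, pop, 0.25)
s_q2 = grouped_quantile(lows, pop, 0.50)
s_q3 = grouped_quantile(lows, pop, 0.75)
iqr = s_q3 - s_q1
low_out = s_q1 - (1.5 * iqr)
up_out = s_q3 + (1.5 * iqr)
print(s_q1, s_q2, s_q3, low_out, up_out)
```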

stats_rand.test.py

Okay, I am doing things in this module that you will likely not care about. I am going to get the module to do some of the work of producing my blog posts for me. So, I need to create appropriate directory structures, create a blog post file (index.md in my case) and save the plots to the appropriate image directory. You probably only really need to generate the plots and such for display on screen or at the command line.

My stats_rand.test.py module currently looks like the following.

import pandas as pd
import matplotlib.pyplot as plt
import pathlib

if __name__ == '__main__':
  # if running module directly, need to set system path appropriately so that Python can find local packages
  import sys
  from pathlib import Path
  file = Path(__file__).resolve()
  parent, root = file.parent, file.parents[1]
  print(f"parent: {parent}; root: {root}\n")
  sys.path.append(str(root))
  # Additionally remove the current file's directory from sys.path
  try:
    sys.path.remove(str(parent))
  except ValueError: # Already removed
    pass

# pylint: disable=wrong-import-position
# pylint: disable=import-error
from chart import chart
from chart import descriptive_stats as dstat
from database import population as pdb
from database import rc_names as rcn


def clean_name(c_name):
  cln_nm = c_name
  chk_for = ['people', 'republic']
  for label in chk_for:
    if label.casefold() in c_name.casefold():
      cln_nm = c_name.split(" ")[0]
      break
  if ' ' in cln_nm:
    cln_nm = cln_nm.replace(' ', '_')
  return cln_nm.casefold()

def create_md_file(path, c_name, shrt_nm, use_dt='2020-11-30'):
  # use_dt should match the data for the post I will be using to link to this file
  f_matter = '---\n'
  f_matter += f'title: "Descriptive Statistics: {c_name}"\n'
  f_matter += f'date: {use_dt}T06:10:10-07:00\n'
  f_matter += 'draft: true\n'
  f_matter += 'unlisted: true\n'
  f_matter += 'tags: ["statistics","descriptive statistics","central tendency","dispersion","distribution","shape","variability","skewness"]\n'
  f_matter += '---\n\n'

  init_text = 'I was going to plot the basic histogram before doing anything else.'
  init_text += " But, decided that wasn't necessary as it would be drawn and included at least a couple of times in the post."
  init_text += '\n\n# Mean and Standard Deviation\n\n'
  init_text += f"<img src='img/hist_mean_{shrt_nm}.png' alt='histogram showing mean and standard deviations' loading='lazy' style='width:800px;height:478px;margin:-.25em auto;'>"

  init_text += '\n\n# Quartiles and Outliers\n\n'
  init_text += f"<img src='img/hist_quartiles_{shrt_nm}.png' alt='histogram showing quartiles and outlier marks' loading='lazy' style='width:800px;height:478px;margin:-.25em auto;'>"

  with open(path / 'index.md', 'w') as lst_fl:
    lst_fl.write(f_matter)
    lst_fl.write(init_text)


if __name__ == '__main__':
  blog_path = 'R:/hugo/tooOldCode/content/post'
  CSV_FL = 'data/country-continent-codes.csv'
  # first line is comment, second line is header list
  df = pd.read_csv(CSV_FL, comment='#', usecols = ['continent','country'])

  use_yr = '2011'
  nms_done = []
  c_cnt = 1
  labels = chart.get_agrp_lbls()     

  while True:
    # get rand name from country array
    rand_r = df.sample()
    c_nm = rand_r['country'].iloc[0]
    t_nm = c_nm
    # tidy version of name from file before continuing
    if ',' in t_nm:
      t_nm = t_nm.split(',')[0]
    if '(' in t_nm:
      t_nm = t_nm.split('(')[0].strip()
    print(f"\n{c_cnt}: Checking for '{c_nm} ({t_nm})':")
    chk_nm = rcn.chk_name(t_nm)
    if chk_nm != '':
      if chk_nm not in nms_done:
        c_cnt += 1
        nms_done.append(chk_nm)
      else:
        chk_nm = ''
    print(f"\t {'did not find' if chk_nm=='' else 'found'} '{t_nm}' in population database")
    if chk_nm:
      ### create directory for this country in blog path, blog_path
      clnd = clean_name(t_nm)
      c_dir_nm = f'rand_{clnd}'
      base_path = Path(blog_path) 
      dir_path = base_path / c_dir_nm
      print(base_path, '\n', dir_path)
      dir_path.mkdir() 
      ### create .md file for this country
      create_md_file(dir_path, chk_nm, clnd)
      ### create img dir under country blog dir
      img_dir = dir_path / 'img'
      img_dir.mkdir()
      
      # get data
      cy_p_data = pdb.get_1cr_years_all(chk_nm, [use_yr])
      print(cy_p_data)
      
      # plot histogram with mean and std devs
      fig, ax = plt.subplots(figsize=(10,6))
      dstat.pop_hist_plus(fig, ax, chk_nm, use_yr, labels, cy_p_data[use_yr], add_lns=dstat.add_mean_plus)
      plt.savefig(f'{str(img_dir)}/hist_mean_{clnd}.png')
      plt.show()

      # plot histogram with quartiles and outlier values
      fig, ax = plt.subplots(figsize=(10,6))
      dstat.pop_hist_plus(fig, ax, chk_nm, use_yr, labels, cy_p_data[use_yr], add_lns=dstat.add_median_plus)
      plt.savefig(f'{str(img_dir)}/hist_quartiles_{clnd}.png')
      plt.show()
      plt.close('all')

    # for testing only generate post/images for 2 random countries
    if c_cnt > 2:
      print()
      break

Sorry, Done for This Post

Way more work than I expected so far. Getting to be a very lengthy post. So, I will leave generating the table of all the descriptive statistics for another day. But, here are the two posts I generated using the above code.

Until next time — 1 week today.

Resources

Matplotlib: Saving plots
Matplotlib: How to change default path for “save the figure” in python?
Customizing Matplotlib with style sheets and rcParams
pandas Getting started
10 minutes to pandas
pandas API reference
pandas.DataFrame.drop
pandas.DataFrame.loc
pandas.DataFrame.sample
pandas.read_csv
How to randomly select rows from Pandas DataFrame
Why you should be using pathlib
