Using Different Samples

I didn’t really want to generate samples in this module then plot them. So, I am going to look at using the samples I saved previously. My plan is to put them into individual files in JSON format. Then use a new command line parameter to specify the file to use. May not be the best plan, but allows me to look at using JSON for a data exchange/storage format.

JSON


JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

…

JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages also be based on these structures.

Introducing JSON, json.org

It should be pretty clear by now that JSON is going to look a lot like Python dictionaries filled with values and lists. Though JSON is a little stricter about the use of quotes: strings must be in “double quotes”. And no trailing comma is allowed after the last item in an array/list or dictionary/hash.
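To see that strictness in action, here is a minimal sketch using Python's built-in json module. The helper name is_valid_json and the sample strings are made up for illustration.

```python
import json

# Valid JSON: double-quoted strings, no trailing commas.
valid = '{"size": 30, "rpts": [30, 50, 70, 100]}'

# Both of these are fine as Python literals but are rejected by the JSON parser.
single_quotes = "{'size': 30}"
trailing_comma = '{"rpts": [30, 50, 70, 100,]}'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


for text in (valid, single_quotes, trailing_comma):
    print(f"{text!r}: {is_valid_json(text)}")
```

Only the first string should parse; the other two raise json.JSONDecodeError, which the helper turns into False.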

The conversion of data to JSON format is generally referred to as serialization. The term derives from the concept of transforming data into a series of bytes to be stored or sent over a network (e.g. the internet). Hence serial and serialize. The reverse process is all too logically referred to as deserialization. But we can just as simply think of it as writing and reading JSON to/from a file. And fortunately for us, Python includes a module providing a JSON encoder and decoder.
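As a quick illustration of the two directions, here is a small round trip with the json module. The record and its values are invented for the example; json.dumps and json.loads work on strings, while json.dump and json.load do the same with file objects.

```python
import json

# A made-up record in the shape of a Python dictionary (values are illustrative only).
record = {"size": 30, "rpts": [30, 50, 70, 100], "means": [28.5, 27.9]}

# Serialize: Python object -> JSON text (a str).
text = json.dumps(record)

# Deserialize: JSON text -> Python object.
restored = json.loads(text)

print(type(text).__name__)
print(restored == record)
```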

Sample Data Contents

Next I looked at what data I wanted to have for each collection of repeated samples. The obvious one is the actual mean age for each sample in the collection. Couldn’t really do much without that data. But, I also thought the following would be handy:

individual sample size: used in the title (e.g. 30 countries per sample)
repetition counts: used to determine when to plot the estimated world average age (e.g. 30, 50, 70, 100)
seed: the seed used when generating this sample (just in case, don’t currently need it, but…)
sample standard deviation: my previous code was already saving that with the sample means, so why recalculate
sample standard error: my previous code was already saving that with the sample means, so why recalculate

Yes, not currently using the last 3 items, but I had them so figured I’d keep them.

So the JSON for a single execution of the repeated sampling would look something like this (with appropriate values replacing the placeholders below):

{
  "means": [],
  "size": 0,
  "rpts": [30, 50, 70, 100],
  "seed": 0,
  "sd": 0,
  "se95": 0
}

But, I also expect I am going to be saving more than one execution per file. So, I decided I’d use the seed value, as a string, for the key for each individual data collection in the file. Needed something and the same seed value should always result in the same data, so seemed like a good, unique value to use. So, we are looking at something like:

{
  "12792": {
    "means": [],
    "size": 0,
    "rpts": [30, 50, 70, 100],
    "seed": 12792,
    "sd": 0,
    "se95": 0
  },
  "seed2 as str": {
    "means": [],
    "size": 0,
    "rpts": [30, 50, 70, 100],
    "seed": 0,
    "sd": 0,
    "se95": 0
  }
}

You will recall I used the seed 12792 for a few runs of varying sample sizes. So, I thought I’d save myself some typing. The above template is in a file I will simply copy for each file I wish to create.
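One reason the seed has to be a string when used as a key: JSON object keys are always strings. A short sketch of what the json module does with an integer key (the dictionary contents here are cut down for illustration):

```python
import json

# Integer keys are allowed in a Python dict...
by_seed = {12792: {"size": 30, "seed": 12792}}

# ...but json.dumps writes them out as strings, and json.loads
# hands them back as strings.
restored = json.loads(json.dumps(by_seed))

print(list(restored))              # the top-level keys are now str
print(restored["12792"]["seed"])   # the inner "seed" value is still an int
```

So after loading a file, look collections up with the string form of the seed, and use the inner "seed" value if you need the integer.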

I should mention that I am not going to write code to serialize the data I had previously written to various files. I am just going to create the files by hand. Which I have done. Here are two examples. Note, the comment lines (starting with #) are just to give you the file names; they are not actually in the files. JSON does not support comments.

What’s in each file is from the opening { (no indent) to the closing } (no indent).
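If you would rather not build the files by hand, the following is a minimal sketch of one way to write them with the json module. It is not what was done for this post. The helper name add_collection and the file name smpl_demo.json are made up, the values are placeholders, and the file goes into a temporary directory so nothing real gets overwritten.

```python
import json
import tempfile
from pathlib import Path


def add_collection(fl_pth, collection):
    """Add one data collection to a JSON file, keyed by its seed as a string.

    Hypothetical helper, not part of the original post.
    """
    fl_pth = Path(fl_pth)
    # Start from the existing file contents, if there are any.
    samples = json.loads(fl_pth.read_text()) if fl_pth.exists() else {}
    samples[str(collection["seed"])] = collection
    fl_pth.write_text(json.dumps(samples, indent=2))
    return samples


# Placeholder values only, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_pth = Path(tmp_dir) / "smpl_demo.json"
    add_collection(demo_pth, {"means": [], "size": 0, "rpts": [30, 50, 70, 100],
                              "seed": 1, "sd": 0, "se95": 0})
    saved = add_collection(demo_pth, {"means": [], "size": 0, "rpts": [30, 50, 70, 100],
                                      "seed": 2, "sd": 0, "se95": 0})
    on_disk = json.loads(demo_pth.read_text())

print(sorted(saved))
```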

# population\play\json\smpl_20_100.json

{
  "12792": {
    "means": [28.75061644818672, 28.227626946591602, 29.655176552152618, 27.865452997176014, 28.103313516908106, 29.359921868314768, 26.591554090327282, 27.61221562674137, 28.983101048294085, 27.47649345533325, 27.250644957551163, 28.87935544036868, 27.448730397385884, 27.9804547225386, 28.343204477535462, 29.424128434415174, 28.316146237190388, 26.517280775267082, 22.979145391015834, 27.467216811322672, 29.77390677633244, 27.72019687907781, 29.486489695764675, 28.630465777845227, 27.92692088355997, 27.87530187651897, 29.476762020721935, 29.324313273028277, 29.541452688873083, 27.929434048463552, 30.637264324383562, 29.918709091529024, 30.41696374758401, 29.432916738544584, 29.006221857378065, 27.652360405188578, 26.94818742722108, 28.77971608946911, 28.897816430968927, 31.98298172242101, 28.04316798576821, 27.99415797047221, 23.90865008559657, 27.20332642629409, 25.4746208684126, 27.731268440272235, 26.994175421720414, 28.246311123846272, 27.657529691231566, 30.656389954529384, 29.28365751905851, 28.966132940615307, 24.713621541842134, 29.615465539305468, 26.329110243377915, 26.62178789370701, 29.035826333188385, 28.19350384367053, 30.288774157632496, 25.793660229439467, 25.701668267634755, 27.040465266761316, 30.692247894951766, 25.788449484993702, 29.68794807855361, 27.306495827527936, 31.17580240381936, 30.69661336910454, 26.218769826642962, 30.37425014132559, 24.746281779967163, 25.946487057135787, 25.35922247639652, 22.936370927588058, 29.57360433099165, 28.855109991398862, 29.68187704258535, 29.217886680171077, 27.734916927133252, 25.04616433795088, 25.991115291255518, 26.190048832961647, 27.453019468361692, 31.86769495038628, 27.054339323463694, 28.375709008048588, 26.49858470550296, 29.509316304616608, 30.169740697075376, 28.325489739957018, 28.176821760090494, 29.681624169943984, 31.036630584714818, 27.83712204884524, 29.12114236472302, 25.634622559835, 29.55621280838424, 30.121894178279337, 27.069271769185544, 30.19987859890068],
    "size": 20,
    "rpts": [30, 50, 70, 100],
    "seed": 12792,
    "sd": 1.8208590877142805, 
    "se95": 0.8521882851108968
  }
}

# and, population\play\json\smpl_30_100.json

{
  "14911": {
    "means": [30.845716086321577, 27.714993270689398, 26.539019466198756, 28.652412751981608, 25.91251105018476, 28.64765858770585, 28.43607425900043, 29.155768024294613, 28.208172328617486, 26.595183258570575, 28.865488929001568, 30.686593914044185, 28.949639621950695, 25.62845914668899, 29.19201774897019, 28.047594512491685, 28.22093542183883, 28.374921165089173, 30.690075634758678, 27.271132349654113, 29.740443263446412, 27.64449438259917, 28.952287681557547, 29.052409562050418, 29.115468450905333, 27.783257243947148, 27.548704530269532, 30.62276598022423, 28.94742189084511, 29.413534278654236, 29.38974836049136, 29.380930696708813, 28.377958686298296, 25.73136328589792, 26.57861731239424, 29.237794824735797, 26.808873073718367, 28.6745133768866, 29.080921150558893, 28.754975906219766, 28.72415124866063, 26.338287455602394, 30.080391362432472, 25.897604085649508, 28.972778954289655, 29.260116185271198, 28.86340953522474, 27.28060642825256, 30.805984223480607, 27.76411635202121, 31.021686055148223, 25.992921118938632, 30.12321928731231, 27.212553791272896, 30.905215921641773, 28.74129849170663, 27.63761576464364, 29.628767187005725, 28.594416579332297, 28.749408044766366, 27.86536302044773, 26.620086966452188, 27.238471545041484, 28.207836094665545, 28.67693576548127, 29.592031794726243, 28.62629156366716, 25.820627686102654, 29.207039853171004, 27.90139482030897, 29.93671731742414, 26.678888773631222, 28.494604917195666, 28.895453164116464, 28.600138218644503, 30.0692463710327, 28.591871496870056, 28.51242379871638, 28.948606671590202, 27.33776348504623, 29.650754154481263, 28.717365896940095, 27.569791814239803, 28.575197523811678, 27.529728084158055, 27.352196701851557, 31.08869246328819, 27.544809201406295, 28.729690047607775, 30.389801507642776, 27.51183455821836, 28.67564013270134, 30.127738834331428, 28.625162276233024, 30.29438580862724, 28.95852097757391, 26.077290603992097, 28.705586201250167, 32.546639208883455, 26.225844583789687],
    "size": 30,
    "rpts": [30, 50, 70, 100],
    "seed": 14911,
    "sd": 1.3856793739027309,
    "se95": 0.5174211817944013
  },
  "12792":{
    "means": [31.28326429954945, 28.191895565034073, 28.727985724366366, 27.70939894167416, 27.656322312061775, 28.824035444817678, 27.47640521301796, 28.96037372910625, 28.385322648153053, 27.338240090736935, 25.056474218971594, 27.963403341818783, 26.697959590465896, 28.372567900100634, 27.999016359658174, 30.297985763794077, 27.87736256976781, 26.757614072254714, 24.81196620027578, 28.89909659641964, 29.414124666889165, 28.027431122526064, 30.563114279146205, 29.74470696719392, 28.179319458130315, 28.04800020512781, 28.049771387215692, 28.701107450310126, 28.888082407658086, 28.986572261932704, 29.585221703494717, 28.80238752512906, 29.86645164304201, 28.600097268947295, 26.39998257271427, 26.229403562416415, 26.451586593137254, 28.602533660899287, 30.15337978372797, 31.258570411318118, 26.378682967452423, 28.29767276233807, 25.80549747058265, 27.75134254710547, 27.02038911742787, 25.6595158289651, 25.7594230541589, 28.79904168260596, 29.97175367316981, 29.58181938016509, 29.11732050766095, 29.107785705204616, 25.476281501311714, 29.06699452403749, 28.72967767551915, 27.124573523425884, 28.025969686660268, 26.931045391239632, 29.42178093851054, 26.25226518588986, 26.408176895537107, 27.941637626388715, 29.55503324332735, 27.665139076223454, 29.356709639249647, 29.68698846392244, 30.260272072210494, 30.796777925195936, 25.962167493639882, 29.81086360686921, 27.952093295106096, 26.088663089666483, 26.92599382220893, 24.157613856044918, 27.973047122485912, 28.684023738865413, 28.692692487362294, 29.972750089540494, 28.491854904773387, 25.883938684992014, 27.014403742093837, 27.56619623789559, 28.564507169592762, 29.58816294996719, 28.05113742809236, 29.0858177703047, 25.255883814330446, 28.55296591744723, 29.003460374679143, 29.709674093370694, 29.220242120646514, 30.24237524597013, 29.96370245185437, 27.54697639090017, 28.91091835475534, 26.792206427781252, 29.02071169953318, 29.916868952748082, 26.33091592411183, 30.043439975066466],
    "size": 30,
    "rpts": [30, 50, 70, 100],
    "seed": 12792,
    "sd": 1.514080829486151,
    "se95": 0.565367073277923
  }
}

Deserializing

I am currently doing my testing in a separate test module. Not my main animation module. First of all let’s import the json encoder/decoder module into our test module.

import json

Next let’s just try reading in a file and printing some of the data. Because I know where I will be running my module from, I am hard coding the data directory path and the file name. I will eventually add code to get the file name from the command line.

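# hard-coded for now: data directory and file name
fl_dir = "population/play/json"
fl_in = "smpl_30_100.json"
fl_pth = f"{fl_dir}/{fl_in}"
# json.load() turns the file contents into a dict keyed by seed (as a string)
with open(fl_pth, 'r') as fin:
  samples = json.load(fin)
for s_seed, s_data in samples.items():
  s_size = s_data['size']
  s_rpts = len(s_data['means'])
  s_sum = sum(s_data['means'])
  # the mean is taken over the number of repeated samples, not the sample size
  mean = s_sum / s_rpts
  # note: the message shows s_size after the "/", though the divisor actually used is s_rpts
  print(f"mean for {s_rpts} repeated samples of size {s_size} with seed {s_seed}: {s_sum:.2f} / {s_size} = {mean:.2f}")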

And, at the command line I get:

(base) PS R:\learn\py_play> conda activate ani-3.8
(ani-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\json.test.py
mean for 100 repeated samples of size 30 with seed 14911: 2850.78 / 30 = 28.51
mean for 100 repeated samples of size 30 with seed 12792: 2820.76 / 30 = 28.21
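One thing to watch in that output: the number shown after the slash is the sample size (30), but the mean is really the sum divided by the number of repeated samples (2850.78 / 100 = 28.51). A small sketch of the same loop with the message showing the divisor that is actually used. The function name describe and the tiny data set are invented for the example, and statistics.fmean needs Python 3.8 or later.

```python
import json
from statistics import fmean

# Tiny stand-in for a file's contents (illustrative values, not the real data).
samples = json.loads('{"1": {"means": [28.0, 29.0, 27.5, 28.5], "size": 30}}')


def describe(s_seed, s_data):
    """Build the summary line, showing the divisor that is actually used."""
    s_rpts = len(s_data["means"])
    s_sum = sum(s_data["means"])
    mean = fmean(s_data["means"])
    return (f"mean for {s_rpts} repeated samples of size {s_data['size']} "
            f"with seed {s_seed}: {s_sum:.2f} / {s_rpts} = {mean:.2f}")


for s_seed, s_data in samples.items():
    print(describe(s_seed, s_data))
```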

Command Line Argument(s)

Okay, let’s add a command line argument to get the name of the file to process. To simplify things I have decided that all such files will reside in a specific directory. And, to eliminate the need to determine a suitable relative path, I am going to use the full path to the directory. Note the lack of a terminating /. And, I am going to set my file name variable to an empty string, i.e. no default file name.

  import json

- fl_dir = "population/play/json"
+ fl_dir = "r:/learn/py_play/population/play/json"

- fl_in = "smpl_30_100.json"
+ fl_in = ""

Now, we need to make sure we have imported argparse (I am still working in my test module). Then we can create a suitable parameter, get the filename and check if it exists. If no argument on the command line, issue error message and exit. Let’s start with the latter. Note that I am returning an exit status within the module. Just ‘cuz I got reminded of that in an on-line course I have recently started working on.

import argparse

parser = argparse.ArgumentParser()
# long name preceded by --, short by single -, value is taken as a string (the file name)
parser.add_argument('--file_name', '-f', help='Name of file to process: ')

args = parser.parse_args()

# get file name; if none was given, print a message and exit with a non-zero status
if args.file_name:
  fl_in = args.file_name
else:
  print(f"File name required!\n\tUseage: {__file__} <data file name>\n")
  exit(1)

A bit of coding convention. Executables or processes (e.g. a function call) can, and usually do, return an exit code or exit status. When you see error messages on screen, there are often fairly meaningless “numbers” displayed along with the textual message. That number is often the app’s/program’s “exit code”. Convention says an exit code of 0 indicates everything is okay. Any other number indicates a condition, error or issue of some sort. The values for the codes and their meanings should be in the documentation somewhere (maybe).
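To see exit codes from the calling side, here is a small sketch that runs two throwaway Python child processes with subprocess.run and looks at the returncode of each. The one-line child scripts are made up for the demonstration.

```python
import subprocess
import sys

# Run two tiny child Python processes and look at the exit status of each.
ok = subprocess.run([sys.executable, "-c", "import sys; sys.exit(0)"])
failed = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])

print(ok.returncode)      # 0: everything went fine
print(failed.returncode)  # non-zero: something went wrong
```

In PowerShell the status of the last program run is available in $LASTEXITCODE.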

Now let’s make sure the file exists. If not, issue an “error” message and exit. For this I am going to use pathlib, so make sure that is also imported before continuing. Well, or whatever file system module you wish to use.

import pathlib

# if we got a file name, make sure file exists
fl_pth = pathlib.Path(f"{fl_dir}/{fl_in}")
if not fl_pth.exists():
  print(f"File name given, {fl_pth}, could not be found.\n")
  exit(1)

# Otherwise go ahead and process the file.

And, a few simple tests:

(ani-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\json.test.py
File name required!
        Useage: R:\learn\py_play\population\play\json.test.py <data file name>

(ani-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\json.test.py -f testing.py
File name given, r:\learn\py_play\population\play\json\testing.py, could not be found.

(ani-3.8) PS R:\learn\py_play> python.exe R:\learn\py_play\population\play\json.test.py -f smpl_50_100.json
mean for 100 repeated samples of size 50 with seed 12792: 2828.87 / 50 = 28.29
mean for 100 repeated samples of size 50 with seed 16588: 2818.37 / 50 = 28.18

Now, if you wish you could add another command line argument for the directory, or refactor the above to take the “full path” for the file name rather than separate the directory path and the file name as I have done.
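A rough sketch of the first option, assuming a second argument for the directory. The names --dir_name, build_parser and get_path are hypothetical, and the default directory is the one used above. Passing a list to parse_args stands in for the real command line, so the sketch can be run without any arguments.

```python
import argparse
from pathlib import Path

# Default directory taken from the post; --dir_name, build_parser and get_path
# are hypothetical names used only for this sketch.
DEFAULT_DIR = "r:/learn/py_play/population/play/json"


def build_parser():
    parser = argparse.ArgumentParser()
    # required=True lets argparse print a usage message and exit with status 2 itself
    parser.add_argument("--file_name", "-f", required=True,
                        help="Name of file to process")
    parser.add_argument("--dir_name", "-d", default=DEFAULT_DIR,
                        help="Directory holding the data files")
    return parser


def get_path(argv):
    """Return the full path built from the parsed command line arguments."""
    args = build_parser().parse_args(argv)
    return Path(args.dir_name) / args.file_name


# Passing a list here stands in for the real command line (sys.argv[1:]).
print(get_path(["-f", "smpl_50_100.json"]))
print(get_path(["-f", "smpl_50_100.json", "-d", "some/other/dir"]))
```

The existence check on the resulting path would stay the same as before.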

Add Changes to animated_chart.py

Ok, now it’s time to put the “test code” into the actual animated chart module. “animated_chart.py” in my case. Then more testing. Though we do have to deal with the possibility of more than one dataset in any given file. So a loop and the shifting around of pieces of code to get things to work. I also defined a number of variables outside my loop to ensure they would be available to the animation callback (i.e. variable scope).
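Just to give the general shape of such a loop, here is a minimal sketch without any of the animation code. It assumes the estimate plotted at each repetition count is simply the mean of the sample means up to that count, which is my reading of the "rpts" field rather than something taken from the actual module. The data, the function name estimates_at and the variable all_estimates are invented for the example.

```python
from statistics import fmean

# Stand-in for the result of json.load(); values are illustrative only.
samples = {
    "1": {"means": [28.0, 29.0, 27.0, 30.0], "size": 30, "rpts": [2, 4]},
    "2": {"means": [27.0, 27.0, 29.0, 29.0], "size": 30, "rpts": [2, 4]},
}


def estimates_at(s_data):
    """Map each repetition count to the mean of the first that-many sample means."""
    return {n: fmean(s_data["means"][:n]) for n in s_data["rpts"]}


# Defined before the loop so the results are still available afterwards,
# e.g. to a callback defined at module level.
all_estimates = {}
for s_seed, s_data in samples.items():
    all_estimates[s_seed] = estimates_at(s_data)
    print(s_seed, all_estimates[s_seed])
```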

But, I am not going to show all that in this post. I will leave it to you how you go about it. But, if you really wish to see my module’s code you can check out this post.

That’s It For This One

Done for another post I think. The next post or two are likely to be about unrelated subjects. One possibly about my use of Python to write a “utility” to solve a problem I was having. The other perhaps about a learning experience while working on last year’s Advent of Code. Until next time, enjoy coding and learning.

Note, I have recently started working on Harvard’s free CS50 course. Figured why not. And will undoubtedly learn something new. At this point, 3 weeks in, I highly recommend it. Though I am finding it packs a lot into each week’s lecture. And the weekly labs and/or problem sets require a fair bit of my time. But there is a good chance I am just slower than most.

Resources

You’ve seen most of these in the last post.

Introducing JSON
json — JSON encoder and decoder
Working With JSON Data in Python
Parsing Nested JSON Records in Python
Serialize and Deserialize complex JSON in Python
Extract Nested Data From Complex JSON
pathlib — Object-oriented filesystem paths
Exit Status

Too Old To Code
