I wasn’t planning to cover my effort to put all the rainfall data from rainGauge_2019.inc.php into the database. But I have decided to do so, as it is likely relevant to the project as a whole. And it will certainly test the state of my current classes and modules.

I had thought I’d start a new module for this, but the prior data2db.py is currently pretty empty except for some development/test code. So I will just continue to use it.

Process Complete Rain Gauge File

As mentioned, the rain gauge file contains data for the period 2014.03.01 to 2019.12.31. I expect processing that file in total will give me plenty of test data for the next step in my code development.

Basic Setup

Some new imports. Instantiate the two classes. And see how things look. I had thought about putting the code in the if chk_i_rf: block into a method in the Weather_db class. But I wasn’t sure that was a good idea, as I am only using it for development purposes. So, it has stayed where it is.

... ...
from utils.dates_dn import Dates_done
from utils.weather_db import Weather_db
from utils.rg_data import get_rg_dsrcs, get_rg_data, parse_rg_row
... ...
if __name__ == "__main__":
... ...
  do_rg_2019 = True

  cwd = Path(__file__).cwd()
  d_srcs = get_rg_dsrcs()
... ...
    # set up generator for first data file
    gen_d = get_rg_data(d_srcs[0])

    # instantiate dates done class using non-existent csv file
    cwd = Path(__file__).cwd()
    fl_pth = cwd/"data"
    fl_nm = "dates_done.csv"
    dd_pth = fl_pth/fl_nm
    dts_dn = Dates_done(dd_pth)
    # probably should add __repr__ or __str__ to both classes
    print(f"\ndts_dn:\n\tf_pth: {dts_dn.fl}, c_init: {dts_dn.c_init}\n\tdts: {dts_dn.dts}")

    # instantiate database class using empty database file
    cwd = Path(__file__).cwd()
    fl_pth = cwd/"data"
    fl_nm = "weather.db"
    db_pth = fl_pth/fl_nm
    rfall = Weather_db(db_pth)
... ...
    if chk_i_rf:
      for c_tnm in rfall.tnms.values():
        rslt = rfall.get_tbl_rw_cnt(c_tnm)
        print(f"\nrows in {c_tnm} table: {rslt}")

        if c_tnm == "rainfall":
          nx_q = f"""SELECT substr(datetime, 1, 7) as c_month, MIN(daily) as minMonth, MAX(daily) as maxMonth
            FROM {c_tnm}
            GROUP BY c_month;"""
        elif c_tnm == "rf_mon_history":
          nx_q = f"SELECT month, avg FROM {c_tnm}"
        mn_data = rfall.qry_pd(nx_q)
        print("\n", mn_data)

Let’s run that bit and see what we get.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

dts_dn:
        f_pth: R:\learn\dashboard\data\dates_done.csv, c_init: {}
        dts: {}

rows in rainfall table: 0

 Empty DataFrame
Columns: [c_month, minMonth, maxMonth]
Index: []

rows in rf_mon_history table: 0

 Empty DataFrame
Columns: [month, avg]
Index: []

Not particularly enlightening, but definitely what I expected.
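
For what it’s worth, if I do eventually move the chk_i_rf block into the Weather_db class, I expect it would end up something like the sketch below. The method name is made up; tnms, get_tbl_rw_cnt and qry_pd are the attributes/methods already used above.

  # rough sketch only, not currently in utils/weather_db.py
  def prnt_tbl_summaries(self):
    """Print row counts and a quick check query for each table."""
    for c_tnm in self.tnms.values():
      print(f"\nrows in {c_tnm} table: {self.get_tbl_rw_cnt(c_tnm)}")
      if c_tnm == "rainfall":
        nx_q = f"""SELECT substr(datetime, 1, 7) as c_month,
            MIN(daily) as minMonth, MAX(daily) as maxMonth
          FROM {c_tnm}
          GROUP BY c_month;"""
      elif c_tnm == "rf_mon_history":
        nx_q = f"SELECT month, avg FROM {c_tnm}"
      else:
        continue
      print("\n", self.qry_pd(nx_q))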

Process Row of Data

I thought this step needed a function/method or two, rather than coding it all in the file processing loop. But, because I wasn’t too sure how to go about that at this point, I just coded everything in the loop iterating over the generator. There was lots of debugging.

At first I forgot to parse the data returned by the generator. Then I mishandled the returned data. That took me a minute or three to sort out.

Another issue was rows where part of the date or the rainfall amount had question marks, e.g. 2018.02.18 24:00: ???mm. Month total: 117.5mm (snow fall). And I was not correctly handling the situation when the generator was done/empty. I was also catching a bad row at the bottom of the file that was commented out (PHP multi-line comment): <br>2019.??.01 24:00: ?.0mm. Month total: ?.0mm. It was contained in the block of code I copied to start a new month of data recording.
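
For context, parse_rg_row (from utils.rg_data) returns a regex match with the date, time and rainfall amount as groups 1 to 3, or None if the row doesn’t parse. I am not showing the real function here, but a rough, hypothetical version of the kind of pattern involved would be:

import re

# hypothetical pattern only -- the real one in utils/rg_data.py differs in its details
rg_row_pat = re.compile(
  r"(\d{4}\.[\d?]{2}\.[\d?]{2})\s+"    # group 1: date, e.g. 2018.02.18 (may contain ??)
  r"([\d?]{1,2}:[\d?]{2}|am|pm):\s*"   # group 2: time of the reading, or am/pm
  r"([\d?]+(?:\.\d+)?)mm"              # group 3: the day's rainfall amount (may contain ?)
)

rw = "<br>2018.02.18 24:00: ???mm. Month total: 117.5mm (snow fall)"
mtch = rg_row_pat.search(rw)
if mtch:
  print(mtch.group(1), mtch.group(2), mtch.group(3))   # 2018.02.18 24:00 ???

With an amount of ???, a row like that then gets skipped by the "??" check in the processing loop below.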

Some bugs in the class modules as well. I will let you sort those out.

I ended up adding a bunch of checks to the code in the loop. Refactoring is definitely required. I won’t go through all the debugging details. Just going to present the code that eventually worked.

  if do_rg_2019:
    chk_i_rf = True
    # I put the file processing loop in an if block for debugging purposes
    proc_fl = True

    # set up generator for first data file
    gen_d = get_rg_data(d_srcs[0])
    e_yr = d_srcs[0].name[10:14]

    # instantiate dates done class using non-existent csv file
    cwd = Path(__file__).cwd()
    fl_pth = cwd/"data"
    fl_nm = "dates_done.csv"
    dd_pth = fl_pth/fl_nm
    dts_dn = Dates_done(dd_pth)
    # probably should add __repr__ or __str__ to both classes
    # print(f"\ndts_dn:\n\tf_pth: {dts_dn.fl}, c_init: {dts_dn.c_init}\n\tdts: {dts_dn.dts}")

    # instantiate database class using empty database file
    cwd = Path(__file__).cwd()
    fl_pth = cwd/"data"
    fl_nm = "weather.db"
    db_pth = fl_pth/fl_nm
    rfall = Weather_db(db_pth)
    # print(dts_dn, rfall)

    if proc_fl:
      # let's process the data in the first rainGauge file month by month
      # add to database, record dates done
      # e_mn was for ending early (after a few months) during testing, but is not used below
      p_mn, c_mn, e_mn = "", "", "2014.07"  # prev mon, curr mon, end mon
      rf_rws, t_mn, m_dts = [], 0, {}       # rainfall rows, tot mon, mon dates
      s_tm = time.perf_counter()
      e_dt = f"{e_yr}.??.01"

      for d_rw in gen_d:
        # is generator done
        gen_dn = e_dt in d_rw
        # parse data row
        mtch = parse_rg_row(d_rw)
        if not mtch:
          print(f"{d_rw} => {mtch}")
          if not gen_dn:
            continue
        else:
          d_dt, d_tm, d_rf = mtch.group(1), mtch.group(2), mtch.group(3)
        # just starting, set curr mon (yyyy.mm)
        if not c_mn:
          c_mn = d_dt[:7]
          m_dts[c_mn] = []
          print(f"init c_mn: {c_mn}")
        # if starting new month, update rainfall table and dates done,
        # set/reset variables accordingly
        if d_dt[:7] != c_mn or gen_dn:
          num_rws = rfall.add_rainfall(rf_rws)
          print(f"rows added to rainfall table: {num_rws}")
          print(f"adding {c_mn} to dates done dict: {m_dts}")
          dts_dn.add_month(m_dts)

          rf_rws, t_mn = [], 0
          p_mn, c_mn = c_mn, d_dt[:7]
          m_dts = {c_mn: []}
          print(f"\ncurrent: p_mn {p_mn}, c_mn {c_mn}")

        # otherwise process current data and add to list for the month
        if not gen_dn and "??" not in d_dt and "??" not in d_rf:
          match(d_tm):
            case "pm":
              dttm = f"{d_dt} 16:00"
            case "am":
              dttm = f"{d_dt} 08:00"
            case _:
              dttm = f"{d_dt} {d_tm}"
          try:
            drf = float(d_rf)
          except ValueError:
            raise ValueError(f"could not convert string to float: {d_rf}")
          t_mn += drf
          rf_rws.append((dttm, drf, t_mn))
          m_dts[c_mn].append(d_dt[-2:])

      # update dates done csv file
      dts_dn.updt_csv_file()

      e_tm = time.perf_counter()
      print(f"run time: {(e_tm - s_tm):.4f} secs")

    if chk_i_rf:
      for c_tnm in rfall.tnms.values():
        rslt = rfall.get_tbl_rw_cnt(c_tnm)
        print(f"\nrows in {c_tnm} table: {rslt}")

        if c_tnm == "rainfall":
          nx_q = f"""SELECT substr(datetime, 1, 7) as c_month, MIN(daily) as minMonth, MAX(daily) as maxMonth
            FROM {c_tnm}
            GROUP BY c_month;"""
        elif c_tnm == "rf_mon_history":
          nx_q = f"SELECT month, avg FROM {c_tnm}"
        mn_data = rfall.qry_pd(nx_q)
        print("\n", mn_data)

And a lot of crap in the terminal, most of which I have cut out.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py
init c_mn: 2014.03

rows added to rainfall table: 7
adding 2014.03 to dates done dict: {'2014.03': ['09', '17', '20', '23', '26', '29', '30']}

current: p_mn 2014.03, c_mn 2014.04

rows added to rainfall table: 9
adding 2014.04 to dates done dict: {'2014.04': ['04', '06', '07', '09', '18', '20', '23', '25', '29']}

current: p_mn 2014.04, c_mn 2014.05

rows added to rainfall table: 6
adding 2014.05 to dates done dict: {'2014.05': ['05', '11', '19', '23', '28', '29']}

current: p_mn 2014.05, c_mn 2014.06

rows added to rainfall table: 6
adding 2014.06 to dates done dict: {'2014.06': ['12', '13', '17', '20', '24', '30']}

current: p_mn 2014.06, c_mn 2014.07

rows added to rainfall table: 2
adding 2014.07 to dates done dict: {'2014.07': ['21', '25']}

... ...

current: p_mn 2019.09, c_mn 2019.10

rows added to rainfall table: 14
adding 2019.10 to dates done dict: {'2019.10': ['02', '03', '04', '07', '08', '12', '15', '16', '17', '18', '19', '20', '21', '25']}

current: p_mn 2019.10, c_mn 2019.11

rows added to rainfall table: 12
adding 2019.11 to dates done dict: {'2019.11': ['09', '10', '11', '12', '13', '15', '16', '17', '18', '19', '23', '25']}

current: p_mn 2019.11, c_mn 2019.12
<br>2019.??.01 24:00: ?.0mm. Month total: ?.0mm => None

rows added to rainfall table: 27
adding 2019.12 to dates done dict: {'2019.12': ['01', '02', '03', '04', '06', '07', '08', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '23', '24', '25', '26', '27', '28', '29', '30', '31']}

current: p_mn 2019.12, c_mn 2019.12
run time: 3.8223 secs

rows in rainfall table: 833

     c_month  minMonth  maxMonth
0   2014.03       3.0      54.0
1   2014.04       1.5      53.0
2   2014.05       5.0      54.0
3   2014.06       2.0      14.5
4   2014.07       3.0      23.0
..      ...       ...       ...
64  2019.08       0.3      24.9
65  2019.09       0.3      29.2
66  2019.10       0.3      49.5
67  2019.11       0.5      48.0
68  2019.12       0.3      70.6

[69 rows x 3 columns]

rows in rf_mon_history table: 0

 Empty DataFrame
Columns: [month, avg]
Index: []

But it appears to have worked. I didn’t manually verify that it in fact did. But the minimum and maximum for 2019.12 are correct.

Note: I realized after completing the post that I was not in any way checking to make sure I do not attempt to add duplicate data to the rainfall table. I will need to fix that before I try adding the data in the next file.
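
One option (just a sketch, assuming the rainfall table has the datetime, daily and monthly columns used in the queries above) would be a unique index on the reading timestamp combined with SQLite’s INSERT OR IGNORE. In practice the change would presumably go into Weather_db.add_rainfall rather than standalone code like this:

import sqlite3

# sketch: one possible guard against duplicate rainfall rows
rf_rws = [("2014.03.09 16:00", 0.0, 0.0)]  # placeholder for the (dttm, drf, t_mn) tuples from the loop

conn = sqlite3.connect("data/weather.db")
# a unique index on the timestamp means re-processing a file can't add the
# same reading twice (would not suit legitimate rows sharing a date and time)
conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_rf_dttm ON rainfall (datetime);")
# INSERT OR IGNORE silently skips rows that would violate the unique index
conn.executemany(
  "INSERT OR IGNORE INTO rainfall (datetime, daily, monthly) VALUES (?, ?, ?);",
  rf_rws)
conn.commit()
conn.close()

Another option would be to use the dates recorded in the Dates_done csv file to skip months that have already been processed.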

Historical Statistics

Okay, let’s see if we can figure out how to get the average, minimum and maximum rainfall for each of the twelve months in aggregate over the available years of data. That is, for example, we want those statistics calculated over the monthly totals for March from 2014 to 2019. The monthly total rainfall is the maximum value in the month-to-date rainfall column for each month and year combination.

SQL Query to Get the Statistics

I plan to use a nested subquery to get the data I am after. If possible I will use it to populate the historical data table. But, I am going to start with the query that gets the monthly total for each year/month combination in the database. Should be easy enough.

  if do_rg_2019:
    chk_i_rf = True
    proc_fl = False
    mk_hist = True
... ...
    if mk_hist:
      chk_i_rf = False
      nx_q = f"""SELECT substr(datetime, 1, 4) as year, substr(datetime, 6, 2) as month, MAX(monthly) as totMonth
        FROM rainfall
        GROUP BY year, month;"""
      mn_data = rfall.qry_pd(nx_q)
      print("\n", mn_data)

In the terminal I got the following.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

     year month  totMonth
0   2014    03    174.25
1   2014    04    116.25
2   2014    05    143.00
3   2014    06     41.50
4   2014    07     26.00
..   ...   ...       ...
64  2019    08     34.60
65  2019    09    191.20
66  2019    10    155.00
67  2019    11    121.10
68  2019    12    302.60

Which looks like what we are after. And those values match those I have in a spreadsheet.

That nested query took a bit of time to sort out. For the main query, I wanted to use the datetime column from the nested query to select the rows fed to the aggregate functions (average, minimum and maximum). That meant an additional subquery, because the WHERE clause selecting the appropriate rows expects a single column from the subquery it uses.

So, I added the datetime column to the SELECT of the above query. Then I nested the three queries to get the data I wanted. Note that the GROUP BY clause for the primary query follows the subqueries. After much debugging, here’s the final query.

      nx_q = f"""SELECT substr(datetime, 6, 2) as mon, AVG(monthly) as avg, MIN(monthly) as min,
          MAX(monthly) as max
        FROM rainfall
        WHERE datetime IN
        (SELECT datetime FROM
        (SELECT datetime, substr(datetime, 1, 4) as year, substr(datetime, 6, 2) as month,
          MAX(monthly) as totMonth
        FROM rainfall
        GROUP BY year, month))
        GROUP BY mon;"""
      mn_data = rfall.qry_pd(nx_q)
      print("\n", mn_data)

In the terminal I got the following.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

    mon         avg    min    max
0   01  217.950000   99.9  319.1
1   02  143.100000  110.1  202.5
2   03  198.666667   52.5  387.5
3   04  128.766667   35.5  210.0
4   05   71.421667   11.2  143.0
5   06   46.900000    5.5   81.9
6   07   26.716667    2.8   58.9
7   08   28.233333    7.1   78.5
8   09  115.383333   54.0  191.2
9   10  179.841667  121.1  260.5
10  11  275.591667  121.1  350.0
11  12  248.650000  169.3  339.0

Manually comparing that to data in the aforementioned spreadsheet, the min and max columns match that data. I have not checked the avg column. But, given the others are correct, I am going to assume it is also correct. Well, I did in fact eventually check a couple of them.

Done

I think I am done with this post. I have gotten a fair bit of code sorted, written and debugged over the last day or two. So, I think I will leave getting that table populated until the next post. And figure out where I am going next. Likely adding more data from the rain gauge files. Then tackle the daily notes files (a whole new set of surprises I am sure).

I have again been enjoying this endeavour. I am finding it somewhat challenging. I am content to take my time and resolve bugs. I am willing to experiment until I achieve a goal (interim or otherwise). A touch surprised with my calmness, but that’s a good thing.

Until next time, may you find serenity when tackling your coding challenges.

Follow-up

After tentatively completing this post, I checked some of the monthly averages. They didn’t add up. So I coded a query to give the total monthly rainfall for each month by year. For example:

            datetime  year month  totMonth
0  2015.01.25 08:45  2015    01    253.00
1  2016.01.30 09:28  2016    01    195.25
2  2017.01.30 09:26  2017    01     99.90
3  2018.01.31 24:00  2018    01    319.10
4  2019.01.31 24:00  2019    01    202.50
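
A sketch of that query (essentially the year/month grouping query from earlier with the datetime column added; the WHERE clause is only there to reproduce the January sample above):

      nx_q = """SELECT datetime, substr(datetime, 1, 4) as year,
          substr(datetime, 6, 2) as month, MAX(monthly) as totMonth
        FROM rainfall
        WHERE substr(datetime, 6, 2) = '01'
        GROUP BY year, month;"""
      mn_data = rfall.qry_pd(nx_q)
      print("\n", mn_data)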

In my spreadsheet, the value for 2016.01 was 215.25. There were other similar disparities for some of the months. So I started looking at what was in the rain gauge data file, comparing that to what was in the spreadsheet. Yes, I was duplicating effort/data/noise/trouble/etc. It turns out I had failed to put data for 2016.01.31 into the rain gauge file that I did have in the spreadsheet.

In at least one case I had two entries for the same date with differing times, where the second included the amount from the first. Okay for me, not so good for simple parsing code. In another case I had two entries for the same date and time, both of which should have been included in the table; only one was recorded. In yet another, I had a data row quoted out, at least as far as PHP was concerned. My parsing loop processed it, since it knows nothing about quotes being used to hide data. In another case I had the row <br>2016.12.31 09:54 &amp; 24:00: 0.5mm + 11.4mm. Month total: 170.9mm in the data file. The parser was only getting the 0.5mm amount for that day’s rainfall.
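
If I wanted the processing loop to cope with a compound row like that last one, rather than editing the data (which, as noted below, is what I did), one option would be something like this sketch: grab every amount ahead of the month total and sum them.

import re

# sketch: sum every "Nmm" amount that appears before "Month total"
rw = "<br>2016.12.31 09:54 &amp; 24:00: 0.5mm + 11.4mm. Month total: 170.9mm"
day_part = rw.split("Month total")[0]
amounts = [float(a) for a in re.findall(r"(\d+(?:\.\d+)?)mm", day_part)]
print(round(sum(amounts), 1))   # 11.9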

I edited all the iffy bits. In fact I added “??” to the dates of rows I wanted ignored, taking advantage of one of the checks in the processing loop. I then reran all the bits of development/test code to regenerate the data shown in the main post content above. That did take me a bit of time. I expect I will have to go through the other two files looking for similar problems. Fortunately that will only be about 15 months of data or so.