Okay, before processing any more data, I want to figure out how to use the query we developed last post to populate the historical monthly rainfall table. I do believe an executemany insert should work just fine.
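As a refresher, the executemany pattern is just one parameterized INSERT fed a list of tuples. A minimal sketch, using an in-memory database and a made-up table and rows (not the actual Weather_db code):

```python
import sqlite3

# minimal sketch of the executemany pattern: one parameterized INSERT,
# a list of tuples as the row data (in-memory DB, made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mon_hist (month TEXT, avg REAL, min REAL, max REAL)")
rows = [("01", 217.95, 99.9, 319.1), ("02", 143.1, 110.1, 202.5)]
conn.executemany("INSERT INTO mon_hist VALUES (?, ?, ?, ?)", rows)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM mon_hist").fetchone()[0]
print(count)  # 2
conn.close()
```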

Populate Monthly History Table

Okay, a few steps here. You may not have expected this first one.

New Database Manipulation Method

I am going to delete the current contents of the table before adding the current historical data, since I will regenerate the full data set each time I wish to update the history table. That seems easier than figuring out which rows to update and which to leave alone. Simplicity over speed.

But for that I will need a new method in the Weather_db class. I didn’t want to refactor qry_nsrt, and qry_exec doesn’t force a transaction commit. I need that commit to make sure the table contents are deleted before any new data is added. So, a new method: qry_dlt. Not sure that is the most efficient approach, but as things stand, that’s how I am going to go about it.

... ...
def qry_dlt(self, qry):
    """Execute a DELETE query.

    params:
      qry: string with the appropriate query
    
    returns: nothing
    """
    conn = sqlite3.connect(self.db_pth)
    curs = conn.cursor()

    curs.execute(qry)
    conn.commit()

    curs.close()
    conn.close()
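For what it’s worth, qry_dlt could also lean on sqlite3’s context-manager support, where the with-block handles the commit (or rolls back on an exception). A sketch of that alternative, against a throwaway database file and table rather than the actual Weather_db class:

```python
import os
import sqlite3
import tempfile

def qry_dlt_ctx(db_pth, qry):
    """Sketch: execute a DELETE using the connection as a context manager.

    The with-block commits on success and rolls back on an exception;
    the connection itself still needs an explicit close().
    """
    conn = sqlite3.connect(db_pth)
    try:
        with conn:  # implicit commit (or rollback on error)
            conn.execute(qry)
    finally:
        conn.close()

# quick check against a throwaway database file
fd, db_pth = tempfile.mkstemp(suffix=".db")
os.close(fd)
conn = sqlite3.connect(db_pth)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
conn.close()

qry_dlt_ctx(db_pth, "DELETE FROM t")

conn = sqlite3.connect(db_pth)
remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
conn.close()
os.remove(db_pth)
print(remaining)  # 0
```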

Test Code

For now I am going to add this to data2db.py. But eventually it will be another method in the Weather_db class, as is fitting. I will execute the nested query and get the results back as a list of tuples. I will use that list as the row data for an executemany insert transaction.

Let’s start with a look at the list returned when executing that nested query we used to get the data we want.

... ...
    if mk_hist:
      chk_i_rf = False
      prn_all_mons = False
      prn_by_mon = False
      prn_12_mons = False
      do_pop_tbl = True
... ...
      if do_pop_tbl:
        rslt = rfall.qry_exec(nx_q)
        print(rslt)

And:

(dbd-3.13) PS R:\learn\dashboard> python data2db.py
[('01', 217.95, 99.9, 319.1), ('02', 143.1, 110.09999999999998, 202.5), ('03', 198.66666666666666, 52.49999999999999, 387.5), ('04', 128.76666666666668, 35.5, 210.0), ('05', 71.42166666666667, 11.2, 143.0), ('06', 46.9, 5.5, 81.89999999999999), ('07', 26.71666666666667, 2.8, 58.900000000000006), ('08', 28.233333333333334, 7.1, 78.5), ('09', 115.38333333333333, 54.0, 191.20000000000002), ('10', 179.84166666666667, 121.1, 260.5), ('11', 275.59166666666664, 121.10000000000001, 350.0), ('12', 248.65, 169.3, 339.0)]

Okay now let’s delete the data in the history table.

... ...
      if do_pop_tbl:
        rslt = rfall.qry_exec(nx_q)
        # print(rslt)

        c_tnm = rfall.tnms['mh_tnm']
       
        # empty table, will be full data population
        print(f"\nDeleting all rows from table {c_tnm}")
        dlt_hist = f"DELETE FROM {c_tnm}"
        rfall.qry_dlt(dlt_hist)
        mn_data = rfall.qry_pd(f"SELECT * FROM {c_tnm}")
        print("\n", mn_data)

In the terminal I got the following.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

Deleting all rows from table rf_mon_history

 Empty DataFrame
Columns: [row_id, month, avg, min, max]
Index: []

Not that that tells us much. The table was empty to begin with. But no errors generated.

Okay, let’s add the code to insert the nested query data into the table. Pretty straightforward, but a few extra print statements during development.

... ...
        # insert all rows from result of query above
        print(f"\nInserting monthly rainfall history into table {c_tnm}")
        in_hist_tbl = f"""INSERT INTO {c_tnm} VALUES (NULL, ?, ?, ?, ?);"""
        rfall.qry_nsrt(in_hist_tbl, rslt, is_many=True)
        c_rws = rfall.get_tbl_rw_cnt(c_tnm)
        print(f"{c_rws} rows added to table {c_tnm}")

        mn_data = rfall.qry_pd(f"SELECT * FROM {c_tnm}")
        print("\n", mn_data)

And, in the terminal (including the data deletion) I got the following.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

Deleting all rows from table rf_mon_history

 Empty DataFrame
Columns: [row_id, month, avg, min, max]
Index: []

Inserting monthly rainfall history into table rf_mon_history
12 rows added to table rf_mon_history

     row_id month         avg    min    max
0        1    01  217.950000   99.9  319.1
1        2    02  143.100000  110.1  202.5
2        3    03  198.666667   52.5  387.5
3        4    04  128.766667   35.5  210.0
4        5    05   71.421667   11.2  143.0
5        6    06   46.900000    5.5   81.9
6        7    07   26.716667    2.8   58.9
7        8    08   28.233333    7.1   78.5
8        9    09  115.383333   54.0  191.2
9       10    10  179.841667  121.1  260.5
10      11    11  275.591667  121.1  350.0
11      12    12  248.650000  169.3  339.0

And, if we run the code again, we get the following in the terminal. Look familiar?

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

Deleting all rows from table rf_mon_history

 Empty DataFrame
Columns: [row_id, month, avg, min, max]
Index: []

Inserting monthly rainfall history into table rf_mon_history
12 rows added to table rf_mon_history

     row_id month         avg    min    max
0        1    01  217.950000   99.9  319.1
1        2    02  143.100000  110.1  202.5
2        3    03  198.666667   52.5  387.5
3        4    04  128.766667   35.5  210.0
4        5    05   71.421667   11.2  143.0
5        6    06   46.900000    5.5   81.9
6        7    07   26.716667    2.8   58.9
7        8    08   28.233333    7.1   78.5
8        9    09  115.383333   54.0  191.2
9       10    10  179.841667  121.1  260.5
10      11    11  275.591667  121.1  350.0
11      12    12  248.650000  169.3  339.0
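By the way, that NULL in the INSERT’s VALUES clause is deliberate: inserting NULL into an INTEGER PRIMARY KEY column tells SQLite to assign the next rowid itself, which is how row_id gets filled in. A quick sketch with a throwaway in-memory table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hist (row_id INTEGER PRIMARY KEY, month TEXT)")
# NULL in the first slot lets SQLite assign the next rowid itself
conn.executemany("INSERT INTO hist VALUES (NULL, ?)", [("01",), ("02",)])
conn.commit()
added = conn.execute("SELECT row_id, month FROM hist").fetchall()
print(added)  # [(1, '01'), (2, '02')]
conn.close()
```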

Convert Test Code to Class Method

Okay, let’s get the necessary parts of the above code into a new class method, updt_hist_tbl(). Mostly copy and paste. Change all the references to rfall to self, eliminate the print statements and return the number of rows added to the table (a safety check).

  def updt_hist_tbl(self):
    """Update history table based on current state of rainfall table,
       after deleting current contents of history table.

    params:
      None
    
    returns:
      count of rows added to history table
    """
    c_tnm = self.tnms['mh_tnm']

    # get current historical data from rainfall table
    nx_q = f"""SELECT substr(datetime, 6, 2) as mon, AVG(monthly) as avg, MIN(monthly) as min, MAX(monthly) as max
    FROM rainfall
    WHERE datetime IN
    (SELECT datetime FROM
    (SELECT datetime, substr(datetime, 1, 4) as year, substr(datetime, 6, 2) as month, MAX(monthly) as totMonth
    FROM rainfall
    GROUP BY year, month))
    GROUP BY mon;"""
    rslt = self.qry_exec(nx_q)
    
    # empty table, will be full data population
    dlt_hist = f"DELETE FROM {c_tnm}"
    self.qry_dlt(dlt_hist)
    # insert all rows from result of query above
    in_hist_tbl = f"""INSERT INTO {c_tnm} VALUES (NULL, ?, ?, ?, ?);"""
    self.qry_nsrt(in_hist_tbl, rslt, is_many=True)

    c_rws = self.get_tbl_rw_cnt(c_tnm)
    return c_rws
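For anyone puzzling over that nested query, here is roughly the same logic in plain Python with made-up numbers: the inner SELECT keeps the maximum running total per (year, month), and the outer one aggregates those totals per calendar month.

```python
from collections import defaultdict

# made-up (datetime, monthly running total) rows spanning two Januaries;
# the numbers are illustrative only
rows = [
    ("2014.01.10 08:00", 50.0), ("2014.01.31 16:00", 120.0),
    ("2015.01.09 08:00", 30.0), ("2015.01.30 16:00", 80.0),
]

# inner query: MAX(monthly) per (year, month)
tot_month = defaultdict(float)
for dt, monthly in rows:
    yr_mn = (dt[:4], dt[5:7])
    tot_month[yr_mn] = max(tot_month[yr_mn], monthly)

# outer query: AVG/MIN/MAX of those totals per calendar month
by_mon = defaultdict(list)
for (_, mon), tot in tot_month.items():
    by_mon[mon].append(tot)

hist = [(mon, sum(v) / len(v), min(v), max(v))
        for mon, v in sorted(by_mon.items())]
print(hist)  # [('01', 100.0, 80.0, 120.0)]
```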

Quick test in data2db.py.

... ...
    if mk_hist:
      chk_i_rf = False      # run simple, local database test code
      prn_all_mons = False  # get all monthly total rainfall data in database by individual rows
      prn_by_mon = False    # get all monthly total rainfall data in database by month
      prn_12_mons = False   # run local version of code to get the historical monthly data
      do_pop_tbl = False    # use local code to update/populate monthly rainfall history table
      tst_pop_tbl = True    # use Weather_db.updt_hist_tbl() to update/populate monthly rainfall history table

      r_tnm = rfall.tnms['rf_tnm']
      h_tnm = rfall.tnms['mh_tnm']
... ...
      if tst_pop_tbl:
        r_cnt = rfall.updt_hist_tbl()
        print(f"added {r_cnt} rows to monthly history table")

        mn_data = rfall.qry_pd(f"SELECT * FROM {h_tnm}")
        print("\n", mn_data)

And in the terminal:

(dbd-3.13) PS R:\learn\dashboard> python data2db.py
added 12 rows to monthly history table

     row_id month         avg    min    max
0        1    01  217.950000   99.9  319.1
1        2    02  143.100000  110.1  202.5
2        3    03  198.666667   52.5  387.5
3        4    04  128.766667   35.5  210.0
4        5    05   71.421667   11.2  143.0
5        6    06   46.900000    5.5   81.9
6        7    07   26.716667    2.8   58.9
7        8    08   28.233333    7.1   78.5
8        9    09  115.383333   54.0  191.2
9       10    10  179.841667  121.1  260.5
10      11    11  275.591667  121.1  350.0
11      12    12  248.650000  169.3  339.0

That looks exactly like what we got earlier (a couple of times).

Process All Rainfall Files

I have been thinking about the rainfall data files. I expect in the real world that they would not be processed independently. Nor do I expect the people involved would allow duplicate data that had to somehow be handled in the code processing the files. M’thinks that one or more people would manually process the data, combining it into one data source and eliminating the duplicates. So that’s what I am going to do.

Also, the spreadsheet I mentioned before has data for 2021 that is missing from the rainfall PHP files. So, I am going to add that to the amalgamated data file, either manually or programmatically. If the latter, I will likely save the data to a CSV file, then write some code to add it, in the correct format, to the data file created by combining, without duplicates, the current contents of the 3 files.

I will then create a new empty database and run all the code necessary to add the full data file to the database. Including populating the history table. I will also generate the dates done CSV file as I expect I may have use for it in future.

Amalgamate Data from 3 Files

It took me about an hour to manually create the file. The bulk of that being manually entering the data for May-December 2021. Might have been quicker if I coded something. But likely not a lot quicker, if at all.

Refactored get_rg_dsrcs() in rg_data.py to include the file, “rainGauge_2019-2021.inc.php”, in the list it returns (index 3).

New Database, New Processing Loop

I will again (re-)code the processing loop in a new if block, in data2db.py. But I may try refactoring it to avoid some of the checks I am making. We will see if that works or not.

Initiation Code

Here’s the setup for the new if block. Pretty much copy and paste.

  if do_rg_2021:
    chk_i_rf = True
    proc_fl = True
    mk_hist = True

    # instantiate dates done class using non-existent csv file
    cwd = Path(__file__).cwd()
    fl_pth = cwd/"data"
    fl_nm = "dates_done.csv"
    dd_pth = fl_pth/fl_nm
    dts_dn = Dates_done(dd_pth)
    # probably should add __repr__ or __str__ to both classes
    print(f"\ndts_dn:\n\tf_pth: {dts_dn.fl}, c_init: {dts_dn.c_init}\n\tdts: {dts_dn.dts}")

    # instantiate database class using empty database file
    cwd = Path(__file__).cwd()
    fl_pth = cwd/"data"
    fl_nm = "weather.db"
    db_pth = fl_pth/fl_nm
    rfall = Weather_db(db_pth)
    print(dts_dn, rfall)
 
    if proc_fl:
      print("\nproc_fl loop still to be coded")
      print(f"\tfile to be processed is {d_srcs[3]}")

      # set up generator for combined data file
      gen_d = get_rg_data(d_srcs[3])
      e_yr = "2021"
      # print(f"e_yr: {e_yr}")

    if mk_hist:
      print("\nmk_hist loop still to be coded")

And, in the terminal I got the following. Well, it’s not like I didn’t expect to get that output.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py

dts_dn:
        f_pth: R:\learn\dashboard\data\dates_done.csv, c_init: {}
        dts: {}
<utils.dates_dn.Dates_done object at 0x000002B007AA7770> <utils.weather_db.Weather_db object at 0x000002B00CB9FB60>

proc_fl loop still to be coded
        file to be processed is F:\BaRKqgs\home\binc\rainGauge_2019-2021.inc.php

mk_hist loop still to be coded

database table check


rows in rainfall table: 0

 Empty DataFrame
Columns: [c_month, minMonth, maxMonth]
Index: []

rows in rf_mon_history table: 0

 Empty DataFrame
Columns: [month, avg]
Index: []
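About that __repr__/__str__ comment in the setup code: a minimal sketch of what one might look like for Dates_done, using only the attribute names visible in the print above (the real class naturally has more to it than this stand-in):

```python
class Dates_done:
    """Stripped-down stand-in showing only a possible __repr__."""

    def __init__(self, fl):
        self.fl = fl       # path to the dates done CSV file
        self.c_init = {}   # dates loaded at instantiation
        self.dts = {}      # current dates done

    def __repr__(self):
        return (f"{type(self).__name__}(fl={self.fl!r}, "
                f"c_init={self.c_init!r}, dts={self.dts!r})")

dd = Dates_done("data/dates_done.csv")
print(dd)  # Dates_done(fl='data/dates_done.csv', c_init={}, dts={})
```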

File Processing Loop

Okay, I have run this loop a few times. Whenever the results didn’t at least closely match those in the spreadsheet, I went back and corrected the data in the amalgamated rainfall data file. Eventually I said enough is enough. But that doesn’t mean I won’t go back later and try to get results matching more precisely.

The loop is pretty much a repetition of previous code. But here it is anyway.
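One piece not shown in this post is parse_rg_row(), which lives in rg_data.py. Purely as illustration, here is a hypothetical stand-in consistent with the row format visible in the terminal output further down (the real pattern may well differ):

```python
import re

# Hypothetical stand-in for parse_rg_row(); assumed row shape, taken
# from the terminal output:
#   "<br>YYYY.MM.DD hh:mm: N.Nmm. Month total: N.Nmm"
ROW_PAT = re.compile(
    r"<br>(\d{4}\.\d{2}\.\d{2})\s+"  # group 1: date
    r"([ap]m|\d{1,2}:\d{2}):\s+"     # group 2: time, or an am/pm marker
    r"([\d.]+)mm\. Month total:"     # group 3: daily rainfall
)

def parse_rg_row(d_rw):
    """Return a match object for a well-formed row, else None."""
    return ROW_PAT.match(d_rw)

good = parse_rg_row("<br>2014.03.05 am: 3.0mm. Month total: 3.0mm")
bad = parse_rg_row("<br>2021.??.01 08:00: ?.0mm. Month total: ?.0mm")
print(good.group(1), good.group(2), good.group(3), bad)
# 2014.03.05 am 3.0 None
```

Note that rows with "??" or "?" placeholders fail to match and come back as None, which lines up with the rows the loop prints and skips.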

    if proc_fl:
      # process the data in the amalgamated rainGauge file month by month
      # add daily rainfall to the database, record dates done
      p_mn, c_mn = "", ""               # prev mon, curr mon
      rf_rws, t_mn, m_dts = [], 0, {}   # rainfall rows, tot mon, mon dates

      # set up generator for combined data file
      gen_d = get_rg_data(d_srcs[3])
      e_yr = "2021"
      e_dt = f"{e_yr}.??.01"

      s_tm = time.perf_counter()
      for d_rw in gen_d:
        # parse data row
        mtch = parse_rg_row(d_rw)
        if not mtch:
          print(f"{d_rw} => {mtch}")
          continue
        else:
          d_dt, d_tm, d_rf = mtch.group(1), mtch.group(2), mtch.group(3)
        # just starting, set curr mon (yyyy.mm)
        if not c_mn:
          c_mn = d_dt[:7]
          m_dts[c_mn] = []
          print(f"init c_mn: {c_mn}")
        # if starting new month, update rainfall table and dates done,
        # set/reset variables accordingly
        if d_dt[:7] != c_mn:
          num_rws = rfall.add_rainfall(rf_rws)
          dts_dn.add_month(m_dts)

          rf_rws, t_mn = [], 0
          p_mn, c_mn = c_mn, d_dt[:7]
          m_dts = {c_mn: []}
        # otherwise process current data and add to list for the month
        if "??" not in d_dt and "??" not in d_rf:
          match(d_tm):
            case "pm":
              dttm = f"{d_dt} 16:00"
            case "am":
              dttm = f"{d_dt} 08:00"
            case _:
              dttm = f"{d_dt} {d_tm}"
          try:
            drf = float(d_rf)
          except ValueError:
            raise ValueError(f"could not convert string to float: {d_rf}")
          t_mn += drf
          rf_rws.append((dttm, drf, t_mn))
          m_dts[c_mn].append(d_dt[-2:])

      # if rf_rws not empty, updt rainfall table/dates done list
      if rf_rws:
        num_rws = rfall.add_rainfall(rf_rws)
        dts_dn.add_month(m_dts)
     
      # update dates done csv file
      dts_dn.updt_csv_file()

      e_tm = time.perf_counter()
      print(f"run time: {(e_tm - s_tm):.4f} secs")

    if mk_hist:
      h_tnm = rfall.tnms['mh_tnm']
      r_cnt = rfall.updt_hist_tbl()
      print(f"\nadded {r_cnt} rows to monthly history table ({h_tnm})")

      mn_data = rfall.qry_pd(f"SELECT * FROM {h_tnm}")
      print("\n", mn_data)

And, that code produced the following output in the terminal.

(dbd-3.13) PS R:\learn\dashboard> python data2db.py
init c_mn: 2014.03
<br>2016.12.31?? 09:54 &amp; 24:00: 0.5mm + 11.4mm. Month total: 170.9mm => None
<br>2017.03.27?? 08:31: 20.0mm. Month total: 296.0mm => None
<br>2018.03.03?? 24:00: 4.1mm. Month total: 12.0mm => None
<br>2018.10.30?? 24:00: 9.4mm. Month total: 94.7mm => None
<br>2018.10.30?? 24:00: 26.4mm. Month total: 121.1mm => None
<br>2021.??.01 08:00: ?.0mm. Month total: ?.0mm => None
run time: 5.3019 secs

added 12 rows to monthly history table (rf_mon_history)

     row_id month         avg    min     max
0        1    01  265.585714   99.9  460.45
1        2    02  143.600000  110.1  202.50
2        3    03  174.887500   52.5  387.50
3        4    04  106.150000   35.5  210.00
4        5    05   81.714444   11.2  143.00
5        6    06   51.850000    5.5   81.90
6        7    07   24.687500    0.0   58.90
7        8    08   35.433333    7.1   78.50
8        9    09  125.350000   54.0  195.20
9       10    10  181.191250  121.1  260.50
10      11    11  307.092500  121.1  541.59
11      12    12  259.481250  170.9  408.10

database table check


rows in rainfall table: 1226

     c_month  minMonth  maxMonth
0   2014.03       3.0      54.0
1   2014.04       1.5      53.0
2   2014.05       5.0      54.0
3   2014.06       2.0      14.5
4   2014.07       3.0      23.0
..      ...       ...       ...
89  2021.08       0.2      15.7
90  2021.09       0.3      81.5
91  2021.10       1.6      39.2
92  2021.11       0.3     109.7
93  2021.12       0.3      33.0

[94 rows x 3 columns]

rows in rf_mon_history table: 12

    month         avg
0     01  265.585714
1     02  143.600000
2     03  174.887500
3     04  106.150000
4     05   81.714444
5     06   51.850000
6     07   24.687500
7     08   35.433333
8     09  125.350000
9     10  181.191250
10    11  307.092500
11    12  259.481250

Comparing that against the spreadsheet, those values look pretty good. Good enough?

Done, I Do Believe

Since I have not yet used the plotly package I thought I might try plotting some of the data currently in the database. But, I think this post covered its subject matter rather nicely and I don’t want to add anything messing with that. So, perhaps a little charting in the next post.

Until then, I do believe a little focus never hurt any post, nor, probably, anyone.