Okay, before processing any more data, I want to figure out how to use the query we developed in the last post to populate the historical monthly rainfall table. I do believe an executemany insert should work just fine.
Populate Monthly History Table
Okay, a few steps here. You may not have expected this first one.
New Database Manipulation Method
I am going to delete the current contents of the table before I add the current historical data, since I will regenerate that data each time I wish to update the history table. That seems easier than figuring out which rows to update and which to leave alone. Simplicity over speed.
But for that I will need a new method in the Weather_db class. I didn't want to refactor qry_nsrt, and qry_exec doesn't force a transaction commit. I need that commit to make sure the table contents are deleted before any new data is added. So, a new method: qry_dlt. I am not sure that is the most efficient approach, but as things stand, that's how I am going to go about it.
... ...
  def qry_dlt(self, qry):
    """Execute a DELETE query.

    params:
      qry: string with the appropriate query
    returns: nothing
    """
    conn = sqlite3.connect(self.db_pth)
    curs = conn.cursor()
    curs.execute(qry)
    conn.commit()
    curs.close()
    conn.close()
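As an aside: the same commit-on-success behaviour can be had by using the connection as a context manager. A minimal alternative sketch, not what I actually used (note that the with block commits or rolls back, but does not close the connection, so we still close it explicitly):

  def qry_dlt(self, qry):
    conn = sqlite3.connect(self.db_pth)
    try:
      # implicit transaction: COMMIT on success, ROLLBACK on error
      with conn:
        conn.execute(qry)
    finally:
      conn.close()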
Test Code
For now I am going to add the test code to data2db.py. But, eventually, it will be another method in the Weather_db class, as is fitting. I will execute the nested query and get the results back as a list of tuples. I will use that list as the row data for an executemany insert transaction.
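For reference, the bare sqlite3 pattern that qry_nsrt presumably builds on looks something like the following (a minimal sketch with a made-up table name and data, not the class's actual code):

import sqlite3

conn = sqlite3.connect("weather.db")
rows = [("01", 217.95, 99.9, 319.1), ("02", 143.1, 110.1, 202.5)]
# one INSERT per tuple in rows; the NULL lets SQLite assign the row_id
conn.executemany("INSERT INTO some_table VALUES (NULL, ?, ?, ?, ?);", rows)
conn.commit()
conn.close()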
Let’s start with a look at the list returned when executing that nested query we used to get the data we want.
... ...
if mk_hist:
  chk_i_rf = False
  prn_all_mons = False
  prn_by_mon = False
  prn_12_mons = False
  do_pop_tbl = True
... ...
  if do_pop_tbl:
    rslt = rfall.qry_exec(nx_q)
    print(rslt)
And:
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
[('01', 217.95, 99.9, 319.1), ('02', 143.1, 110.09999999999998, 202.5), ('03', 198.66666666666666, 52.49999999999999, 387.5), ('04', 128.76666666666668, 35.5, 210.0), ('05', 71.42166666666667, 11.2, 143.0), ('06', 46.9, 5.5, 81.89999999999999), ('07', 26.71666666666667, 2.8, 58.900000000000006), ('08', 28.233333333333334, 7.1, 78.5), ('09', 115.38333333333333, 54.0, 191.20000000000002), ('10', 179.84166666666667, 121.1, 260.5), ('11', 275.59166666666664, 121.10000000000001, 350.0), ('12', 248.65, 169.3, 339.0)]
(The long decimals in some of those tuples are just binary floating point noise; pandas will display them rounded below.) Okay, now let’s delete the data in the history table.
... ...
  if do_pop_tbl:
    rslt = rfall.qry_exec(nx_q)
    # print(rslt)
    c_tnm = rfall.tnms['mh_tnm']
    # empty table, will be full data population
    print(f"\nDeleting all rows from table {c_tnm}")
    # no WHERE clause: removes every row in the table
    dlt_hist = f"DELETE FROM {c_tnm}"
    rfall.qry_dlt(dlt_hist)
    mn_data = rfall.qry_pd(f"SELECT * FROM {c_tnm}")
    print("\n", mn_data)
In the terminal I got the following.
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
Deleting all rows from table rf_mon_history
Empty DataFrame
Columns: [row_id, month, avg, min, max]
Index: []
Not that that tells us much; the table was empty to begin with. But no errors were generated.
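I haven't shown qry_pd in this post; it presumably just wraps pandas' read_sql_query. A minimal sketch of such a helper (an assumption on my part, not the class's actual code):

import sqlite3
import pandas as pd

def qry_pd(db_pth, qry):
  # run a SELECT and return the result as a pandas DataFrame
  conn = sqlite3.connect(db_pth)
  df = pd.read_sql_query(qry, conn)
  conn.close()
  return df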
Okay, let’s add the code to insert the nested query data into the table. Pretty straightforward, but with a few extra print statements during development.
... ...
    # insert all rows from result of query above
    print(f"\nInserting monthly rainfall history into table {c_tnm}")
    # NULL in the VALUES list lets SQLite assign the auto-increment row_id
    in_hist_tbl = f"""INSERT INTO {c_tnm} VALUES (NULL, ?, ?, ?, ?);"""
    rfall.qry_nsrt(in_hist_tbl, rslt, is_many=True)
    c_rws = rfall.get_tbl_rw_cnt(c_tnm)
    print(f"{c_rws} rows added to table {c_tnm}")
    mn_data = rfall.qry_pd(f"SELECT * FROM {c_tnm}")
    print("\n", mn_data)
And, in the terminal (including the data deletion) I got the following.
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
Deleting all rows from table rf_mon_history
Empty DataFrame
Columns: [row_id, month, avg, min, max]
Index: []
Inserting monthly rainfall history into table rf_mon_history
12 rows added to table rf_mon_history
row_id month avg min max
0 1 01 217.950000 99.9 319.1
1 2 02 143.100000 110.1 202.5
2 3 03 198.666667 52.5 387.5
3 4 04 128.766667 35.5 210.0
4 5 05 71.421667 11.2 143.0
5 6 06 46.900000 5.5 81.9
6 7 07 26.716667 2.8 58.9
7 8 08 28.233333 7.1 78.5
8 9 09 115.383333 54.0 191.2
9 10 10 179.841667 121.1 260.5
10 11 11 275.591667 121.1 350.0
11 12 12 248.650000 169.3 339.0
And, if we run the code again, we get the following in the terminal. Look familiar?
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
Deleting all rows from table rf_mon_history
Empty DataFrame
Columns: [row_id, month, avg, min, max]
Index: []
Inserting monthly rainfall history into table rf_mon_history
12 rows added to table rf_mon_history
row_id month avg min max
0 1 01 217.950000 99.9 319.1
1 2 02 143.100000 110.1 202.5
2 3 03 198.666667 52.5 387.5
3 4 04 128.766667 35.5 210.0
4 5 05 71.421667 11.2 143.0
5 6 06 46.900000 5.5 81.9
6 7 07 26.716667 2.8 58.9
7 8 08 28.233333 7.1 78.5
8 9 09 115.383333 54.0 191.2
9 10 10 179.841667 121.1 260.5
10 11 11 275.591667 121.1 350.0
11 12 12 248.650000 169.3 339.0
Convert Test Code to Class Method
Okay, let’s get the necessary parts of the above code into a new class method, updt_hist_tbl(). Mostly copy and paste: change all the references to rfall to self, eliminate the print statements, and return the number of rows added to the table (a safety check).
  def updt_hist_tbl(self):
    """Update history table based on current state of rainfall table,
    after deleting current contents of history table.

    params:
      None
    returns:
      count of rows added to history table
    """
    c_tnm = self.tnms['mh_tnm']
    # get current historical data from rainfall table
    nx_q = """SELECT substr(datetime, 6, 2) as mon, AVG(monthly) as avg, MIN(monthly) as min, MAX(monthly) as max
      FROM rainfall
      WHERE datetime IN
        (SELECT datetime FROM
          (SELECT datetime, substr(datetime, 1, 4) as year, substr(datetime, 6, 2) as month, MAX(monthly) as totMonth
          FROM rainfall
          GROUP BY year, month))
      GROUP BY mon;"""
    rslt = self.qry_exec(nx_q)
    # empty table, will be full data population
    dlt_hist = f"DELETE FROM {c_tnm}"
    self.qry_dlt(dlt_hist)
    # insert all rows from result of query above
    in_hist_tbl = f"""INSERT INTO {c_tnm} VALUES (NULL, ?, ?, ?, ?);"""
    self.qry_nsrt(in_hist_tbl, rslt, is_many=True)
    c_rws = self.get_tbl_rw_cnt(c_tnm)
    return c_rws
Quick test in data2db.py.
... ...
if mk_hist:
  chk_i_rf = False      # run simple, local database test code
  prn_all_mons = False  # get all monthly total rainfall data in database by individual rows
  prn_by_mon = False    # get all monthly total rainfall data in database by month
  prn_12_mons = False   # run local version of code to get the historical monthly data
  do_pop_tbl = False    # use local code to update/populate monthly rainfall history table
  tst_pop_tbl = True    # use Weather_db.updt_hist_tbl() to update/populate monthly rainfall history table
  r_tnm = rfall.tnms['rf_tnm']
  h_tnm = rfall.tnms['mh_tnm']
... ...
  if tst_pop_tbl:
    r_cnt = rfall.updt_hist_tbl()
    print(f"added {r_cnt} rows to monthly history table")
    mn_data = rfall.qry_pd(f"SELECT * FROM {h_tnm}")
    print("\n", mn_data)
And in the terminal:
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
added 12 rows to monthly history table
row_id month avg min max
0 1 01 217.950000 99.9 319.1
1 2 02 143.100000 110.1 202.5
2 3 03 198.666667 52.5 387.5
3 4 04 128.766667 35.5 210.0
4 5 05 71.421667 11.2 143.0
5 6 06 46.900000 5.5 81.9
6 7 07 26.716667 2.8 58.9
7 8 08 28.233333 7.1 78.5
8 9 09 115.383333 54.0 191.2
9 10 10 179.841667 121.1 260.5
10 11 11 275.591667 121.1 350.0
11 12 12 248.650000 169.3 339.0
That looks exactly like what we got earlier (a couple of times).
Process All Rainfall Files
I have been thinking about the rainfall data files. In the real world, I expect they would not be processed independently. Nor do I expect the people involved would allow duplicate data that had to somehow be handled in the code processing the files. M’thinks one or more people would manually process the data, combining it into one data source and eliminating the duplicates. So that’s what I am going to do.
Also, the spreadsheet I mentioned before has data for 2021 that is missing from the rainfall PHP files. So, I am going to add that to the amalgamated data file, either manually or programmatically. If the latter, I will likely save the data to a CSV file, then write some code to add it, in the correct format, to the data file created by combining the current contents of the 3 files (without duplicates).
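(As it turns out, I did the combining manually; see the next section. But a programmatic version would likely have centred on something like the following sketch. The output filename is made up; get_rg_data and the d_srcs list are the project's own, used below.)

# hypothetical sketch: combine the three source files, skipping duplicate rows
seen = set()
with open("rainGauge_combined.inc.php", "w", encoding="utf-8") as f_out:
  for src in d_srcs[:3]:
    for d_rw in get_rg_data(src):
      if d_rw not in seen:
        seen.add(d_rw)
        f_out.write(d_rw + "\n")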
I will then create a new, empty database and run all the code necessary to add the full data file to the database, including populating the history table. I will also generate the dates done CSV file, as I expect I may have use for it in the future.
Amalgamate Data from 3 Files
It took me about an hour to manually create the file, the bulk of that being manually entering the data for May-December 2021. It might have been quicker if I had coded something. But likely not a lot quicker, if at all.
I refactored get_rg_dsrcs() in rg_data.py to include the new file, “rainGauge_2019-2021.inc.php”, in the list it returns (index 3).
New Database, New Processing Loop
I will again (re-)code the processing loop in a new if block in data2db.py. But I may try refactoring it to avoid some of the checks I am making. We will see whether or not that works.
Initiation Code
Here’s the setup for the new if block. Pretty much copy and paste.
if do_rg_2021:
  chk_i_rf = True
  proc_fl = True
  mk_hist = True
  # instantiate dates done class using non-existent csv file
  # (Path.cwd() is a classmethod, so this is just the current working directory)
  cwd = Path(__file__).cwd()
  fl_pth = cwd/"data"
  fl_nm = "dates_done.csv"
  dd_pth = fl_pth/fl_nm
  dts_dn = Dates_done(dd_pth)
  # probably should add __repr__ or __str__ to both classes
  print(f"\ndts_dn:\n\tf_pth: {dts_dn.fl}, c_init: {dts_dn.c_init}\n\tdts: {dts_dn.dts}")
  # instantiate database class using empty database file
  cwd = Path(__file__).cwd()
  fl_pth = cwd/"data"
  fl_nm = "weather.db"
  db_pth = fl_pth/fl_nm
  rfall = Weather_db(db_pth)
  print(dts_dn, rfall)
  if proc_fl:
    print("\nproc_fl loop still to be coded")
    print(f"\tfile to be processed is {d_srcs[3]}")
    # set up generator for combined data file
    gen_d = get_rg_data(d_srcs[3])
    e_yr = "2021"
    # print(f"e_yr: {e_yr}")
  if mk_hist:
    print("\nmk_hist loop still to be coded")
And, in the terminal I got the following. Well, it’s not like I didn’t expect to get that output.
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
dts_dn:
    f_pth: R:\learn\dashboard\data\dates_done.csv, c_init: {}
    dts: {}
<utils.dates_dn.Dates_done object at 0x000002B007AA7770> <utils.weather_db.Weather_db object at 0x000002B00CB9FB60>
proc_fl loop still to be coded
file to be processed is F:\BaRKqgs\home\binc\rainGauge_2019-2021.inc.php
mk_hist loop still to be coded
database table check
rows in rainfall table: 0
Empty DataFrame
Columns: [c_month, minMonth, maxMonth]
Index: []
rows in rf_mon_history table: 0
Empty DataFrame
Columns: [month, avg]
Index: []
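Aside: as that comment in the setup code says, a __repr__ would make the object print line above a lot more useful. A minimal sketch for the Dates_done class, using the attribute names from the earlier print (otherwise an assumption on my part):

  def __repr__(self):
    # fl, c_init and dts are the attributes printed in the setup code
    return f"Dates_done(fl={self.fl!r}, c_init={self.c_init!r}, dts={self.dts!r})"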
File Processing Loop
Okay, I have run this loop a few times. When I got results that didn’t at least closely match those in the spreadsheet, I went back and corrected the data in the amalgamated rainfall data file. I eventually said enough is enough. But that really doesn’t mean I won’t go back and try to get the results to match more precisely.
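One note before the code: parse_rg_row comes from an earlier post. For a row like <br>2016.12.31 09:54: 0.5mm. Month total: 170.9mm it captures the date, time and rainfall amount. A guess at the general shape of it (not the project’s actual pattern):

import re

# hypothetical sketch of a parse_rg_row-style regex: capture date, time
# (which may also be "am" or "pm") and the rainfall amount
RG_ROW = re.compile(r"<br>(\d{4}\.\d{2}\.\d{2}) (\S+): ([\d.]+)mm")

def parse_rg_row(d_rw):
  # returns a match object on success, None on malformed rows
  return RG_ROW.search(d_rw)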
The loop is pretty much a repetition of previous code. But here it is anyway.
if proc_fl:
  # process the data in the amalgamated rainGauge file month by month,
  # add daily rainfall to database, record dates done
  p_mn, c_mn = "", ""              # prev mon, curr mon
  rf_rws, t_mn, m_dts = [], 0, {}  # rainfall rows, tot mon, mon dates
  # set up generator for combined data file
  gen_d = get_rg_data(d_srcs[3])
  e_yr = "2021"
  e_dt = f"{e_yr}.??.01"
  s_tm = time.perf_counter()
  for d_rw in gen_d:
    # parse data row
    mtch = parse_rg_row(d_rw)
    if not mtch:
      print(f"{d_rw} => {mtch}")
      continue
    else:
      d_dt, d_tm, d_rf = mtch.group(1), mtch.group(2), mtch.group(3)
    # just starting, set curr mon (yyyy.mm)
    if not c_mn:
      c_mn = d_dt[:7]
      m_dts[c_mn] = []
      print(f"init c_mn: {c_mn}")
    # if starting new month, update rainfall table and dates done,
    # set/reset variables accordingly
    if d_dt[:7] != c_mn:
      num_rws = rfall.add_rainfall(rf_rws)
      dts_dn.add_month(m_dts)
      rf_rws, t_mn = [], 0
      p_mn, c_mn = c_mn, d_dt[:7]
      m_dts = {c_mn: []}
    # otherwise process current data and add to list for the month
    if "??" not in d_dt and "??" not in d_rf:
      match d_tm:
        case "pm":
          dttm = f"{d_dt} 16:00"
        case "am":
          dttm = f"{d_dt} 08:00"
        case _:
          dttm = f"{d_dt} {d_tm}"
      try:
        drf = float(d_rf)
      except ValueError:
        raise ValueError(f"could not convert string to float: {d_rf}")
      t_mn += drf
      rf_rws.append((dttm, drf, t_mn))
      m_dts[c_mn].append(d_dt[-2:])
  # if rf_rws not empty, updt rainfall table/dates done list
  if rf_rws:
    num_rws = rfall.add_rainfall(rf_rws)
    dts_dn.add_month(m_dts)
  # update dates done csv file
  dts_dn.updt_csv_file()
  e_tm = time.perf_counter()
  print(f"run time: {(e_tm - s_tm):.4f} secs")

if mk_hist:
  h_tnm = rfall.tnms['mh_tnm']
  r_cnt = rfall.updt_hist_tbl()
  print(f"\nadded {r_cnt} rows to monthly history table ({h_tnm})")
  mn_data = rfall.qry_pd(f"SELECT * FROM {h_tnm}")
  print("\n", mn_data)
And, that code produced the following output in the terminal.
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
init c_mn: 2014.03
<br>2016.12.31?? 09:54 & 24:00: 0.5mm + 11.4mm. Month total: 170.9mm => None
<br>2017.03.27?? 08:31: 20.0mm. Month total: 296.0mm => None
<br>2018.03.03?? 24:00: 4.1mm. Month total: 12.0mm => None
<br>2018.10.30?? 24:00: 9.4mm. Month total: 94.7mm => None
<br>2018.10.30?? 24:00: 26.4mm. Month total: 121.1mm => None
<br>2021.??.01 08:00: ?.0mm. Month total: ?.0mm => None
run time: 5.3019 secs
added 12 rows to monthly history table (rf_mon_history)
row_id month avg min max
0 1 01 265.585714 99.9 460.45
1 2 02 143.600000 110.1 202.50
2 3 03 174.887500 52.5 387.50
3 4 04 106.150000 35.5 210.00
4 5 05 81.714444 11.2 143.00
5 6 06 51.850000 5.5 81.90
6 7 07 24.687500 0.0 58.90
7 8 08 35.433333 7.1 78.50
8 9 09 125.350000 54.0 195.20
9 10 10 181.191250 121.1 260.50
10 11 11 307.092500 121.1 541.59
11 12 12 259.481250 170.9 408.10
database table check
rows in rainfall table: 1226
c_month minMonth maxMonth
0 2014.03 3.0 54.0
1 2014.04 1.5 53.0
2 2014.05 5.0 54.0
3 2014.06 2.0 14.5
4 2014.07 3.0 23.0
.. ... ... ...
89 2021.08 0.2 15.7
90 2021.09 0.3 81.5
91 2021.10 1.6 39.2
92 2021.11 0.3 109.7
93 2021.12 0.3 33.0
[94 rows x 3 columns]
rows in rf_mon_history table: 12
month avg
0 01 265.585714
1 02 143.600000
2 03 174.887500
3 04 106.150000
4 05 81.714444
5 06 51.850000
6 07 24.687500
7 08 35.433333
8 09 125.350000
9 10 181.191250
10 11 307.092500
11 12 259.481250
Comparing that against the spreadsheet, those values look pretty good. Good enough?
Done, I Do Believe
Since I have not yet used the plotly package, I thought I might try plotting some of the data currently in the database. But I think this post covered its subject matter rather nicely, and I don’t want to add anything messing with that. So, perhaps a little charting in the next post.
Until then, I do believe a little focus never hurt any post, nor, probably, anyone.