Okay, I have the four daily note files parsed as best I can. Lots of junk in all of them because of the way I did things. Going to sort out parsing them and saving the data to the appropriate database table. Along with the extra bits of information in CSV files and/or other database tables.
Will start working with the 2022 notes file. Once I have things working I will likely combine all the files into one single file. Restore the database to its finished 2021 state. And process the combined file in one go. Similar to what I did with the first set of rainfall files.
Refactor File Generator
Okay, new file format, so have decided to refactor rg_data.get_rg_data()
to handle it as well.
def get_rg_data(rg_fl, f_typ="rg"):
'''Generator for raingauge php files, only want certain lines so decided use generator
Parameters:
rg_fl: full path to file
f_typ: which file version is file
'rg' = rain gauge style,
'dn' = daily note style,
'dna' = daily note style, but less restrictive than 'dn'
'dnp' = parsed daily note file, contains rainfall data along with misc other note data
'''
with open(rg_fl, "r", encoding='utf8') as ds:
... ...
if f_typ in ["dn", "dna", "dnp"] :
... ...
elif f_typ == 'dnp':
if not ln.startswith(("no match", "e-gauge")) and len(ln) > 10:
do_yld = True
if do_yld:
if f_typ != 'dnp':
ln = f"{c_dt} {ln}"
else:
if ln.startswith("tube"):
ln = ln[6:]
New Development/Test Loop
As is my habit, a new if
loop in some module or other for the development/test code. In this case I am going to stick with the data2db
module.
To start with I am going to get a list of file paths (pathlib) to the four new files I created. Thought about updating rg_data.get_rg_dsrcs()
; but decided against it. For now, really only a one time use.
Setup Loop
... ...
if do_dn:
chk_i_rf = False
proc_fl_1 = False
proc_fl_2 = False
c_src = 11
dn_rf_2db = True
... ...
if dn_rf_2db:
# process the data in the parsed daily notes file month by month
# add daily rainfall too database, record dates done
# for now only going to look at 2022 file
c_dnp = 0
# get list of all data sources
d_srcs = []
cwd = Path(__file__).cwd()
d_pth = cwd/"data"
s_yrs = list(range(2022, 2026))
for yr in s_yrs:
d_srcs.append(d_pth/f"dn_{yr}_parse.txt")
print(d_srcs)
print(f"{d_srcs[c_dnp].name} exists: {d_srcs[c_dnp].exists()}, is file: {d_srcs[c_dnp].is_file()}")
And in the terminal, I got the following.
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
[WindowsPath('R:/learn/dashboard/data/dn_2022_parse.txt'), WindowsPath('R:/learn/dashboard/data/dn_2023_parse.txt'), WindowsPath('R:/learn/dashboard/data/dn_2024_parse.txt'), WindowsPath('R:/learn/dashboard/data/dn_2025_parse.txt')]
dn_2022_parse.txt exists: True, is file: True
Okay, now let’s set up the file generator. Then parse and print a number of the lines we get back. I am going for 28 iterations. The generator should be returning strings that look like this: 2022.01.02 08:00: 2.26 mm, 2.26
. But immediately ran into errors. So put my code in a try/except
block. Fixed errors as they came up. Eventually did that for every file for every line returned.
... ...
print(f"Processing file: {d_srcs[c_dnp].name}\n")
# set up generator for current data file
gen_d = get_rg_data(d_srcs[c_dnp], f_typ="dnp")
# let's get and parse the first 5 lines yeilded
# make sure we get numbers where needed
for i in range(28):
d_rw = next(gen_d)
try:
rf_dt, rf_tm, d_rf, j_rf, m_rf = d_rw.split(" ")
rf_tm = rf_tm[:-1]
d_rf = float(d_rf)
m_rf = float(m_rf)
except ValueError:
print(f"\nerror: {d_rw}\n")
else:
print(f"{rf_dt} {rf_tm}: {d_rf} {j_rf[:-1]} -> {m_rf}")
And in the terminal I, finally, got the following. Which, for better or worse, does match what’s in the file.
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
Processing file: dn_2022_parse.txt
2022.01.02 08:00: 2.26 mm -> 2.26
2022.01.03 08:00: 3.39 mm -> 5.65
2022.01.04 08:00: 1.58 mm -> 7.23
2022.01.05 08:00: 9.26 mm -> 16.48
2022.01.06 19:00: 24.84 mm -> 41.32
2022.01.07 08:00: 14.23 mm -> 55.55
2022.01.08 08:00: 14.23 mm -> 69.77
2022.01.09 08:00: 14.23 mm -> 84.0
2022.01.10 08:00: 4.87 mm -> 99.87
2022.01.11 08:00: 77.92 mm -> 166.78
2022.01.12 08:00: 24.35 mm -> 191.12
2022.01.30 08:00: 11.9 mm -> 203.02
2022.01.31 08:00: 17.8 mm -> 220.82
2022.02.03 08:00: 6.8 mm -> 6.8
2022.02.04 08:00: 12.5 mm -> 19.3
2022.02.05 08:00: 4.3 mm -> 23.6
2022.02.08 08:00: 4.5 mm -> 28.1
2022.02.09 08:00: 4.3 mm -> 32.4
2022.02.11 08:00: 0.3 mm -> 32.7
2022.02.14 08:00: 6.4 mm -> 39.1
2022.02.15 08:00: 1.2 mm -> 40.3
2022.02.19 08:00: 1.8 mm -> 42.1
2022.02.20 08:00: 7.1 mm -> 49.2
2022.02.21 08:00: 0.5 mm -> 49.7
2022.02.25 08:00: 0.8 mm -> 50.5
2022.02.27 08:00: 14.7 mm -> 65.2
2022.02.28 08:00: 30.7 mm -> 95.9
2022.02.28 24:00: 5.4 mm -> 101.3
Rainfall Processing Loop
Okay, let’s move on and start working on the code to take that data and get it into the database table(s). As it is so straightforward to parse the data, I think I will just continue to do so with code in the loop. I will for each month build a list of tuples, which I will then use to add that month’s data to the database. Pretty much what we did with the earlier file type. I will also, as before, keep track of the dates processed. Ensuring I do not process a given date more than once. If an attempt is made I will post warning message to the terminal.
As this is still development mode, for now I won’t bother updating the monthly history table. Once I have things sorted and am processing the files for real, I will, once done, do so.
Okay have code that works for processing first two months of 2022. Going to refactor and test processing the whole 2022 file. Will show the refactored code below.
The if dn_rf_2db
block has grown much larger. I expect I would have coded some functions if this was something that would be repeated regularly. But, that is not the case; so, some sloppy coding. And a little extra code added to the if chk_i_rf
block.
I also ran into a problem with the dates done related code. The class was expecting a dictionary with keys in the form yyyy.mm
. I was passing it mm
. A new variable and some added code to fix that problem. Again, if this processing of months or years of data at one time was something that would be repeated in future, I would likely refactor the class in some way to simplify the processing loop code.
Here’s all the code new/revised code.
if dn_rf_2db:
# process the data in the parsed daily notes file month by month
# add daily rainfall too database, record dates done
# for now only going to look at 2022 file
c_dnp = 0
chk_i_rf = True
# get list of all data sources
d_srcs = []
cwd = Path(__file__).cwd()
d_pth = cwd/"data"
s_yrs = list(range(2022, 2026))
for yr in s_yrs:
d_srcs.append(d_pth/f"dn_{yr}_parse.txt")
print(f"Processing file: {d_srcs[c_dnp].name}\n")
# set up generator for current data file
gen_d = get_rg_data(d_srcs[c_dnp], f_typ="dnp")
# let's get and parse the first 5 lines yeilded
# make sure we get numbers where needed
p_mn, c_mn = "", "" # prev mon, curr mon
rf_rws, m_dts = [], {} # rainfall rows, tot mon, mon dates
tot_rws = 0
# for dev will use start and end month, no longer really need
# these two variables, but I was using to init done_mn
s_mn, e_mn = "01", "02"
rf_yr = s_yrs[c_dnp]
done_mn = f"{rf_yr}.{s_mn}"
s_tm = time.perf_counter()
for d_rw in gen_d:
try:
rf_dt, rf_tm, d_rf, j_rf, m_rf = d_rw.split(" ")
rf_tm = rf_tm[:-1]
dt_rf = f"{rf_dt} {rf_tm}" # datetime
d_rf = float(d_rf) # daily rainfall
m_rf = float(m_rf) # month-to-date rainfall
except ValueError:
print(f"\nerror: {d_rw}\n")
d_mn = rf_dt[5:7]
# if just starting processing file, initialize variables
if not c_mn:
c_mn = d_mn
m_dts[done_mn] = []
print(f"\tcurrent month: {c_mn} -> {done_mn}")
# if new month, save data, updt controls, etc.
if rf_dt[5:7] != c_mn:
num_rws = rfall.add_rainfall(rf_rws)
tot_rws += num_rws
dts_dn.add_month(m_dts)
rf_rws = []
p_mn, c_mn = c_mn, rf_dt[5:7]
rf_yr = rf_dt[:4]
done_mn = f"{rf_yr}.{c_mn}"
m_dts = {done_mn: []}
print(f"\tcurrent month: {c_mn} -> {done_mn}")
# process current date's rainfall
rf_rws.append((dt_rf, d_rf, m_rf))
m_dts[done_mn].append(rf_dt[-2:])
# if rf_rws not empty, updt rainfall table/dates done list
if rf_rws:
# set things up for chk_i_rf block if full or partial year
if c_mn == "12":
p_mn = "12"
else:
p_mn = c_mn
num_rws = rfall.add_rainfall(rf_rws)
tot_rws += num_rws
dts_dn.add_month(m_dts)
# update dates done csv file
dts_dn.updt_csv_file()
# update monthly rainfall history table
rfall.updt_hist_tbl()
e_tm = time.perf_counter()
print(f"{tot_rws} records added to database table")
print(f"run time: {(e_tm - s_tm):.4f} secs")
if chk_i_rf:
print("\ndatabase table check")
for c_tnm in rfall.tnms.values():
rslt = rfall.get_tbl_rw_cnt(c_tnm)
print(f"\nrows in {c_tnm} table: {rslt}")
if c_tnm == "rainfall":
nx_q = f"""SELECT substr(datetime, 1, 7) as c_month, MIN(daily) as minMonth, MAX(daily) as maxMonth
FROM {c_tnm}
GROUP BY c_month;"""
elif c_tnm == "rf_mon_history":
nx_q = f"SELECT month, avg FROM {c_tnm}"
mn_data = rfall.qry_pd(nx_q)
print("\n", mn_data)
print(f"\nselecting rows where datetime like '{s_yrs[c_dnp]}.{p_mn}%'")
nx_q = f"""SELECT * FROM rainfall WHERE datetime LIKE '{s_yrs[c_dnp]}.{p_mn}%';"""
mn_data = rfall.qry_pd(nx_q)
print(mn_data)
And in the terminal I got the following. Sorry a touch lengthy.
(dbd-3.13) PS R:\learn\dashboard> python data2db.py
Processing file: dn_2022_parse.txt
current month: 01 -> 2022.01
current month: 02 -> 2022.02
current month: 03 -> 2022.03
current month: 04 -> 2022.04
current month: 05 -> 2022.05
current month: 06 -> 2022.06
current month: 07 -> 2022.07
current month: 08 -> 2022.08
current month: 09 -> 2022.09
current month: 10 -> 2022.10
current month: 11 -> 2022.11
current month: 12 -> 2022.12
155 records added to database table
run time: 0.8998 secs
database table check
rows in rainfall table: 1381
c_month minMonth maxMonth
0 2014.03 3.0 54.0
1 2014.04 1.5 53.0
2 2014.05 5.0 54.0
3 2014.06 2.0 14.5
4 2014.07 3.0 23.0
.. ... ... ...
101 2022.08 5.5 5.5
102 2022.09 1.0 4.5
103 2022.10 1.0 53.0
104 2022.11 0.2 44.9
105 2022.12 0.3 31.0
[106 rows x 3 columns]
rows in rf_mon_history table: 12
month avg
0 01 259.990000
1 02 138.312500
2 03 175.455556
3 04 105.633333
4 05 85.073000
5 06 53.422222
6 07 24.988889
7 08 32.440000
8 09 112.533333
9 10 181.725556
10 11 290.437778
11 12 249.650000
selecting rows where datetime like '2022.12%'
row_id datetime daily monthly
0 1364 2022.12.03 08:00 2.0 2.0
1 1365 2022.12.07 08:00 5.8 7.8
2 1366 2022.12.08 08:00 3.3 11.1
3 1367 2022.12.09 08:00 3.0 14.1
4 1368 2022.12.10 08:00 15.0 29.1
5 1369 2022.12.11 08:00 9.9 39.0
6 1370 2022.12.14 08:00 0.5 39.5
7 1371 2022.12.15 08:00 0.3 39.8
8 1372 2022.12.18 08:00 1.0 40.8
9 1373 2022.12.18 08:00 11.0 51.8
10 1374 2022.12.20 08:00 31.0 82.8
11 1375 2022.12.23 12:00 10.0 92.8
12 1376 2022.12.24 08:00 9.9 102.7
13 1377 2022.12.25 08:00 24.4 127.1
14 1378 2022.12.26 08:00 16.0 143.1
15 1379 2022.12.27 08:00 14.2 157.3
16 1380 2022.12.28 08:00 4.9 162.2
17 1381 2022.12.31 08:00 8.8 171.0
Also, the history table data has changed from what it was before the added year, 2022, of data. I haven’t checked all the months, but the data displayed above for December 2022 is correct. So going to assume the other months of 2022 will also be correct.
Done
Not really finished adding the rest of the years, but I will do so year by year over the next day or two. That will save me from some additional refactoring and file combining. If I combine the files into a single processing run, I will need to account for the changing years from start to end of the file. By doing it one year at a time, the current code will handle that without any refactoring. (Yes, lazy me!)
And, there was another function/method I wanted to write. It would take as input the date, time, and rainfall amount for some given day, and write that to the database. This is something that the dashboard will need to do each day in order to keep things up-to-date.
Going to leave that for the next, probably very short, post.
Until then, I wish you much joy from the successes in your hard fought battles. I am still feeling that joy working on this project.