Note:
With my efforts on spirographs currently in limbo, I am totally lost as to what to attempt or talk about next. So, the posts for the next while are definitely going to be lacking in direction and/or purpose. Sorry! And, having recently contracted Covid, I don’t see things improving anytime soon.
Ever since I first encountered regular expressions, many years ago in Perl, they have fascinated me. Back then, I would try to use them for anything and everything. Likely overkill in many cases, quite impossible in others. But I never really got good at anything but the simplest of uses.
When I started doing the Wordle and Canuckle puzzles (I try to do them every day), I figured I could use regular expressions (regex) to my advantage. I decided to write a brief post on that experience. And who knows, I may write more posts on other uses of regular expressions.
A simple string search won’t work. I don’t know of any such search that can say “if the word has some character in it, don’t match that word.” That is exactly the kind of job regular expressions are good at. They provide a whole lot of options when searching through text. In some cases, too many options, I expect.
Simple Start
I started out pretty simply: for each position, specify what can’t be there or what was known to be there. Then search through various files on my computer. Note: I eventually created a file full of five-letter words pulled from various web sites.
For example, let’s consider this initial guess.
Okay, we know where one letter belongs and a whole bunch that don’t. My regex would look like this.
\b[^cost][^cost]a[^cost][^cost]\b
The \b is a regular expression’s way of specifying a word boundary. The negative character classes ([^...]) specify which letters can’t be at that spot in the word. A single character says that character must be at this spot in the word. So far so good.
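As a rough sketch of how such a search might be run, here is a bit of Python. The language choice and the tiny words list are assumptions made purely for illustration (the real search ran over files of words); only the regex comes from above.

```python
import re

# Hypothetical mini word list, standing in for a file of five-letter words.
words = ["coast", "beard", "flank", "llama", "plain", "human", "quail"]

# Third letter must be a; c, o, s and t are ruled out at every other spot.
pattern = re.compile(r"\b[^cost][^cost]a[^cost][^cost]\b")

matches = [w for w in words if pattern.search(w)]
print(matches)  # ['beard', 'flank', 'llama', 'plain', 'quail']
```

coast is rejected because of its first letter, and human because its third letter isn’t an a.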
You may wonder why I repeat the character class for each location. For example, I could have used the following.
\b[^cost]{2}a[^cost]{2}\b
The {2} tells the regex engine to match the preceding character class exactly twice. I used the more verbose version so I could more visually deal with characters that belonged in the word but without a known location.
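A quick way to convince yourself that the two forms behave the same is to run both over the same words. This is only a sketch in Python, and the sample words are made up.

```python
import re

verbose = re.compile(r"\b[^cost][^cost]a[^cost][^cost]\b")
compact = re.compile(r"\b[^cost]{2}a[^cost]{2}\b")

# Hypothetical sample words: some should match, some should not.
words = ["beard", "coast", "human", "plain", "toast", "quail"]

# For every word, both patterns should agree on match / no match.
same = all(bool(verbose.search(w)) == bool(compact.search(w)) for w in words)
print(same)  # True
```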
And after another bad guess.
No really good new information. Those negative character classes are just getting bigger.
\b[^costberd][^costberd]a[^costberd][^costberd]\b
In the next guess we do get a bit more information, a letter that is in the word but not in the correct position. For the longest time, I just included it in the negative character class for that location, after a space. That was my way of visually keeping track of characters that should be in the word, but not yet in the right location. For example:
So, we have more undesirable letters, but we also now know that an l belongs in the word. The regular expression would now look like the following.
\b[^costberdfnk][^costberdfnk l]a[^costberdfnk][^costberdfnk]\b
That sort of worked, but I ended up looking at a lot of words I didn’t really want to see, i.e. words that matched the regex but didn’t have an l in them. I, with my aging brain, often tried words that were missing one of the required but not specifically located letters.
There had to be a better way to define the regular expression. One that checked for the existence of the required letter(s) somewhere in the word, before analyzing it any further.
Lookarounds to the Rescue
And one of the features of modern regular expression engines handles exactly that need: lookarounds. Lookarounds are zero-length assertions. The lookaround actually matches characters, but then discards the matched characters, simply returning match or no match. It is important to remember that a lookaround only looks around from the position the engine is currently at within the text being searched. A lookahead’s pattern has to match starting at that position; a lookbehind’s has to match ending there.
For my case, what I needed was a lookahead assertion. At the beginning of my regular expression, I would add a lookahead for each letter I needed to have in the word but whose location was not yet known. So, that last regular expression now (many months of confused effort after I started using them) looks like the following.
\b(?=\w*l)[^costberdfnk][^costberdfnk l]a[^costberdfnk][^costberdfnk]\b
What that (?=\w*l) assertion is asking/saying is: from the start of a word, looking ahead (toward the end of the word), are there any number of word characters followed by an l? That \w* is absolutely essential. If I used (?=l), the lookahead would only have considered the first letter in the word. The \w* pattern allows looking through the whole word if necessary. And because \w only matches word characters, the lookahead never looks past the end of the current word.
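To see the difference the lookahead makes, here is a small Python sketch comparing three variants: no lookahead, the (?=\w*l) lookahead, and the too-short (?=l). The word list is hypothetical, picked so that a few words get past the character classes.

```python
import re

# The position-by-position part of the regex, shared by all three variants.
body = r"[^costberdfnk][^costberdfnk l]a[^costberdfnk][^costberdfnk]\b"

no_lookahead = re.compile(r"\b" + body)
with_lookahead = re.compile(r"\b(?=\w*l)" + body)
first_letter_only = re.compile(r"\b(?=l)" + body)

# Hypothetical candidates, not a real word file.
words = ["quail", "qualm", "guava", "human", "flank", "plaza"]

def hits(pattern):
    return [w for w in words if pattern.search(w)]

print(hits(no_lookahead))       # ['quail', 'qualm', 'guava']
print(hits(with_lookahead))     # ['quail', 'qualm']
print(hits(first_letter_only))  # []
```

Without the lookahead, guava slips through even though it has no l. With (?=l), nothing in this list survives, because only words starting with l could pass the assertion.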
And, now I don’t need to look through words that don’t have a chance of being correct choices. I really do love regular expressions!
Another Example
Here’s an example from Canuckle. I frequently start Canuckle puzzles with coast; which is the case here.
\b(?=\w*a)(?=\w*s)(?=\w*t)[^co][^co][^co a][^co s][^co t]\b
So, three lookahead assertions, plus the usual “don’t want these here” and/or “this one goes here” expressions.
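Since the regex follows a fixed recipe, it could also be assembled by a small helper. The following Python function is a hypothetical sketch, not something used for the puzzles in this post: build_pattern and its parameters (absent, misplaced, known) are names invented here, positions are counted from 0, and it assumes plain lowercase letters with no special handling of repeated letters.

```python
import re

def build_pattern(absent, misplaced, known, length=5):
    """Assemble a Wordle-style regex (hypothetical helper).

    absent:    letters known not to be in the word, e.g. "co"
    misplaced: {position: letters in the word but not at that position}
    known:     {position: letter settled at that position}
    """
    # Letters that must appear somewhere but are not yet pinned down.
    required = sorted(set("".join(misplaced.values())) - set(known.values()))
    lookaheads = "".join(f"(?=\\w*{c})" for c in required)

    parts = []
    for i in range(length):
        if i in known:
            parts.append(known[i])
        else:
            extra = misplaced.get(i, "")
            # Same visual convention as above: misplaced letters after a space.
            parts.append("[^" + absent + (" " + extra if extra else "") + "]")
    return r"\b" + lookaheads + "".join(parts) + r"\b"

# After the first Canuckle guess: c and o are out; a, s and t are misplaced.
first = build_pattern("co", {2: "a", 3: "s", 4: "t"}, {})
print(first)

# After the second guess: t and a are settled, l is out, s is still loose.
second = build_pattern("coal", {2: "a", 4: "ts"}, {1: "t", 3: "a"})
print(second)
```

The first call should reproduce the regex shown above. The second call corresponds to the state after the next guess in this example.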
And the next guess. Not really sure why I chose that one; but two letters settled.
And, the ensuing regex.
\b(?=\w*s)[^coal]t[^coal a]a[^coal ts]\b
Well, beef is big in Canada, so I figured let’s try steak.
Three known letters settled. But, 2 empty spaces and no hints. And the consequent regex.
\bst[^coalek ts]a[^coalek ts]\b
Well, you know, lots of hay goes into feeding all those cattle. So, I tried straw as a five-letter stand-in for hay.
And the final, and correct, guess.
And, just so you know, the strap being referred to is the jock strap. Quoting the Canuckle site:
Guelph, Ontario has cemented its name in the history of sport by being the birthplace of the athletic supporter. The device was invented in 1922 by the Guelph Elastic Hosiery Company (now Protexion Industries), which later held a contest to name its new product. The name ‘jock strap’ was chosen and the winner of the contest got a cash prize of five dollars, worth $88.69 today.
As a Canuck, I do enjoy those little tidbits that come with each day’s puzzle.
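Looking back over the Canuckle example, the three regexes can be run against the same handful of words to watch the candidates shrink. A minimal Python sketch, assuming a made-up word list (the three patterns are the ones shown above):

```python
import re

patterns = [
    r"\b(?=\w*a)(?=\w*s)(?=\w*t)[^co][^co][^co a][^co s][^co t]\b",
    r"\b(?=\w*s)[^coal]t[^coal a]a[^coal ts]\b",
    r"\bst[^coalek ts]a[^coalek ts]\b",
]

# Hypothetical candidates, not the real word file.
words = ["steak", "straw", "strap", "stray", "stamp", "taste", "yeast"]

survivors = [[w for w in words if re.search(p, w)] for p in patterns]
for found in survivors:
    print(found)
# ['steak', 'straw', 'strap', 'stray', 'taste']
# ['steak', 'straw', 'strap', 'stray']
# ['straw', 'strap', 'stray']
```

strap, the eventual answer, makes it through all three.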
One Final Example
I didn’t solve this one. I don’t usually start my Wordle puzzle with coast, but somehow chose to do so for this one. Apparently a significantly bad choice. I won’t bother with the resulting regex. My next guess was barely better, biker. But a touch better.
And the resulting regex now looks like the following.
\b(?=\w*r)[^coastbike][^coastbike][^coastbike][^coastbike][^coastbike r]\b
When my search found gruff I thought that might just be crazy enough to be right.
Not much better. That regex is beginning to get lengthy.
\b(?=\w*r)(?=\w*u)[^coastbikegf][^coastbikegf r][^coastbikegf u][^coastbikegf][^coastbikegf r]\b
I thought I should start speeding up the process. The resulting guess did help a touch.
Well at least the attendant regex is a fair bit shorter.
\b(?=\w*r)[^coastbikegfh]u[^coastbikegfhr u][^coastbikegfhr]y\b
Found a few possible words; after some, perhaps, thought, I chose the following.
A bit of an improvement in the regex.
\bru[^coastbikegfhrn u][^coastbikegfhrn]y\b
But no luck with the last guess. It did seem a little too obvious.
By the way, the solution was ruddy. Which I did find in my regex search even before trying runny. But didn’t think it to be a likely solution so didn’t try it.
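As a sanity check, here is a short Python sketch that tests ruddy and runny against each regex from this example. One assumption: the third pattern is written with a leading \b word boundary, like all the others.

```python
import re

patterns = [
    r"\b(?=\w*r)[^coastbike][^coastbike][^coastbike][^coastbike][^coastbike r]\b",
    r"\b(?=\w*r)(?=\w*u)[^coastbikegf][^coastbikegf r][^coastbikegf u][^coastbikegf][^coastbikegf r]\b",
    # Assumed to start with \b, consistent with the other patterns.
    r"\b(?=\w*r)[^coastbikegfh]u[^coastbikegfhr u][^coastbikegfhr]y\b",
    r"\bru[^coastbikegfhrn u][^coastbikegfhrn]y\b",
]

def check(word):
    """Return one True/False per pattern, in order."""
    return [re.search(p, word) is not None for p in patterns]

print(check("ruddy"))  # [True, True, True, True]
print(check("runny"))  # [True, True, True, False]
```

ruddy passes every regex, while runny only drops out at the last one, once n has been ruled out.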
Done
Don’t think those last two examples were of much real use to someone beginning to play with regular expressions. But I had saved them, so I decided to include them rather than try a completely different exercise.
As mentioned, more posts on other uses of regular expressions may yet show up in the blog.
Until then, keep challenging your mind and keep those fingers dancing. Well, to be entirely honest, my fingers only really dance when working on a blog post. Not so much when working on code.
Resources
- The absolute bare minimum every programmer should know about regular expressions
- 5 Regular Expressions Every Web Programmer Should Know
- Extreme regex foo: what you need to know to become a regular expression pro
- Lookahead and Lookbehind Zero-Length Assertions
- The Regex Tutorial Challenge
- Regex Tuesday - Challenges
- Regular Expressions: Now You Have Two Problems