A common exercise on regular expression challenge sites is matching IPv4 addresses. Most only look at matching the dotted decimal format. But, there are other formats allowed. In fact, IPv4 addresses may be represented in any notation expressing a 32-bit integer value.
I am going to start with the dotted decimal, then see if I can add the dotted hexadecimal, dotted octal, hexadecimal, decimal and octal representations. I expect this will become one messy regular expression. So, I may cheat and use Python to apply repeated expressions until one matches or they all fail. May use a few extra CPU cycles, but may be a lot easier to understand.
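As a rough sketch of that fallback idea (not the approach this post ends up taking), the following tries a list of simpler patterns in turn and reports the first format that matches. The names FORMATS and classify are hypothetical, the individual patterns are borrowed from later in the post, and handing the decimal range check to int() rather than the regex is my assumption about what "cheating" might look like.

```python
import re

# One dotted element: decimal 0-255, 0x plus two hex digits, or 0 plus three octal digits.
OCTET = r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})'

# Tried in order; the first pattern that matches the whole string wins.
FORMATS = [
    ('dotted', re.compile(rf'(?:{OCTET}\.){{3}}{OCTET}', re.IGNORECASE)),
    ('hexadecimal', re.compile(r'0x[\da-f]{8}', re.IGNORECASE)),
    ('octal', re.compile(r'0[0-3][0-7]{2,10}')),
    ('decimal', re.compile(r'\d{1,10}')),
]


def classify(addr):
    """Return the name of the first matching format, or None if all fail."""
    for name, rgx in FORMATS:
        if rgx.fullmatch(addr):
            # The decimal pattern is deliberately loose; Python does the range check.
            if name == 'decimal' and int(addr) > 0xFFFFFFFF:
                return None
            return name
    return None


for addr in ['192.0.2.235', '0xC00002EB', '030000001353', '3221226219', '4294967296']:
    print(f"{addr} -> {classify(addr)}")
```

Each pattern stays short enough to read at a glance, at the cost of a loop and a few extra match attempts per address.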
Dotted Decimal
The numeric values in any dotted format are commonly referred to as ‘octets’. That’s because, whether decimal, hexadecimal or octal, each one represents 8 binary digits of the full 32-bit binary IPv4 address.
Naive Approach
For most of us, our initial thought would be 4 octets of 1-3 digits separated by periods/dots. And, since regexes use . to represent any character, we will need to escape our decimal point, \., in our expression. Something like the following. Since I only want to check if the input is a valid IPv4 address, I am not specifying word boundaries and am using non-capturing groups, (?: ), for the first three elements. Which by the way include the decimal point.
(?:\d{1,3}\.){3}\d{1,3}
Some Python code to test the above pattern.
N.B. most of the test cases I am using come from the Regex Tuesday Challenge - Week Six. Thank you Callum Macrae.
import re

ddc_ok = [
    '192.0.2.235',
    '99.198.122.146',
    '18.101.25.153',
    '23.71.254.72',
    '100.100.100.100',
    '173.194.34.134',
    '212.58.241.131',
    '46.51.197.88',
]
ddc_no = [
    '256.256.256.256',
    '925.254.255.254'
]

rgx = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

print("\nThe following should all pass:")
for tst in ddc_ok:
    r_tst = rgx.match(tst)
    print(f"\t{tst} -> {'valid' if r_tst else 'not valid'}")

print("\nThe following should all fail:")
for tst in ddc_no:
    r_tst = rgx.match(tst)
    print(f"\t{tst} -> {'valid' if r_tst else 'not valid'}")
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py
The following should all pass:
192.0.2.235 -> valid
99.198.122.146 -> valid
18.101.25.153 -> valid
23.71.254.72 -> valid
100.100.100.100 -> valid
173.194.34.134 -> valid
212.58.241.131 -> valid
46.51.197.88 -> valid
The following should all fail:
256.256.256.256 -> valid
925.254.255.254 -> valid
A Little Less Naive
And we can see that the naive approach does not work. Our regular expression allows any digit in any of the three locations in each decimal value. But the values can only be 0-255. So we need something more restrictive than \d{1,3}. So, again let’s go simple and limit the first of 3 possible digits to 1 or 2. Something like the following.
(?:[12]?\d{1,2}\.){3}[12]?\d{1,2}
After modifying the above code accordingly, the results haven’t much improved.
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py
The following should all pass:
192.0.2.235 -> valid
99.198.122.146 -> valid
18.101.25.153 -> valid
23.71.254.72 -> valid
100.100.100.100 -> valid
173.194.34.134 -> valid
212.58.241.131 -> valid
46.51.197.88 -> valid
The following should all fail:
256.256.256.256 -> valid
925.254.255.254 -> not valid
Looks Like the Solution
And you are of course saying that if the first of three digits is a 2, the range for the 2nd digit is 0-5 inclusive. And, if that 2nd digit is a 5, then the 3rd also has to be in the range 0-5 inclusive. We will start by looking for a 25[0-5], then a 2[0-4]\d and test accordingly. But we will also have to check for a 0 or 1 as the first of three digits. That way we can exclude anything above 2xx. And, because we are using alternation (|) to check for various combinations we need to add a few more non-capturing groups. And, we want at least one digit in each octet.
Note: [01]? asks for a zero or 1 or nothing. ? says zero or one of whatever precedes it. I.e. it is optional.
(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py
The following should all pass:
192.0.2.235 -> valid
99.198.122.146 -> valid
18.101.25.153 -> valid
23.71.254.72 -> valid
100.100.100.100 -> valid
173.194.34.134 -> valid
212.58.241.131 -> valid
46.51.197.88 -> valid
The following should all fail:
256.256.256.256 -> not valid
925.254.255.254 -> not valid
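Two failing test cases is not a lot of evidence, so here is a small sketch that checks the octet alternation on its own against every number from 0 to 999. The names OCTET and mismatches are mine, and I am using fullmatch() so that each number has to match in its entirety.

```python
import re

# The octet alternation from the expression above, on its own.
OCTET = re.compile(r'25[0-5]|2[0-4]\d|[01]?\d\d?')

# Compare the regex verdict with a plain integer comparison for 0..999.
mismatches = [
    n for n in range(1000)
    if bool(OCTET.fullmatch(str(n))) != (n <= 255)
]
print(mismatches or "regex and integer check agree for 0-999")
```

An empty mismatches list means the alternation accepts exactly 0 through 255 when the numbers are written without leading zeroes.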
And, I think that does the trick. Though, we must note that if we are searching for valid IPv4 addresses in a text document setting we would need to add word boundaries to our regex. And, we should also likely allow for leading zeroes in each dotted decimal value. But I will ignore that for the moment.
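For the text-searching case, a minimal sketch of what the word boundaries buy us. The sample text and the names BARE and BOUNDED are made up for illustration.

```python
import re

OCTET = r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
BARE = re.compile(rf'(?:{OCTET}\.){{3}}{OCTET}')
BOUNDED = re.compile(rf'\b(?:{OCTET}\.){{3}}{OCTET}\b')

text = "ping 192.0.2.235 ok, 256.100.100.100 bad, 10.0.0.1 ok"

# Without boundaries the search can start in the middle of a number,
# so the tail of 256.100.100.100 is picked up as 56.100.100.100.
bare_hits = BARE.findall(text)

# With \b on both ends a match has to start and stop at the edge of a number.
bounded_hits = BOUNDED.findall(text)

print(bare_hits)
print(bounded_hits)
```

One caveat: a dot is not a word character, so \b alone would still pull 1.2.3.4 out of 1.2.3.4.5. Ruling that out would need lookarounds or some post-processing.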
Dotted Hexadecimal and Octal
I am not going to mess with all the naive versions discussed above. But, depending how I make out, there may be a failure or two shown and discussed.
Well, likely going to be easier than I thought. Decimal 255 => hex 0xff => octal 0377. Convention says 0x before 2 hexadecimal digits. And 0 before 3 octal digits. For dotted formats, if 3 or fewer digits and no x, it is considered decimal.
PS R:\learn\py_play> perl -e "printf('%o', 0xff);"
377
Which it turns out makes adding dotted hex and dotted octal surprisingly straightforward. The hex pattern is simply 0x[\da-f]{2}. That is two of any decimal digit (\d) or the letters a to f inclusive. And the octal: 0[0-3][0-7]{2}. I.e. 0 to 3 inclusive followed by 2 digits between 0 and 7 inclusive. I am ignoring the leading 0x or 0 in the explanations of the regexes.
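A quick sketch to confirm that those two patterns cover every octet value and nothing beyond it. HEX_OCTET and OCT_OCTET are my names, and the fixed-width formatting (two hex digits, three octal digits) is my reading of the convention described above.

```python
import re

HEX_OCTET = re.compile(r'0x[\da-f]{2}', re.IGNORECASE)
OCT_OCTET = re.compile(r'0[0-3][0-7]{2}')

# Every value 0-255, written as 0x + two hex digits and 0 + three octal digits.
hex_ok = all(HEX_OCTET.fullmatch(f'0x{n:02x}') for n in range(256))
oct_ok = all(OCT_OCTET.fullmatch(f'0{n:03o}') for n in range(256))

# 256 is the first value that no longer fits in 8 bits.
hex_over = HEX_OCTET.fullmatch(f'0x{256:02x}')  # the string '0x100'
oct_over = OCT_OCTET.fullmatch(f'0{256:03o}')   # the string '0400'

print(hex_ok, oct_ok, hex_over, oct_over)
```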
Test Code
I did refactor the test code somewhat. I added a dictionary keyed on an address format. Each key’s value is a list providing a format name/description, the variable for the valid tests and the variable for the not valid format tests. Then I provide a list of the tests I currently wish to run. I won’t show all the test cases. They will be displayed in the module’s output.
And do note the continuation operator, \, for the string defining the regex. And, because the test cases included capitalized hex digits, I needed to ignore case.
Also, in testing, one of the non specific cases (trailing decimal point/dot) passed when it should have failed. And, that’s because the pattern was not anchored at the end, so a match on the leading part of the string counted as a pass. Word boundaries, \b, would not help here, as there is a word boundary between the final digit and a trailing dot. Since each test string is a single candidate address, I am using the start and end of string anchors, ^ and $, instead.
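To see the anchoring issue in isolation, here is a small sketch using only the dotted decimal pattern. It also shows re.fullmatch(), which I am offering as an alternative to ^ and $ rather than as what the test code uses. The variable names are hypothetical.

```python
import re

OCTET = r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
DOTTED = rf'(?:{OCTET}\.){{3}}{OCTET}'

unanchored = re.compile(DOTTED)
anchored = re.compile(rf'^{DOTTED}$')

trailing_dot = '100.100.100.100.'
# match() only pins the start, so the leading part of the string is enough.
loose = bool(unanchored.match(trailing_dot))
# The end anchor rejects the leftover dot.
strict = bool(anchored.match(trailing_dot))
# fullmatch() needs no anchors in the pattern at all.
full = bool(unanchored.fullmatch(trailing_dot))
print(loose, strict, full)

# One difference: $ also matches just before a newline at the end of the string.
trailing_newline = '100.100.100.100\n'
dollar_newline = bool(anchored.match(trailing_newline))
full_newline = bool(unanchored.fullmatch(trailing_newline))
print(dollar_newline, full_newline)
```

If input might arrive with a trailing newline, fullmatch() (or \Z in place of $) is the stricter choice.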
... ...
d_tst = {
    'ddc': ['dotted decimal', ddc_ok, ddc_no],
    'dhx': ['dotted hexadecimal', dhx_ok, dhx_no],
    'doc': ['dotted octal', doc_ok, doc_no],
    'dec': ['decimal', dec_ok, dec_no],
    'hex': ['hexadecimal', hex_ok, hex_no],
    'oct': ['octal', oct_ok, oct_no],
    'mix': ['mixed octets', mix_ok, None],
    'gen': ['non specific cases', None, gen_no]
}

# Each piece of the pattern is its own raw string, so \d is never read as a string escape.
rgx = re.compile(r'^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})\.){3}'\
                 r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})$', re.IGNORECASE)

c_tsts = ['ddc', 'dhx', 'doc', 'mix', 'gen']
for f_tst in c_tsts:
    print(f"\n{d_tst[f_tst][0]}")
    if d_tst[f_tst][1] is not None:
        print("\tThe following should all pass:")
        for tst in d_tst[f_tst][1]:
            r_tst = rgx.match(tst)
            print(f"\t\t{tst} -> {'valid' if r_tst else 'not valid'}")
    if d_tst[f_tst][2] is not None:
        print("\tThe following should all fail:")
        for tst in d_tst[f_tst][2]:
            r_tst = rgx.match(tst)
            print(f"\t\t{tst} -> {'valid' if r_tst else 'not valid'}")
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py
dotted decimal
The following should all pass:
192.0.2.235 -> valid
99.198.122.146 -> valid
18.101.25.153 -> valid
23.71.254.72 -> valid
100.100.100.100 -> valid
173.194.34.134 -> valid
212.58.241.131 -> valid
46.51.197.88 -> valid
The following should all fail:
256.256.256.256 -> not valid
925.254.255.254 -> not valid
dotted hexadecimal
The following should all pass:
0xC0.0x00.0x02.0xEB -> valid
0xFF.0x12.0xF1.0x1F -> valid
0x11.0x22.0x33.0x44 -> valid
The following should all fail:
0x100.0x11.0x11.0x11 -> not valid
0x11.0x100.0x11.0x11 -> not valid
0xx20.0x20.0x20.0x20 -> not valid
dotted octal
The following should all pass:
0300.0000.0002.0353 -> valid
0377.0377.0377.0377 -> valid
0100.0100.0100.0100 -> valid
0177.0.0.01 -> valid
The following should all fail:
0180.0100.0100.0100 -> not valid
0100.0100.0109.0100 -> not valid
mixed octets
The following should all pass:
0300.19.0.2 -> valid
99.0377.4.0002 -> valid
0xFF.255.0377.0x12 -> valid
non specific cases
The following should all fail:
0x20.0x50.0x2 -> not valid
.100.100.100.100 -> not valid
100..100.100.100. -> not valid
100.100.100.100. -> not valid
256.100.100.100.100 -> not valid
100.100.100.100.0x40 -> not valid
32 Bit Numbers
As mentioned above, an IP address is actually a 32-bit binary number. All other representations are conveniences (to some extent or other) for humans interacting with code and/or machines.
Now the maximum value for a 32-bit binary number can be represented in hex as 0xffffffff. Each f represents 4 bits of the 32-bit binary number. Specifically 1111. And, for octal and decimal we get the following equivalent values.
PS R:\learn\py_play> perl -e "printf('%x', 0b11111111111111111111111111111111);"
ffffffff
PS R:\learn\py_play> perl -e "printf('%o', 0xffffffff);"
37777777777
PS R:\learn\py_play> perl -e "print(0xffffffff);"
4294967295
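The same idea in Python, as a sketch: packing the four octets of a dotted decimal address into one 32-bit integer and then writing that integer in the other notations. The helper name dotted_to_int is hypothetical and it assumes well-formed decimal octets.

```python
def dotted_to_int(addr):
    """Pack four decimal octets into one 32-bit integer, most significant first."""
    value = 0
    for octet in addr.split('.'):
        # Shift what we have so far up by 8 bits and drop the next octet in.
        value = (value << 8) | int(octet)
    return value


n = dotted_to_int('192.0.2.235')
as_decimal = str(n)
as_hex = f'0x{n:08X}'
as_octal = f'0{n:011o}'
print(as_decimal, as_hex, as_octal)

# And the 32-bit maximum in octal and decimal.
print(f'{0xffffffff:o}', 0xffffffff)
```

For 192.0.2.235 this gives 3221226219, 0xC00002EB and 030000001353, which line up with the first entries in the decimal and octal test lists used later on.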
Hex and Octal
I will start with the non-decimal formats as they are once again pretty straightforward. Don’t see that being the case for the decimal format.
For both we are pretty much looking at something similar to the dotted format, but with more digits. And, unfortunately, as we are adding more alternative patterns, we will need another non-capturing group to enclose the new alternations. Something like the following on multiple lines for readability.
rgx = re.compile(r'^(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})\.){3}'\
                 r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})'\
                 r'|0x[\da-f]{8}|0[0-3][0-7]{2,10})$', re.IGNORECASE)
c_tsts = ['ddc', 'dhx', 'doc', 'mix', 'hex', 'oct', 'gen']
I have also added some extra tests for the above regex. And, the regex seems to work as intended.
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py
dotted decimal
The following should all pass:
192.0.2.235 -> valid
99.198.122.146 -> valid
18.101.25.153 -> valid
23.71.254.72 -> valid
100.100.100.100 -> valid
173.194.34.134 -> valid
212.58.241.131 -> valid
46.51.197.88 -> valid
The following should all fail:
256.256.256.256 -> not valid
925.254.255.254 -> not valid
dotted hexadecimal
The following should all pass:
0xC0.0x00.0x02.0xEB -> valid
0xFF.0x12.0xF1.0x1F -> valid
0x11.0x22.0x33.0x44 -> valid
The following should all fail:
0x100.0x11.0x11.0x11 -> not valid
0x11.0x100.0x11.0x11 -> not valid
0xx20.0x20.0x20.0x20 -> not valid
dotted octal
The following should all pass:
0300.0000.0002.0353 -> valid
0377.0377.0377.0377 -> valid
0100.0100.0100.0100 -> valid
0177.0.0.01 -> valid
The following should all fail:
0180.0100.0100.0100 -> not valid
0100.0100.0109.0100 -> not valid
mixed octets
The following should all pass:
0300.19.0.2 -> valid
99.0377.4.0002 -> valid
0xFF.255.0377.0x12 -> valid
hexadecimal
The following should all pass:
0xC00002EC -> valid
0xFF12F11F -> valid
0x11223344 -> valid
The following should all fail:
0x100111111 -> not valid
0x111001111 -> not valid
octal
The following should all pass:
030000001353 -> valid
030000001354 -> valid
037704570437 -> valid
02110431504 -> valid
037777777777 -> valid
The following should all fail:
047777777777 -> not valid
0377777777777 -> not valid
non specific cases
The following should all fail:
0x20.0x50.0x2 -> not valid
.100.100.100.100 -> not valid
100..100.100.100. -> not valid
100.100.100.100. -> not valid
256.100.100.100.100 -> not valid
100.100.100.100.0x40 -> not valid
037777777778 -> not valid
011377000000000000008 -> not valid
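As a further check on the two new alternatives, here is a sketch that formats a handful of integers as full-width hex and octal strings and compares the regex verdict with a plain range check. The names HEX32, OCT32 and samples are mine, and the zero-padded formatting is an assumption; the octal pattern itself also accepts shorter strings.

```python
import re

HEX32 = re.compile(r'0x[\da-f]{8}', re.IGNORECASE)
OCT32 = re.compile(r'0[0-3][0-7]{2,10}')

MAX_IPV4 = 0xFFFFFFFF
samples = [0, 1, 255, 256, 2**31, MAX_IPV4, MAX_IPV4 + 1]

results = {}
for n in samples:
    hex_str = f'0x{n:08x}'   # 0x + at least 8 hex digits
    oct_str = f'0{n:011o}'   # 0 + at least 11 octal digits
    results[n] = (bool(HEX32.fullmatch(hex_str)), bool(OCT32.fullmatch(oct_str)))
    print(f"{n:>10} {hex_str:>12} {oct_str:>13} -> {results[n]}")
```

Only the last sample, one past the 32-bit maximum, should be rejected by both patterns: it needs a 9th hex digit, and its leading octal digit is a 4.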
Decimal
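This one is going to take some work. Determining when the number is too large may not be possible. At least not easily.
We know our maximum permitted decimal value is 4294967295. So for sure any number of up to 9 digits, or any 10 digit number starting with 1, 2 or 3, will be valid. Note that we need at least one digit, otherwise an empty string would match.
[1-3]?\d{1,9}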
Adding that to our regex and testing the potential decimal only addresses, I get the following.
decimal
The following should all pass:
3221226219 -> valid
2130706433 -> valid
287454020 -> valid
4279431455 -> not valid
4294967295 -> not valid
The following should all fail:
4294967296 -> not valid
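So, getting there. The first pattern only covers 10 digit values that start with 1, 2 or 3, so those starting with 40 and 41 need their own alternative. I think we can also add 4[01]\d{0,8}|42[0-8]\d{0,7}. Quick test.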
decimal
The following should all pass:
3221226219 -> valid
2130706433 -> valid
287454020 -> valid
4279431455 -> valid
4294967295 -> not valid
The following should all fail:
4294967296 -> not valid
Still have a few cases to deal with. And I expect this is going to get very lengthy. There will be a large number of alternatives for the decimal address format. Pretty much one for each digit. For example, the next one checks for initial digits of 429 then limits what the next digit can be. Then 4294, etc.
I will skip the intermediate steps. I am only going to show the tail end of the regex dealing specifically with a decimal number address format. Split over multiple lines for readability.
|[1-3]?\d{1,9}|4[01]\d{0,8}|42[0-8]\d{0,7}|429[0-3]\d{0,6}|4294[0-8]\d{0,5}
|42949[0-5]\d{0,4}|429496[0-6]\d{0,3}|4294967[01]\d{0,2}|42949672[0-8]\d
|429496729[0-5]
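With this many alternatives it is easy to leave a gap, so here is a sketch that cross-checks a decimal alternation of this shape against a plain integer comparison. The pattern below is my own transcription with fixed-length tails, so treat it as an assumption rather than the exact expression used in this post. Note that 10 digit values starting with 40 need a branch too, hence 4[01].

```python
import random
import re

MAX_IPV4 = 0xFFFFFFFF  # 4294967295

# Hypothetical transcription of the decimal branch: one alternative per
# leading-digit prefix of the maximum value.
DECIMAL = re.compile(
    r'[1-3]?\d{1,9}|4[01]\d{8}|42[0-8]\d{7}|429[0-3]\d{6}|4294[0-8]\d{5}'
    r'|42949[0-5]\d{4}|429496[0-6]\d{3}|4294967[01]\d{2}|42949672[0-8]\d'
    r'|429496729[0-5]'
)


def regex_says_valid(n):
    """True if the decimal string for n matches the alternation in full."""
    return DECIMAL.fullmatch(str(n)) is not None


# The largest value each alternative should accept, and the value just after it.
edges = [0, 999999999, 1000000000, 3999999999, 4000000000, 4199999999,
         4289999999, 4293999999, 4294899999, 4294959999, 4294966999,
         4294967199, 4294967289, 4294967295]
candidates = edges + [n + 1 for n in edges]

# Plus a seeded random sample across all values of up to 10 digits.
rng = random.Random(0)
candidates += [rng.randrange(10**10) for _ in range(10_000)]

mismatches = [n for n in candidates if regex_says_valid(n) != (n <= MAX_IPV4)]
print(mismatches or "regex and integer check agree")
```

A sample is not a proof, but the edge values sit on either side of every prefix boundary, which is where a missing or off-by-one alternative would show up.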
Final Regex and Tests
Here’s my final regex and the test/validation results.
rgx = re.compile(r'^(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})\.){3}'\
                 r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})'\
                 r'|0x[\da-f]{8}|0[0-3][0-7]{2,10}'\
                 r'|[1-3]?\d{1,9}|4[01]\d{0,8}|42[0-8]\d{0,7}|429[0-3]\d{0,6}|4294[0-8]\d{0,5}'\
                 r'|42949[0-5]\d{0,4}|429496[0-6]\d{0,3}|4294967[01]\d{0,2}|42949672[0-8]\d'\
                 r'|429496729[0-5])$', re.IGNORECASE)
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py
dotted decimal
The following should all pass:
192.0.2.235 -> valid
99.198.122.146 -> valid
18.101.25.153 -> valid
23.71.254.72 -> valid
100.100.100.100 -> valid
173.194.34.134 -> valid
212.58.241.131 -> valid
46.51.197.88 -> valid
The following should all fail:
256.256.256.256 -> not valid
925.254.255.254 -> not valid
dotted hexadecimal
The following should all pass:
0xC0.0x00.0x02.0xEB -> valid
0xFF.0x12.0xF1.0x1F -> valid
0x11.0x22.0x33.0x44 -> valid
The following should all fail:
0x100.0x11.0x11.0x11 -> not valid
0x11.0x100.0x11.0x11 -> not valid
0xx20.0x20.0x20.0x20 -> not valid
dotted octal
The following should all pass:
0300.0000.0002.0353 -> valid
0377.0377.0377.0377 -> valid
0100.0100.0100.0100 -> valid
0177.0.0.01 -> valid
The following should all fail:
0180.0100.0100.0100 -> not valid
0100.0100.0109.0100 -> not valid
mixed octets
The following should all pass:
0300.19.0.2 -> valid
99.0377.4.0002 -> valid
0xFF.255.0377.0x12 -> valid
decimal
The following should all pass:
3221226219 -> valid
2130706433 -> valid
287454020 -> valid
4279431455 -> valid
4294967295 -> valid
The following should all fail:
4294967296 -> not valid
hexadecimal
The following should all pass:
0xC00002EC -> valid
0xFF12F11F -> valid
0x11223344 -> valid
The following should all fail:
0x100111111 -> not valid
0x111001111 -> not valid
octal
The following should all pass:
030000001353 -> valid
030000001354 -> valid
037704570437 -> valid
02110431504 -> valid
037777777777 -> valid
The following should all fail:
047777777777 -> not valid
0377777777777 -> not valid
non specific cases
The following should all fail:
0x20.0x50.0x2 -> not valid
.100.100.100.100 -> not valid
100..100.100.100. -> not valid
100.100.100.100. -> not valid
256.100.100.100.100 -> not valid
100.100.100.100.0x40 -> not valid
037777777778 -> not valid
011377000000000000008 -> not valid
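An expression this long is hard to proofread on one line. As a sketch of an alternative layout, here it is with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The names IPV4 and is_ipv4 are hypothetical, I am using fullmatch() instead of ^ and $, and the decimal branch is my transcription with fixed-length tails.

```python
import re

IPV4 = re.compile(r'''
    # dotted: three "octet." groups, then a final octet
    (?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})\.){3}
    (?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})
  | 0x[\da-f]{8}                 # 32-bit hexadecimal
  | 0[0-3][0-7]{2,10}            # 32-bit octal
  | [1-3]?\d{1,9}                # decimal up to 3999999999
  | 4[01]\d{8} | 42[0-8]\d{7} | 429[0-3]\d{6} | 4294[0-8]\d{5}
  | 42949[0-5]\d{4} | 429496[0-6]\d{3} | 4294967[01]\d{2}
  | 42949672[0-8]\d | 429496729[0-5]
''', re.VERBOSE | re.IGNORECASE)


def is_ipv4(addr):
    """True if the whole string is an IPv4 address in one of the formats above."""
    return IPV4.fullmatch(addr) is not None


for addr in ['192.0.2.235', '0xFF.255.0377.0x12', '0xC00002EC', '037777777777',
             '4294967295', '4294967296', '256.256.256.256', '100.100.100.100.']:
    print(f"{addr} -> {'valid' if is_ipv4(addr) else 'not valid'}")
```

The one thing to watch in verbose mode is that a literal space or # in the pattern would have to be escaped. None of the pieces here contain either.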
Done
I had actually thought I might get a couple challenges done in this post. But, that’s not going to happen. This one is plenty long enough.
And, my regex passes the challenge in question.
Until next time, tap, tap, tap…
Resources
- Python: Explicit line joining
- The many faces of an IP address
- Regex Tuesday Challenge - Week Six (IPv4 Addresses)