A common exercise on regular expression challenge sites is matching IPv4 addresses. Most only look at matching the dotted decimal format. But, there are other formats allowed. In fact, IPv4 addresses may be represented in any notation expressing a 32-bit integer value.

I am going to start with the dotted decimal, then see if I can add the dotted hexadecimal, dotted octal, hexadecimal, decimal and octal representations. I expect this will become one messy regular expression. So, I may cheat and use Python to apply repeated expressions until one matches or they all fail. May use a few extra CPU cycles, but may be a lot easier to understand.

Dotted Decimal

The numeric values in any dotted format are commonly referred to as ‘octets’. That’s because, whether decimal, hexadecimal or octal, they each represent 8 binary digits of the full 32-bit binary IPv4 address.
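To make the octet idea a bit more concrete, here is a minimal sketch that packs four octets into the 32-bit value and prints it in the non-dotted notations. The helper name octets_to_int is only for illustration, and the address is the first dotted decimal test case used in this post.

```python
def octets_to_int(a, b, c, d):
    """Pack four 8-bit octets into one 32-bit integer (hypothetical helper)."""
    for octet in (a, b, c, d):
        if not 0 <= octet <= 255:
            raise ValueError(f"octet out of range: {octet}")
    # Each octet occupies its own 8-bit slice of the 32-bit value.
    return (a << 24) | (b << 16) | (c << 8) | d


value = octets_to_int(192, 0, 2, 235)
print(value)             # decimal: 3221226219
print(f"0x{value:08X}")  # hexadecimal: 0xC00002EB
print(f"0{value:o}")     # octal: 030000001353
```

The same 32 bits sit behind every one of those spellings, which is why they all name the same address.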

Naive Approach

For most of us, our initial thought would be 4 octets of 1-3 digits separated by periods/dots. And, since regexes use . to represent any character, we will need to escape our decimal point, \., in our expression. Something like the following. Since I only want to check if the input is a valid IPv4 address, I am not specifying word boundaries and am using a non-capturing group, (?: ), for the first three elements. Which, by the way, includes the decimal point.

 (?:\d{1,3}\.){3}\d{1,3}

Some Python code to test the above pattern.

N.B. most of the test cases I am using come from the Regex Tuesday Challenge - Week Six. Thank you Callum Macrae.

import re

ddc_ok = [
  '192.0.2.235',
  '99.198.122.146',
  '18.101.25.153',
  '23.71.254.72',
  '100.100.100.100',
  '173.194.34.134',
  '212.58.241.131',
  '46.51.197.88',
]
ddc_no = [
  '256.256.256.256',
  '925.254.255.254'
]

rgx = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

print("\nThe following should all pass:")
for tst in ddc_ok:
  r_tst = rgx.match(tst)
  print(f"\t{tst} -> {'valid' if r_tst else 'not valid'}")

print("\nThe following should all fail:")
for tst in ddc_no:
  r_tst = rgx.match(tst)
  print(f"\t{tst} -> {'valid' if r_tst else 'not valid'}")
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py

The following should all pass:
        192.0.2.235 -> valid
        99.198.122.146 -> valid
        18.101.25.153 -> valid
        23.71.254.72 -> valid
        100.100.100.100 -> valid
        173.194.34.134 -> valid
        212.58.241.131 -> valid
        46.51.197.88 -> valid

The following should all fail:
        256.256.256.256 -> valid
        925.254.255.254 -> valid

A Little Less Naive

And we can see that the naive approach does not work. Our regular expression allows any digit in any of the three locations in each decimal value. But the values can only be 0-255. So we need something more restrictive than \d{1,3}. So, again let’s go simple and limit the first of 3 possible digits to 1 or 2. Something like the following.

 (?:[12]*\d{1,2}\.){3}[12]*\d{1,2}

After modifying the above code accordingly, the results haven’t much improved.

(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py

The following should all pass:
        192.0.2.235 -> valid
        99.198.122.146 -> valid
        18.101.25.153 -> valid
        23.71.254.72 -> valid
        100.100.100.100 -> valid
        173.194.34.134 -> valid
        212.58.241.131 -> valid
        46.51.197.88 -> valid

The following should all fail:
        256.256.256.256 -> valid
        925.254.255.254 -> not valid

Looks Like the Solution

And you are of course saying that if the first of three digits is a 2, the range for the 2nd digit is 0-5 inclusive. And, if that 2nd digit is a 5, then the 3rd also has to be in the range 0-5 inclusive. We will start by looking for a 25[0-5], then a 2[0-4]\d and test accordingly. But we will also have to check for a 0 or 1 as the first of three digits. That way we can exclude anything above 2xx. And, because we are using alternation (|) to check for various combinations we need to add a few more non-capturing groups. And, we want at least one digit in each octet.

Note: [01]? asks for a 0 or a 1 or nothing. The ? says zero or one of whatever precedes it, i.e. it is optional.

 (?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)

(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py

The following should all pass:
        192.0.2.235 -> valid
        99.198.122.146 -> valid
        18.101.25.153 -> valid
        23.71.254.72 -> valid
        100.100.100.100 -> valid
        173.194.34.134 -> valid
        212.58.241.131 -> valid
        46.51.197.88 -> valid

The following should all fail:
        256.256.256.256 -> not valid
        925.254.255.254 -> not valid

And, I think that does the trick. Though, we must note that if we are searching for valid IPv4 addresses in a text document setting we would need to add word boundaries to our regex. And, we should also likely allow for leading zeroes in each dotted decimal value. But I will ignore that for the moment.
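One way to double-check the three attempts is to run each octet sub-pattern over every number from 0 to 999 and see what gets through. This is a rough sketch, not part of the original test module: the function name accepted_values is made up, and it only feeds in plain decimal strings without leading zeroes.

```python
import re

# Octet sub-patterns taken from the three attempts above.
octet_patterns = {
    'naive': r'\d{1,3}',
    'less naive': r'[12]*\d{1,2}',
    'final': r'25[0-5]|2[0-4]\d|[01]?\d\d?',
}


def accepted_values(pattern, limit=1000):
    """Return every n below limit whose plain decimal form the pattern fully matches."""
    rgx = re.compile(pattern)
    return [n for n in range(limit) if rgx.fullmatch(str(n))]


for name, pattern in octet_patterns.items():
    ok = accepted_values(pattern)
    print(f"{name}: {len(ok)} values accepted, largest is {max(ok)}")
```

The naive pattern lets all 1000 values through, the second attempt stops at 299, and only the last one stops at 255. That lines up with 256.256.256.256 passing the first two tests and failing the third.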

Dotted Hexadecimal and Octal

I am not going to mess with all the naive versions discussed above. But, depending how I make out, there may be a failure or two shown and discussed.

Well, likely going to be easier than I thought. Decimal 255 => hex 0xff => octal 0377. Convention says 0x before 2 hexadecimal digits. And 0 before 3 octal digits. For dotted formats, if 3 or fewer digits and no x, it is considered decimal.

PS R:\learn\py_play> perl -e "printf('%o', 0xff);"
377

Which, it turns out, makes adding dotted hex and dotted octal surprisingly straightforward. The hex pattern is simply 0x[\da-f]{2}. That is two of any decimal digit (\d) or the letters a to f inclusive. And the octal: 0[0-3][0-7]{2}, i.e. 0 to 3 inclusive followed by 2 digits between 0 and 7 inclusive. I am ignoring the leading 0x or 0 in the explanations of the regexes.
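A quick sanity check of those two octet patterns, sketched under the assumption that every octet is written in the conventional fixed width (0x plus 2 hex digits, 0 plus 3 octal digits). The variable names are mine.

```python
import re

hex_octet = re.compile(r'0x[\da-f]{2}', re.IGNORECASE)
oct_octet = re.compile(r'0[0-3][0-7]{2}')

# Every value 0-255, written the conventional way, should be accepted.
assert all(hex_octet.fullmatch(f"0x{n:02x}") for n in range(256))
assert all(oct_octet.fullmatch(f"0{n:03o}") for n in range(256))

# The first value past the octet range should be rejected in both notations.
print(hex_octet.fullmatch(f"0x{256:02x}"))  # 0x100 -> None
print(oct_octet.fullmatch(f"0{256:03o}"))   # 0400 -> None
```

The [0-3] in the octal pattern is what caps the value at 0377, since the next octal number up, 0400, is already decimal 256.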

Test Code

I did refactor the test code somewhat. I added a dictionary keyed on an address format. Each key’s value is a list providing a format name/description, the variable for the valid tests and the variable for the not valid format tests. Then I provide a list of the tests I currently wish to run. I won’t show all the test cases. They will be displayed in the module’s output.

And do note the continuation operator, \, for the string defining the regex. And, because the test cases included capitalized hex digits, I needed to ignore case.

Also, in testing, one of the non specific cases (trailing decimal point/dot) passed when it should have failed. And, that’s because I was not using word boundaries. So, I have now included them in my regex. But, because of the way my tests are executed, I need to use start and end string delimiters, ^ and $, instead of, specifically, word boundaries, \b.
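The trailing dot problem is easy to reproduce in isolation. The sketch below uses only the decimal octet pattern to keep things short, and the names unanchored and anchored are just for this example. It also shows re.fullmatch, which is an alternative to writing ^ and $ by hand.

```python
import re

octet = r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?)'
unanchored = re.compile(rf'(?:{octet}\.){{3}}{octet}')
anchored = re.compile(rf'^(?:{octet}\.){{3}}{octet}$')

trailing_dot = '100.100.100.100.'
print(bool(unanchored.match(trailing_dot)))      # True: match() is happy with a valid prefix
print(bool(anchored.match(trailing_dot)))        # False: $ insists the string ends here
print(bool(unanchored.fullmatch(trailing_dot)))  # False: fullmatch() needs the whole string

# One difference worth knowing: $ also matches just before a final newline.
with_newline = '100.100.100.100\n'
print(bool(anchored.match(with_newline)))        # True
print(bool(unanchored.fullmatch(with_newline)))  # False
```

For the test cases in this post either approach gives the same answers. The newline case only matters if the input has not been stripped first.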

... ...
d_tst = {
  'ddc': ['dotted decimal', ddc_ok, ddc_no],
  'dhx': ['dotted hexadecimal', dhx_ok, dhx_no],
  'doc': ['dotted octal', doc_ok, doc_no],
  'dec': ['decimal', dec_ok, dec_no],
  'hex': ['hexadecimal', hex_ok, hex_no],
  'oct': ['octal', oct_ok, oct_no],
  'mix': ['mixed octets', mix_ok, None],
  'gen': ['non specific cases', None, gen_no]
}

rgx = re.compile(r'^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})\.){3}'\
  r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})$', re.IGNORECASE)

c_tsts = ['ddc', 'dhx', 'doc', 'mix', 'gen']

for f_tst in c_tsts:
  print(f"\n{d_tst[f_tst][0]}")
  if d_tst[f_tst][1] is not None:
    print("\tThe following should all pass:")
    for tst in d_tst[f_tst][1]:
      r_tst = rgx.match(tst)
      print(f"\t\t{tst} -> {'valid' if r_tst else 'not valid'}")

  if d_tst[f_tst][2] is not None:
    print("\tThe following should all fail:")
    for tst in d_tst[f_tst][2]:
      r_tst = rgx.match(tst)
      print(f"\t\t{tst} -> {'valid' if r_tst else 'not valid'}")
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py

dotted decimal
        The following should all pass:
                192.0.2.235 -> valid
                99.198.122.146 -> valid
                18.101.25.153 -> valid
                23.71.254.72 -> valid
                100.100.100.100 -> valid
                173.194.34.134 -> valid
                212.58.241.131 -> valid
                46.51.197.88 -> valid
        The following should all fail:
                256.256.256.256 -> not valid
                925.254.255.254 -> not valid

dotted hexadecimal
        The following should all pass:
                0xC0.0x00.0x02.0xEB -> valid
                0xFF.0x12.0xF1.0x1F -> valid
                0x11.0x22.0x33.0x44 -> valid
        The following should all fail:
                0x100.0x11.0x11.0x11 -> not valid
                0x11.0x100.0x11.0x11 -> not valid
                0xx20.0x20.0x20.0x20 -> not valid

dotted octal
        The following should all pass:
                0300.0000.0002.0353 -> valid
                0377.0377.0377.0377 -> valid
                0100.0100.0100.0100 -> valid
                0177.0.0.01 -> valid
        The following should all fail:
                0180.0100.0100.0100 -> not valid
                0100.0100.0109.0100 -> not valid

mixed octets
        The following should all pass:
                0300.19.0.2 -> valid
                99.0377.4.0002 -> valid
                0xFF.255.0377.0x12 -> valid

non specific cases
        The following should all fail:
                0x20.0x50.0x2 -> not valid
                .100.100.100.100 -> not valid
                100..100.100.100. -> not valid
                100.100.100.100. -> not valid
                256.100.100.100.100 -> not valid
                100.100.100.100.0x40 -> not valid

32 Bit Numbers

As mentioned above, an IPv4 address is actually a 32-bit binary number. All other representations are conveniences (to some extent or other) for humans interacting with code and/or machines.

Now the maximum value for a 32 bit binary number can be represented in hex as 0xffffffff. Each f represents 4 bits of the 32 bit binary number. Specifically 1111. And, for octal and decimal we get the following equivalent values.

PS R:\learn\py_play> perl -e "printf('%x', 0b11111111111111111111111111111111);"
ffffffff
PS R:\learn\py_play> perl -e "printf('%o', 0xffffffff);"
37777777777
PS R:\learn\py_play> perl -e "print(0xffffffff);"
4294967295
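For anyone without Perl handy, roughly the same check can be done in Python. This is just a sketch using the built-in format specifiers, and max_ipv4 is a name I picked for the example.

```python
# Build the value from 32 one-bits rather than typing them all out.
max_ipv4 = int('1' * 32, 2)

print(f"{max_ipv4:x}")  # ffffffff
print(f"{max_ipv4:o}")  # 37777777777
print(max_ipv4)         # 4294967295
print(max_ipv4 == 2**32 - 1 == 0xffffffff)  # True
```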

Hex and Octal

I will start with the non-decimal formats as they are once again pretty straightforward. Don’t see that being the case for the decimal format.

For both we are pretty much looking at something similar to the dotted format, but with more digits. And, unfortunately, as we are adding more alternative patterns, we will need another non-capturing group to enclose the new alternations. Something like the following on multiple lines for readability.

rgx = re.compile(r'^(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})\.){3}'\
  r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})'\
  r'|0x[\da-f]{8}|0[0-3][0-7]{2,10})$', re.IGNORECASE)

c_tsts = ['ddc', 'dhx', 'doc', 'mix', 'hex', 'oct', 'gen']

I have also added some extra tests for the above regex. And, the regex seems to work as intended.

(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py

dotted decimal
        The following should all pass:
                192.0.2.235 -> valid
                99.198.122.146 -> valid
                18.101.25.153 -> valid
                23.71.254.72 -> valid
                100.100.100.100 -> valid
                173.194.34.134 -> valid
                212.58.241.131 -> valid
                46.51.197.88 -> valid
        The following should all fail:
                256.256.256.256 -> not valid
                925.254.255.254 -> not valid

dotted hexadecimal
        The following should all pass:
                0xC0.0x00.0x02.0xEB -> valid
                0xFF.0x12.0xF1.0x1F -> valid
                0x11.0x22.0x33.0x44 -> valid
        The following should all fail:
                0x100.0x11.0x11.0x11 -> not valid
                0x11.0x100.0x11.0x11 -> not valid
                0xx20.0x20.0x20.0x20 -> not valid

dotted octal
        The following should all pass:
                0300.0000.0002.0353 -> valid
                0377.0377.0377.0377 -> valid
                0100.0100.0100.0100 -> valid
                0177.0.0.01 -> valid
        The following should all fail:
                0180.0100.0100.0100 -> not valid
                0100.0100.0109.0100 -> not valid

mixed octets
        The following should all pass:
                0300.19.0.2 -> valid
                99.0377.4.0002 -> valid
                0xFF.255.0377.0x12 -> valid

hexadecimal
        The following should all pass:
                0xC00002EC -> valid
                0xFF12F11F -> valid
                0x11223344 -> valid
        The following should all fail:
                0x100111111 -> not valid
                0x111001111 -> not valid

octal
        The following should all pass:
                030000001353 -> valid
                030000001354 -> valid
                037704570437 -> valid
                02110431504 -> valid
                037777777777 -> valid
        The following should all fail:
                047777777777 -> not valid
                0377777777777 -> not valid

non specific cases
        The following should all fail:
                0x20.0x50.0x2 -> not valid
                .100.100.100.100 -> not valid
                100..100.100.100. -> not valid
                100.100.100.100. -> not valid
                256.100.100.100.100 -> not valid
                100.100.100.100.0x40 -> not valid
                037777777778 -> not valid
                011377000000000000008 -> not valid
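The single expression is getting long. The alternative floated at the start of the post, separate patterns tried in turn until one matches or they all fail, might look something like the sketch below. The patterns are the ones developed so far. The names PATTERNS and first_matching_format are invented for the example, and it uses fullmatch rather than ^ and $.

```python
import re

OCTET = r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})'

# One (label, compiled pattern) pair per notation, tried in this order.
PATTERNS = [
    ('dotted', re.compile(rf'(?:{OCTET}\.){{3}}{OCTET}', re.IGNORECASE)),
    ('hexadecimal', re.compile(r'0x[\da-f]{8}', re.IGNORECASE)),
    ('octal', re.compile(r'0[0-3][0-7]{2,10}')),
]


def first_matching_format(text):
    """Return the label of the first pattern that matches all of text, else None."""
    for name, rgx in PATTERNS:
        if rgx.fullmatch(text):
            return name
    return None


for tst in ['0xFF.255.0377.0x12', '0xC00002EC', '037777777777', '047777777777']:
    print(f"{tst} -> {first_matching_format(tst) or 'not valid'}")
```

It costs a few extra match attempts per input, but each pattern stays short enough to read, and the result says which notation was recognised.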

Decimal

This one is going to take some work. Determining when the number is too large may not be possible. At least not easily.

We know our maximum permitted decimal value is 4294967295. So for sure the following will be valid.

 [1-3]?\d{0,9}

Adding that to our regex and testing the potential decimal only addresses, I get the following.

decimal
        The following should all pass:
                3221226219 -> valid
                2130706433 -> valid
                287454020 -> valid
                4279431455 -> not valid
                4294967295 -> not valid
        The following should all fail:
                4294967296 -> not valid

So, getting there. I think we can also add 4[01]\d{0,8}|42[0-8]\d{0,7} (the [01] so that values starting with 40 are covered as well as those starting with 41). Quick test.

decimal
        The following should all pass:
                3221226219 -> valid
                2130706433 -> valid
                287454020 -> valid
                4279431455 -> valid
                4294967295 -> not valid
        The following should all fail:
                4294967296 -> not valid

Still have a few cases to deal with. And I expect this is going to get very lengthy. There will be a large number of alternatives for the decimal address format. Pretty much one for each digit. For example the next one we check for initial digits of 429 then limit what the next digit can be. Then 4294, etc.

I will skip the intermediate steps. I am only going to show the tail end of the regex dealing specifically with a decimal number address format. Split over multiple lines for readability.

 |[1-3]?\d{0,9}|4[01]\d{0,8}|42[0-8]\d{0,7}|429[0-3]\d{0,6}|4294[0-8]\d{0,5}
 |42949[0-5]\d{0,4}|429496[0-6]\d{0,3}|4294967[01]\d{0,2}|42949672[0-8]\d
 |429496729[0-5]
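With this many alternatives it is easy to leave a hole, so a cross-check against plain integer arithmetic seems worthwhile. The sketch below is only that, a sketch. It copies the decimal alternatives with two assumptions of mine spelled out in the comment, and it probes the values on either side of each point where one alternative hands over to the next. The names fits_32_bits and branch_starts are made up for the example.

```python
import re

# Decimal alternatives as I read them from the post. Two assumptions of mine:
# the quantifier after 429[0-3] is {0,6}, and the second alternative is 4[01]
# so that values starting with 40 are covered.
DEC_ALTS = (
    r'[1-3]?\d{0,9}|4[01]\d{0,8}|42[0-8]\d{0,7}|429[0-3]\d{0,6}|4294[0-8]\d{0,5}'
    r'|42949[0-5]\d{0,4}|429496[0-6]\d{0,3}|4294967[01]\d{0,2}|42949672[0-8]\d'
    r'|429496729[0-5]'
)
dec_rgx = re.compile(rf'^(?:{DEC_ALTS})$')


def fits_32_bits(text):
    """Arithmetic reference check: plain ASCII digits, no larger than 0xFFFFFFFF."""
    return text.isascii() and text.isdigit() and int(text) <= 0xFFFFFFFF


# First value handled by each alternative, plus the first value that is too big.
branch_starts = [0, 4000000000, 4100000000, 4200000000, 4290000000, 4294000000,
                 4294900000, 4294960000, 4294967000, 4294967200, 4294967290,
                 4294967296]
samples = sorted({n + d for n in branch_starts for d in (-1, 0, 1) if n + d >= 0})

# Any value listed here is one the regex and the arithmetic disagree about.
disagreements = [n for n in samples
                 if bool(dec_rgx.match(str(n))) != fits_32_bits(str(n))]
print(disagreements)  # []
```

An empty list means the two checks agree on every probed value. The probe only covers canonical str(n) forms, so inputs like an empty string or zero-padded numbers would need their own test cases.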

Final Regex and Tests

Here’s my final regex and the test/validation results.

rgx = re.compile(r'^(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})\.){3}'\
  r'(?:25[0-5]|2[0-4]\d|[01]?\d\d?|0x[\da-f]{2}|0[0-3][0-7]{2})'\
  r'|0x[\da-f]{8}|0[0-3][0-7]{2,10}'\
  r'|[1-3]?\d{0,9}|4[01]\d{0,8}|42[0-8]\d{0,7}|429[0-3]\d{0,6}|4294[0-8]\d{0,5}'\
  r'|42949[0-5]\d{0,4}|429496[0-6]\d{0,3}|4294967[01]\d{0,2}|42949672[0-8]\d'\
  r'|429496729[0-5])$', re.IGNORECASE)
(g4p-3.11) PS R:\learn\regex_ai\blog> python ipv4.py

dotted decimal
        The following should all pass:
                192.0.2.235 -> valid
                99.198.122.146 -> valid
                18.101.25.153 -> valid
                23.71.254.72 -> valid
                100.100.100.100 -> valid
                173.194.34.134 -> valid
                212.58.241.131 -> valid
                46.51.197.88 -> valid
        The following should all fail:
                256.256.256.256 -> not valid
                925.254.255.254 -> not valid

dotted hexadecimal
        The following should all pass:
                0xC0.0x00.0x02.0xEB -> valid
                0xFF.0x12.0xF1.0x1F -> valid
                0x11.0x22.0x33.0x44 -> valid
        The following should all fail:
                0x100.0x11.0x11.0x11 -> not valid
                0x11.0x100.0x11.0x11 -> not valid
                0xx20.0x20.0x20.0x20 -> not valid

dotted octal
        The following should all pass:
                0300.0000.0002.0353 -> valid
                0377.0377.0377.0377 -> valid
                0100.0100.0100.0100 -> valid
                0177.0.0.01 -> valid
        The following should all fail:
                0180.0100.0100.0100 -> not valid
                0100.0100.0109.0100 -> not valid

mixed octets
        The following should all pass:
                0300.19.0.2 -> valid
                99.0377.4.0002 -> valid
                0xFF.255.0377.0x12 -> valid

decimal
        The following should all pass:
                3221226219 -> valid
                2130706433 -> valid
                287454020 -> valid
                4279431455 -> valid
                4294967295 -> valid
        The following should all fail:
                4294967296 -> not valid

hexadecimal
        The following should all pass:
                0xC00002EC -> valid
                0xFF12F11F -> valid
                0x11223344 -> valid
        The following should all fail:
                0x100111111 -> not valid
                0x111001111 -> not valid

octal
        The following should all pass:
                030000001353 -> valid
                030000001354 -> valid
                037704570437 -> valid
                02110431504 -> valid
                037777777777 -> valid
        The following should all fail:
                047777777777 -> not valid
                0377777777777 -> not valid

non specific cases
        The following should all fail:
                0x20.0x50.0x2 -> not valid
                .100.100.100.100 -> not valid
                100..100.100.100. -> not valid
                100.100.100.100. -> not valid
                256.100.100.100.100 -> not valid
                100.100.100.100.0x40 -> not valid
                037777777778 -> not valid
                011377000000000000008 -> not valid

Done

I had actually thought I might get a couple challenges done in this post. But, that’s not going to happen. This one is plenty long enough.

And, my regex passes the challenge in question.

Until next time, tap, tap, tap…

Resources