Matching a US Telephone Number With egrep Using Regular Expressions

This is the follow-up to my post on searching for social security numbers. US telephone numbers are commonly written in the following formats, all of which can be matched with a single regular expression.
(215) 555-1212
215-555-1212
215 555 1212
215.555.1212
2155551212

The phone number can be broken down into a series of character classes. In egrep, character classes are written inside square brackets. The character class [0-9] represents a single digit from 0 through 9. You can expand upon this and match a run of digits from the character class by following it with a count inside curly braces. [3-7]{3} matches exactly three consecutive digits, each in the range 3 through 7. We will use this notation to build the three parts of the phone number.
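To see the brace quantifier in action, here is a minimal sketch that feeds a few made-up strings to grep -E (equivalent to egrep) and keeps only the lines containing three consecutive digits from 3 through 7. The sample strings and variable names are assumptions chosen for illustration.

```shell
# Sample input: one candidate string per line (made-up values).
samples='345
999
37
a456b'

# -E enables extended regular expressions, the same syntax egrep uses.
# [3-7]{3} needs three consecutive digits, each between 3 and 7.
matches=$(printf '%s\n' "$samples" | grep -E '[3-7]{3}')

printf '%s\n' "$matches"
```

Note that a456b is kept: grep looks for the pattern anywhere in the line, not only at the start.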

You can also build character classes containing specific characters or symbols. After the first three digits of the phone number there are a few possibilities for the next character. It can be a right paren, a hyphen, a space, or a period. It can also be none of these. The ? operator matches zero or one instance of whatever precedes it. Putting these two concepts together, we would build a character class and use the ? operator: [-). ]? The hyphen goes first in the class because a hyphen placed between two other characters is read as a range.
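Here is a small sketch of an optional separator class, written with the hyphen first so that grep treats it as a literal character and not as a range operator. The sample lines are made up, and -x is used only so that the one bad separator is visibly rejected.

```shell
# Three digits, one possible separator (or none), three more digits.
samples='215)555
215-555
215 555
215.555
215555
215/555'

# [-). ]? allows one hyphen, right paren, period, or space, or nothing at all.
# -x requires the whole line to match, so 215/555 is rejected.
matches=$(printf '%s\n' "$samples" | grep -E -x '[0-9]{3}[-). ]?[0-9]{3}')

# Count the surviving lines.
count=$(printf '%s\n' "$matches" | grep -c '')

printf '%s\n' "$matches"
```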

OK, let’s combine all of these concepts to build our regex. This is just one solution that I’ve come up with to match a phone number. The beauty of Unix is that there are many other correct solutions, some of which are probably better than mine.

egrep "[(]?[2-9][0-9]{2}[)]?[-. ]?[2-9][0-9]{2}[-. ]?[0-9]{4}" filename

This reads: zero or one left paren, followed by a single digit in the range two through nine, followed by two digits in the range zero through nine, followed by zero or one right paren, followed by zero or one hyphen, period, or space, followed by a single digit in the range two through nine, followed by two digits in the range zero through nine, followed by zero or one hyphen, period, or space, followed by four digits in the range zero through nine. The right paren gets its own optional class because (215) 555-1212 has both a right paren and a space after the area code, which a single optional character cannot cover.
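As a quick check, the sketch below runs a version of the pattern, with the right paren and the separator as separate optional pieces, against the five formats listed at the top of the post plus two made-up lines that should be rejected. It uses grep -E, which is equivalent to egrep; the sample data and variable names are assumptions for illustration.

```shell
# The five formats from the top of the post, plus two lines that should
# not match because the area code or exchange starts with a 1.
samples='(215) 555-1212
215-555-1212
215 555 1212
215.555.1212
2155551212
115-555-1212
215-155-1212'

# Optional left paren, area code, optional right paren, optional separator,
# exchange, optional separator, last four digits.
regex='[(]?[2-9][0-9]{2}[)]?[-. ]?[2-9][0-9]{2}[-. ]?[0-9]{4}'

matches=$(printf '%s\n' "$samples" | grep -E "$regex")
count=$(printf '%s\n' "$matches" | grep -c '')

printf '%s\n' "$matches"
```

Only the first five lines should be printed. Single straight quotes around the pattern keep the shell from interpreting any of the special characters.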

Until the explosion of cell phones, US area codes followed the format: a digit from 2-9, then a 0 or 1, then a digit from 0-9. When additional area codes were needed to accommodate the growing number of phone numbers, the requirement that the middle digit be a 0 or 1 was dropped.
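The difference between the two area code rules can be sketched as two character class sequences. The sample codes below are values chosen only for illustration, and the variable names are assumptions.

```shell
# Sample three-digit codes, one per line (chosen for illustration).
samples='215
908
267
484'

# Old rule: first digit 2-9, middle digit 0 or 1, last digit 0-9.
old_style=$(printf '%s\n' "$samples" | grep -E -x '[2-9][01][0-9]')

# Current rule as used in this post: first digit 2-9, then any two digits.
new_style=$(printf '%s\n' "$samples" | grep -E -x '[2-9][0-9]{2}')

old_count=$(printf '%s\n' "$old_style" | grep -c '')
new_count=$(printf '%s\n' "$new_style" | grep -c '')

printf 'old rule: %s, current rule: %s\n' "$old_count" "$new_count"
```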

If you have any questions about this article, please post in the comments section.

Update:
I updated the regex thanks to the feedback from Boris. The first and fourth digits cannot be a zero or one, so I created two separate character classes to accommodate that requirement.

Searching for social security numbers in a file using a regular expression and egrep

egrep is a version of grep that supports extended regular expressions. egrep can be used to find all social security numbers in a file using a basic regex. All US social security numbers have the format: 123-45-6789. This can be written as a regex made up of a character class for three numeric digits, a dash, a character class for two numeric digits, a dash, and finally a character class for four numeric digits.


egrep "[0-9]{3}[-][0-9]{2}[-][0-9]{4}" file

If the file contains social security numbers without dashes, this regex will not match them. To improve upon the regex, you can use the ? operator, which matches 0 or 1 occurrence of the preceding character class.

egrep "[0-9]{3}[-]?[0-9]{2}[-]?[0-9]{4}" file

Placing the dash in its own character class is not required. I find that it makes the regex easier to read.
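The two patterns can be compared side by side on a few made-up lines. This is only a sketch: the sample values and variable names are assumptions, and grep -E is used as the equivalent of egrep.

```shell
# Made-up sample lines: dashed, undashed, and one with the dashes misplaced.
samples='123-45-6789
123456789
12-345-6789'

# Dashes required.
strict=$(printf '%s\n' "$samples" | grep -E -c '[0-9]{3}[-][0-9]{2}[-][0-9]{4}')

# Dashes optional.
loose=$(printf '%s\n' "$samples" | grep -E -c '[0-9]{3}[-]?[0-9]{2}[-]?[0-9]{4}')

# Same as the optional version, with the dash outside a character class.
plain=$(printf '%s\n' "$samples" | grep -E -c '[0-9]{3}-?[0-9]{2}-?[0-9]{4}')

printf 'strict: %s, loose: %s, plain: %s\n' "$strict" "$loose" "$plain"
```

One thing to keep in mind with the optional-dash version is that it will also match any run of nine or more digits, so expect some false positives in files that contain other long numbers.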

My next post will show you how to match a US telephone number using a slightly more complicated regular expression.