Matching a US Telephone Number With egrep Using Regular Expressions

This is the follow up to my post searching for social security numbers. US telephone numbers use the following format that can easily be matched with a regular expression.
(215) 555-1212
215 555 1212

The phone number can be broken down into a series of character classes. Using egrep, character classes are written inside of square brackets. The character class [0-9] represents a single number from 0-9. You can expand upon this and match a series of numbers from the character class by following it with a number inside of curly braces. [3-7]{3} matches exactly three numbers in the range of 3 through 7. We will use this notation to build the three parts of the phone number.

You can also build character classes containing specific characters or symbols. After the first three numbers of the phone number there are a few possibilities for the next character. It can be a right paren, a hyphen, a space, or a period. It can also be none of these. The ? operator matches exactly zero or one instance. Putting these two concepts together, we would build a character class and use the ? operator: [)- .]?

OK, let’s combine all of these concepts to build our regex. This is just one solution that I’ve come up with to match a phone number. The beauty of Unix is that there are many other solutions that are correct; some of which are probably better than my solution.

egrep “[(]?[2-9]{1}[0-9]{2}[)-. ]?[2-9]{1}[0-9]{2}[-. ]?[0-9]{4}” filename

This reads: zero or one left paren followed by a single number in the range two through nine, followed by two numbers in the range zero through nine, followed by zero or one right paren, hyphen, period, or space, followed by a single number in the range two through nine, followed by two numbers in the range zero through nine, followed by zero or one hyphen, period, or space followed by four numbers in the range zero through nine.

Until the explosion of cell phones, US area codes followed the format: number from 2-9, a 0 or 1, followed by a number from 0-9. When additional area codes were needed to accommodate the growing number of phone numbers, the requirement that the middle digit be a 0 or 1 was dropped.

If you have any questions about this article, please post in the comments section.

I updated the regex thanks to the feedback from Boris. The first and fourth digits cannot contain a zero or one so I created two separate character classes to accommodate that requirement.

  • Boris

    You may want to update the regex so that only the first and fourth digit do not contain a 0 or a 1. As an example, the regex you provided would not work for (216)830-xxxx. Otherwise, looks good!

    • admin

      Good catch. I intended to create separate character classes for the first and fourth digit. You will see that I used the range 2-9 but didn’t expand it out further for the rest of the digits that can contain a 0 or 1.

  • Mike B

    Why use [0-9] as a range when \d works? Just given that it sums up the entire class in fewer characters…

    • admin

      Good point regarding \d. Since I wrote this as an introduction to regular expressions, I found the numeric range [0-9] to be more readable. We could certainly obfuscate the regex 🙂