This is the follow up to my post searching for social security numbers. US telephone numbers use the following format that can easily be matched with a regular expression.
(215) 555-1212
215-555-1212
215 555 1212
215.555.1212
2155551212
The phone number can be broken down into a series of character classes. Using egrep, character classes are written inside of square brackets. The character class [0-9] represents a single number from 0-9. You can expand upon this and match a series of numbers from the character class by following it with a number inside of curly braces. [3-7]{3} matches exactly three numbers in the range of 3 through 7. We will use this notation to build the three parts of the phone number.
You can also build character classes containing specific characters or symbols. After the first three numbers of the phone number there are a few possibilities for the next character. It can be a right paren, a hyphen, a space, or a period. It can also be none of these. The ? operator matches exactly zero or one instance. Putting these two concepts together, we would build a character class and use the ? operator: [)- .]?
OK, let’s combine all of these concepts to build our regex. This is just one solution that I’ve come up with to match a phone number. The beauty of Unix is that there are many other solutions that are correct; some of which are probably better than my solution.
egrep “[(]?[2-9]{1}[0-9]{2}[)-. ]?[2-9]{1}[0-9]{2}[-. ]?[0-9]{4}” filename
This reads: zero or one left paren followed by a single number in the range two through nine, followed by two numbers in the range zero through nine, followed by zero or one right paren, hyphen, period, or space, followed by a single number in the range two through nine, followed by two numbers in the range zero through nine, followed by zero or one hyphen, period, or space followed by four numbers in the range zero through nine.
Until the explosion of cell phones, US area codes followed the format: number from 2-9, a 0 or 1, followed by a number from 0-9. When additional area codes were needed to accommodate the growing number of phone numbers, the requirement that the middle digit be a 0 or 1 was dropped.
If you have any questions about this article, please post in the comments section.
Update:
I updated the regex thanks to the feedback from Boris. The first and fourth digits cannot contain a zero or one so I created two separate character classes to accommodate that requirement.