An Example of Regular Expressions
2010-11-13
As far as I can tell, Perl stands on two legs: regular expressions and hash tables. With them, Perl is a powerful tool for solving real problems; without them, Perl is just Shell++.
- Regular Expressions (RE)
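Most scripting languages support regular expressions: PHP, Perl, Python, Ruby. As far as I know, their syntax is very similar, because they all inherit and extend Perl's. Perl did not invent regular expressions (grep, sed and awk had them earlier), but it is the origin of the flavor most languages use today.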
For example, PHP has regular expressions and adds one notable convenience: a single call can apply several patterns at once, which Perl's s/// cannot do. Here is an example of PHP's preg_replace function, which accepts arrays as well as scalar values.

<?php
$patterns = array ('/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/',
                   '/^\s*{(\w+)}\s*=/');
$replace = array ('\3/\4/\1\2', '$\1 =');
echo preg_replace($patterns, $replace, '{startDate} = 1999-5-27');
?>

Output: $startDate = 5/27/1999

Both $patterns and $replace can be arrays. Perl has no built-in equivalent: s/// takes one pattern at a time, so applying several patterns needs a loop.
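A rough Perl counterpart can be sketched with a loop over pattern/replacement pairs. This is only an illustration, not a standard idiom: the names @rules and replace_all are made up here, and each replacement is a code ref that receives the capture groups as arguments.

```perl
use strict;
use warnings;

# Pattern/replacement pairs, applied in order (hypothetical structure,
# mirroring the two arrays passed to PHP's preg_replace).
my @rules = (
    # 1999-5-27 -> 5/27/1999
    [ qr/(19|20)(\d{2})-(\d{1,2})-(\d{1,2})/,
      sub { my @c = @_; "$c[2]/$c[3]/$c[0]$c[1]" } ],
    # {startDate} = -> $startDate =
    [ qr/^\s*\{(\w+)\}\s*=/,
      sub { my @c = @_; "\$$c[0] =" } ],
);

# Apply every rule to a copy of the text and return the result.
sub replace_all {
    my ($text) = @_;
    for my $rule (@rules) {
        my ( $re, $to ) = @$rule;
        # /e evaluates the replacement as code; captures are passed explicitly.
        $text =~ s/$re/$to->($1, $2, $3, $4)/ge;
    }
    return $text;
}

print replace_all('{startDate} = 1999-5-27'), "\n";
```

The loop costs a few extra lines compared with the PHP call, but the rules stay ordinary Perl data, so they can be built or reordered at run time.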
However, that doesn't mean regular expressions in PHP are better than in Perl.
Actually, Perl's m//, s/// and tr/// operators, together with functions such as grep and map, make parsing easier and quicker than in other languages.
- Hash Table
Perl's hash table (%hash, or a reference to one: $hash_ref) is different from Java's Hashtable. Java exposes every low-level detail, and its data structures are (sometimes) needlessly complex. Perl's hashes, like its arrays, are excellent: they make things easy and intuitive.
Here we focus on regular expressions. Below is a comparison of PHP's preg_* functions and their Perl counterparts.
| PHP | Perl |
|---|---|
| preg_match | m// |
| preg_replace | s/// (tr/// for character-by-character replacement) |
| preg_filter | s/// combined with grep |
| preg_grep | grep with m// |
| preg_match_all | m//g |
| preg_quote | quotemeta, \Q...\E |
| preg_split | split |
From the table, we can see that Perl's regular expression support is clearer and more compact.
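To make the Perl side concrete, here is a small sketch that exercises Perl's built-in regex tools on a made-up input string. The sample data and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $line = 'name=alice; name=bob; age=30';    # made-up sample input

# m//g in list context returns every capture (compare preg_match_all).
my @names = $line =~ /name=(\w+)/g;

# s///g on a copy replaces every match (compare preg_replace).
( my $masked = $line ) =~ s/\d+/N/g;

# split takes a regex as its separator (compare preg_split).
my @fields = split /\s*;\s*/, $line;

# grep with a regex filters a list (compare preg_grep).
my @only_names = grep { /^name=/ } @fields;

# quotemeta escapes regex metacharacters (compare preg_quote).
my $literal = quotemeta('a.b*c');

print "@names\n";
print "$masked\n";
print scalar(@fields), " fields, ", scalar(@only_names), " names\n";
print "$literal\n";
```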
CPAN has many modules that extend regular expressions for parsing and extracting data; those for parsing XML, for example, normally use the SAX and DOM methods.
Here is a simple example that I wrote: download a web page from craigslist.org, then parse the page to extract data.
(1) First, we download the web page with the generic command "wget".
wget http://vancouver.en.craigslist.ca/web/
(2) Second, once the page has been loaded into memory as $html, we can use Perl's regular expressions to extract data, as follows:
if ( $html =~ m{
    Date:
    (.*?)           # Date
    <br
    (?:.*?)
    Reply\s+to:
    (?:.*?)
    <a\s(?:.*?)>
    (.*?)           # email
    </a>
    (?:.*?)
    (.*?)           # content
}sgix ) {
    my ( $date, $email, $t3 ) = ( $1, $2, $3 );
    ...
}

In this example we need three pieces of information: phone number, URL, and email address. The following subroutines do the job and give accurate results.
Regular expressions are a perfect fit for this kind of task.
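Before going through the subroutines, here is a cut-down version of the pattern run on a made-up fragment, as a sketch only. The sample HTML, the address in it, and the helper name extract_post are all assumptions; a real craigslist page may be laid out differently.

```perl
use strict;
use warnings;

# Made-up fragment shaped like the posting header the pattern expects.
my $sample = <<'HTML';
Date: 2010-11-13, 9:15AM PST<br>
Reply to: <a href="mailto:sale-123@example.org">sale-123@example.org</a>
<hr>
Need a small web site built.
HTML

# Hypothetical helper: returns (date, email), or an empty list on no match.
sub extract_post {
    my ($html) = @_;
    if ( $html =~ m{
            Date: \s* (.*?) <br       # date is the text before the first br tag
            .*?
            Reply \s+ to: .*?
            <a \s [^>]* > (.*?) </a>  # email is the link text
        }six ) {
        return ( $1, $2 );
    }
    return;
}

my ( $date, $email ) = extract_post($sample);
print "date:  $date\n";
print "email: $email\n";
```

The /x flag lets the pattern be spread over several lines with comments, /s lets "." cross line breaks, and /i ignores case.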
(a) Parse the HTML and extract the phone number:
sub get_phone
{
    my ($self, $html) = @_;
    return unless $html;
    $html =~ s/<img.*?>//g;
    my ($phone) = $html =~ m{(?:\b|<b>)?([\d\-().]{10,})(?:\b|</b>|\s)}s;
    return unless($phone);
    return if ($phone=~m/\.{10,}/);                 # more..........
    return if ($phone=~m/(?:\d\s){3,}/);            # 5 0 0 0 0 0
    $phone =~ s/^\s+// if ($phone=~m/^\s+/);        # ' 123'
    $phone =~ s/\s+$// if ($phone=~m/\s+$/);        # '123 '
    $phone =~ s/^\.+// if ($phone=~m/^\./);         # '.1(604)'
    $phone =~ s/^-+// if ($phone=~m/^-/);           # '-1(604)'
    $phone = '(' . $phone if ($phone=~m/\)/ && $phone!~m/\(/);
    $phone =~ s/\s/-/g if ($phone=~m/\s/);          # '123 456 7890'
    $phone =~ s/-\($// if ($phone=~m/-\($/);        # '6789-('
    return $phone;
}
(b) Extract the URL from the HTML:
sub get_url
{
    my ($self, $html) = @_;
    return unless $html;
    my ($url) = $html =~ m{((http://|www\.)(?:[\w-]+\.){1,5}\w+(/\S*)?)}sig;
    unless ($url) {
        # No scheme or "www." prefix: fall back to a bare host name
        my $pattern = '(\.com|\.ca|\.info|\.us|\.tv|\.gov)';
        if ($html=~m/$pattern/i) {
            ($url) = $html =~ m{[^@](?:\b)((?:[\w-]+\.){1,5}(com|us|info|ca|jpg|png|jpeg|gif)(/\S*)?)}sig;
        }
    }
    # Clean up the match (the original tested an undeclared $web here)
    $url =~ s/<.*$// if ($url && $url=~m/<.*$/);        # trailing tag
    $url =~ s/">.*$// if ($url && $url=~m/">/);         # end of an attribute
    $url =~ s/&amp;/&/g if ($url && $url=~m/&amp;/);    # HTML entity
    $url =~ s/\S$// if ($url && $url=~m/["';,?]$/);     # trailing punctuation
    return $url;
}
(c) Extract the email address:
sub get_email {
    my ( $self, $str ) = @_;
    return unless $str;
    if ( $str =~ m/@/ ) {
        $str =~ s/<a.*?>//s;
        $str =~ s/<\/a>.*$//s;    # </a>
        $str = $self->trim($str);
    }
    else {
        $str = '';
    }
    return $str;
}
(d) A regular expression subroutine that trims whitespace from both ends of a string:
sub trim
{
    my ($self, $str) = @_;
    return '' unless $str;
    $str =~ s/&nbsp;/ /g if ($str =~ m/&nbsp;/);
    $str =~ s/&amp;/&/g if ($str =~ m/&amp;/);
    $str =~ s/^\s+// if ($str =~ m/^\s/);
    $str =~ s/\s+$// if ($str =~ m/\s$/);
    return $str;
}
Using the four subroutines above to parse the HTML content from craigslist, we extract exactly the data we want, and they cope with most of the format variations.
It is easy and simple, and it saves a lot of time.
A bonus question: how can we print a word-frequency or line-frequency summary of this HTML?
To do this, we have to parse out each word in the input stream. We'll pretend that by word we mean a chunk of alphabetics, hyphens, or apostrophes, rather than the non-whitespace chunk idea of a word:
while (<>) {
    while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
        $seen{$1}++;
    }
}
while ( ($word, $count) = each %seen ) {
    print "$count $word\n";
}
The code above parses the whole HTML content (all lines). If we want to do the same thing for individual lines, we can do this:
while (<>) {
    $seen{$_}++;
}
while ( ($line, $count) = each %seen ) {
    print "$count $line";
}
That completes this example of regular expressions in use: parsing the original HTML and extracting different pieces of content from it.
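Since each returns the entries of %seen in no particular order, a summary is usually easier to read when sorted. The following is a minimal sketch that counts words in a string and lists the most frequent first; the subroutine name word_frequency and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Count words in a string and return "count word" lines,
# most frequent first, ties broken alphabetically.
sub word_frequency {
    my ($text) = @_;
    my %seen;
    # Same word pattern as above: a letter followed by word chars, ' or -
    $seen{$1}++ while $text =~ /\b([^\W_\d][\w'-]+)\b/g;

    my @lines;
    for my $word ( sort { $seen{$b} <=> $seen{$a} || $a cmp $b } keys %seen ) {
        push @lines, "$seen{$word} $word";
    }
    return @lines;
}

print "$_\n" for word_frequency("the cat and the hat and the bat");
```

As in the loop above, the pattern requires at least two characters, so single-letter words such as "a" are not counted.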
