OnlyOne 2006-9-25 13:33

Introduction to the grep function

Perl: grep, map and sort (grep part)
Translator: SaladJonk
Reviewer: qiang
Source: 中国Perl协会 (China Perl Association) FPC
Original title: Perl: grep, map and sort
Author: Richard Anderson
Original article: [url]http://xrl.us/gcqq[/url]
Published: 2002

The grep function

(If you are new to Perl, skip the next two paragraphs and proceed to the “Grep vs. loops” example below. Hang loose, you’ll pick it up as you go along.)
grep BLOCK LIST
grep EXPR, LIST

The grep function evaluates the BLOCK or EXPR for each element of LIST, locally setting the $_ variable equal to each element. BLOCK is one or more Perl statements delimited by curly brackets. LIST is an ordered set of values. EXPR is one or more variables, operators, literals, functions, or subroutine calls. Grep returns a list of those elements for which the EXPR or BLOCK evaluates to TRUE. If there are multiple statements in the BLOCK, the last statement determines whether the BLOCK evaluates to TRUE or FALSE. LIST can be a list or an array. In a scalar context, grep returns the number of times the expression was TRUE.
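
As an illustration of the two calling forms and the two contexts described above, here is a minimal sketch. The array @numbers and the result variable names are made up for this example and do not come from the original article.

```perl
use strict;
use warnings;

my @numbers = (3, 8, 15, 22, 42);   # hypothetical sample data

# EXPR form: the expression is followed by a comma.
my @evens_expr = grep $_ % 2 == 0, @numbers;

# BLOCK form: no comma after the closing curly bracket.
my @evens_block = grep { $_ % 2 == 0 } @numbers;

# Multi-statement BLOCK: the last statement decides TRUE or FALSE.
my @big_evens = grep { my $half = $_ / 2; $half > 5 && $_ % 2 == 0 } @numbers;

# Scalar context: grep returns the number of matches, not the list.
my $even_count = grep { $_ % 2 == 0 } @numbers;

print "@evens_expr\n";    # 8 22 42
print "@evens_block\n";   # 8 22 42
print "@big_evens\n";     # 22 42
print "$even_count\n";    # 3
```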

Avoid modifying $_ in grep’s BLOCK or EXPR, as this will modify the elements of LIST. Also, avoid using the list returned by grep as an lvalue, as this will modify the elements of LIST. (An lvalue is a variable on the left side of an assignment statement.) Some Perl hackers may try to exploit these features, but I recommend that you avoid this confusing style of programming.
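
The aliasing pitfall can be seen in a small sketch. The arrays @risky and @safe are hypothetical sample data; the second grep shows one possible way to avoid the side effect by working on a copy of $_.

```perl
use strict;
use warnings;

my @original = ("  apple", "pear  ", "plum");   # hypothetical sample data

# Risky: s/// modifies $_, and $_ is an alias for the current element,
# so the leading spaces are stripped from @risky itself.
my @risky   = @original;
my @matched = grep { s/^\s+//; /^a/ } @risky;

# Safer: copy $_ into a lexical variable and modify only the copy.
my @safe         = @original;
my @matched_safe = grep { my $copy = $_; $copy =~ s/^\s+//; $copy =~ /^a/ } @safe;

print join("|", @risky), "\n";   # apple|pear  |plum   (first element changed)
print join("|", @safe),  "\n";   #   apple|pear  |plum (unchanged)
```

In both cases grep selects one element, but only the first version changes the input array as a side effect.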


Grep vs. loops

This example prints any lines in the file named myfile that contain the (case-insensitive) strings terrorism or nuclear:



open FILE, "<myfile" or die "Can't open myfile: $!";
print grep /terrorism|nuclear/i, <FILE>;

This code consumes a lot of memory for large files because grep evaluates its second argument in a list context, so the line-input operator (<FILE>) returns every line of the file at once. A more memory-efficient way to do the same thing is:


while ($line = <FILE>) {
    if ($line =~ /terrorism|nuclear/i) { print $line }
}

The above examples show that a loop can do anything that grep can do. So why use grep? The glib answer is that grep is more Perlish whereas loops are more C-like. A better answer is that (1) grep makes it obvious to the reader that you are selecting elements from a list, and (2) grep is more succinct than a loop. (Software engineers would say that grep has more cohesion than a loop.) Bottom line: if you are not experienced with Perl, go ahead and use loops; as you become familiar with Perl, take advantage of power tools like grep.
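
To make the comparison concrete without needing a file on disk, the following sketch filters an in-memory list both ways. The contents of @lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical lines standing in for the contents of a file.
my @lines = (
    "Nuclear power plant opens\n",
    "Weather report\n",
    "Report on terrorism\n",
);

# Loop version: test each element and collect the matches by hand.
my @by_loop;
for my $line (@lines) {
    push @by_loop, $line if $line =~ /terrorism|nuclear/i;
}

# grep version: the selection is expressed in a single statement.
my @by_grep = grep { /terrorism|nuclear/i } @lines;

print @by_grep;   # prints the first and third lines
```

Both versions select the same two lines; the grep version states the intent (selecting from a list) directly.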


Count array elements that match a pattern

In a scalar context, grep returns a count of the selected elements.



$num_apple = grep /^apple$/i, @fruits;

The ^ and $ metacharacters anchor the regular expression to the beginning and end of the string, respectively, so that grep selects apple but not pineapple.
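
A short sketch of the effect of the anchors, using a made-up @fruits array. It also shows a detail not covered above: $ still matches before a trailing newline, whereas \A and \z anchor to the absolute start and end of the string.

```perl
use strict;
use warnings;

my @fruits = qw(apple Apple pineapple APPLE applesauce pear);   # sample data

my $anchored   = grep { /^apple$/i } @fruits;   # whole string must be "apple"
my $unanchored = grep { /apple/i } @fruits;     # "apple" anywhere in the string

print "$anchored\n";     # 3
print "$unanchored\n";   # 5

# $ tolerates one trailing newline; \z does not.
my @with_newline = ("apple\n");
my $dollar = grep { /^apple$/i } @with_newline;     # 1
my $strict = grep { /\Aapple\z/i } @with_newline;   # 0
```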

Extract unique elements from a list



@unique = grep { ++$count{$_} < 2 }
               qw(a b a c d d e f g f h h);
print "@unique\n";

Output: a b c d e f g h

The $count{$_} is a single element of a Perl hash, which is a set of key-value pairs. (The meaning of “hash” in Perl is related to, but not identical to, the meaning of “hash” in computer science.) The hash keys are the elements of grep’s input list; the hash values are running counts of how many times an element has passed through grep’s BLOCK. The expression is true the first time an element occurs in the list (the count is 1, which is less than 2) and false for all subsequent occurrences.
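
One caveat with the snippet above is that %count is a global hash, so running it a second time in the same program would find every count already at 2 or more. A possible way around this, sketched here with a hypothetical helper named unique_in_order, is to keep the counting hash in a lexical variable inside a subroutine.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original article): returns the first
# occurrence of each element, preserving input order.
sub unique_in_order {
    my %seen;                            # starts empty on every call
    return grep { !$seen{$_}++ } @_;     # true only while the count is still 0
}

my @unique = unique_in_order(qw(a b a c d d e f g f h h));
my @again  = unique_in_order(qw(a b a c d d e f g f h h));

print "@unique\n";   # a b c d e f g h
print "@again\n";    # a b c d e f g h (same result on the second call)
```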

Extract list elements that occur exactly twice



@crops = qw(wheat corn barley rice corn soybean hay
            alfalfa rice hay beets corn hay);
@duplicates = grep { $count{$_} == 2 }
              grep { ++$count{$_} > 1 } @crops;
print "@duplicates\n";

Output: rice

The second argument to grep is “evaluated in a list context” before the first list element is passed to grep’s BLOCK or EXPR. This means that the grep on the right completely loads the %count hash before the grep on the left begins evaluating its BLOCK.
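
The order of evaluation may be easier to follow when the two greps are written as separate steps. This is only a restatement of the example above; the intermediate variable names @repeats and @exactly_twice are introduced here for illustration.

```perl
use strict;
use warnings;

my @crops = qw(wheat corn barley rice corn soybean hay
               alfalfa rice hay beets corn hay);
my %count;

# Step 1 (the right-hand grep): runs over the whole list first, so
# %count holds the final totals when it finishes.
my @repeats = grep { ++$count{$_} > 1 } @crops;

# Step 2 (the left-hand grep): sees only the repeats, and compares
# against the completed totals.
my @exactly_twice = grep { $count{$_} == 2 } @repeats;

print "@repeats\n";         # corn rice hay corn hay
print "@exactly_twice\n";   # rice
print "corn=$count{corn} rice=$count{rice} hay=$count{hay}\n";   # corn=3 rice=2 hay=3
```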


List text files in the current directory

@files = grep { -f and -T } glob '* .*';
print "@files\n";

The glob function is an OS-independent emulation of the filename expansion that the Unix shell does. The lone asterisk means “give me all the files in the current directory except those beginning with a period”; the .* means “give me all the files in the current directory beginning with a period”. The -f and -T file test operators return TRUE for plain and text files respectively. Testing with -f and -T is more efficient than testing with only -T because the -T operator is not evaluated if a file fails the less costly -f test.
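
To try this without depending on whatever happens to be in the current directory, the sketch below builds a scratch directory first. The file names notes.txt, .hidden and subdir are invented for the example, and the file tests are written with an explicit $_ (equivalent to the bare -f and -T above). It assumes a system where File::Temp can create a temporary directory.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

my $start_dir = getcwd();
my $dir = tempdir(CLEANUP => 1);    # scratch directory, removed at exit
chdir $dir or die "Can't chdir to $dir: $!";

# Two plain text files (one hidden) and one subdirectory.
for my $name ('notes.txt', '.hidden') {
    open my $fh, '>', $name or die "Can't create $name: $!";
    print {$fh} "just some text\n";
    close $fh or die "Can't close $name: $!";
}
mkdir 'subdir' or die "Can't mkdir subdir: $!";

# '.*' also matches . and .. but those are directories, so -f rejects
# them (and subdir) before -T is ever evaluated.
my @files = sort grep { -f $_ and -T $_ } glob '* .*';

chdir $start_dir or die "Can't chdir back to $start_dir: $!";
print "@files\n";   # .hidden notes.txt
```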

Select array elements and eliminate duplicates


@array = qw(To be or not to be that is the question);
print "@array\n";
@found_words =
    grep { $_ =~ /b|o/i and ++$counts{$_} < 2; } @array;
print "@found_words\n";

Output:
To be or not to be that is the question
To be or not to question

The logical expression $_ =~ /b|o/i uses the match operator to select words that contain b or o (case-insensitive). Putting the match operator test before the hash increment test is slightly more efficient than vice-versa (for this example): if the left-hand expression is FALSE, the right-hand expression is not evaluated.
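
The effect of the ordering can be observed by counting how many hash entries each version creates. In this sketch the hash names %match_first and %count_first are made up for the comparison.

```perl
use strict;
use warnings;

my @array = qw(To be or not to be that is the question);

# Match test first: the counter is only touched for words that match.
my %match_first;
my @found_a = grep { /b|o/i and ++$match_first{$_} < 2 } @array;

# Counter first: every word gets a hash entry, matching or not.
my %count_first;
my @found_b = grep { ++$count_first{$_} < 2 and /b|o/i } @array;

print "@found_a\n";                        # To be or not to question
print "@found_b\n";                        # To be or not to question
print scalar(keys %match_first), "\n";     # 6
print scalar(keys %count_first), "\n";     # 9
```

Both orderings select the same words, but the second does extra hash work for that, is and the.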

Select elements from a 2-D array where x < y



# An array of references to anonymous arrays
@data_points = ( [ 5, 12 ], [ 20, -3 ],
                 [ 2, 2 ], [ 13, 20 ] );
@y_gt_x = grep { $_->[0] < $_->[1] } @data_points;
foreach $xy (@y_gt_x) { print "$xy->[0], $xy->[1]\n" }

Output:
5, 12
13, 20

Search a simple database for restaurants

This example is not a practical way to implement a database, but does illustrate that the only limit to the complexity of grep’s block is the amount of virtual memory available to the program.


# @database is array of references to anonymous hashes
@database = (
    { name      => "Wild Ginger",
      city      => "Seattle",
      cuisine   => "Asian Thai Chinese Korean Japanese",
      expense   => 4,
      music     => 0,    # false; note that the string "\0" would be TRUE in Perl
      meals     => "lunch dinner",
      view      => 0,    # false: no view
      smoking   => 0,    # false: no smoking
      parking   => "validated",
      rating    => 4,
      payment   => "MC VISA AMEX",
    },
#   { ... },  etc.
);

sub findRestaurants {
    my ($database, $query) = @_;
    return grep {
        $query->{city} ?
            lc($query->{city}) eq lc($_->{city}) : 1
        and $query->{cuisine} ?
            $_->{cuisine} =~ /$query->{cuisine}/i : 1
        and $query->{min_expense} ?
           $_->{expense} >= $query->{min_expense} : 1
        and $query->{max_expense} ?
           $_->{expense} <= $query->{max_expense} : 1
        and $query->{music} ? $_->{music} : 1
        and $query->{music_type} ?
           $_->{music} =~ /$query->{music_type}/i : 1
        and $query->{meals} ?
           $_->{meals} =~ /$query->{meals}/i : 1
        and $query->{view} ? $_->{view} : 1
        and $query->{smoking} ? $_->{smoking} : 1
        and $query->{parking} ? $_->{parking} : 1
        and $query->{min_rating} ?
           $_->{rating} >= $query->{min_rating} : 1
        and $query->{max_rating} ?
           $_->{rating} <= $query->{max_rating} : 1
        and $query->{payment} ?
           $_->{payment} =~ /$query->{payment}/i : 1
    } @$database;
}

%query = ( city => 'Seattle', cuisine => 'Asian|Thai' );
@restaurants = findRestaurants(\@database, \%query);
print "$restaurants[0]->{name}\n";

Output: Wild Ginger
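
Each clause in the block above follows the same pattern: if the query supplies a criterion, test it; otherwise evaluate to 1 so that the clause passes. Because the ?: operator binds more tightly than and, each clause is evaluated as a unit. A reduced sketch of the idiom follows, with the clauses parenthesized for readability; the second record ("Hypothetical Diner") and the name find_matches are invented for this example.

```perl
use strict;
use warnings;

my @records = (
    { name => "Wild Ginger",        city => "Seattle", rating => 4 },
    { name => "Hypothetical Diner", city => "Tacoma",  rating => 2 },   # made-up record
);

# Hypothetical reduced version of findRestaurants with two criteria.
sub find_matches {
    my ($records, $query) = @_;
    return grep {
        ($query->{city} ? lc($query->{city}) eq lc($_->{city}) : 1)
        and
        ($query->{min_rating} ? $_->{rating} >= $query->{min_rating} : 1)
    } @$records;
}

my @in_seattle = find_matches(\@records, { city => 'seattle' });
my @well_rated = find_matches(\@records, { min_rating => 3 });
my @everything = find_matches(\@records, {});   # no criteria: all records pass

print "$in_seattle[0]->{name}\n";    # Wild Ginger
print "$well_rated[0]->{name}\n";    # Wild Ginger
print scalar(@everything), "\n";     # 2
```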