Performance of Python, PHP and Perl
Had a 7GB text file that I needed to run some parsing on (to prepare for a DB import). Out of habit I pulled out Perl and whipped up a quick program to parse it and generate some loadable files. While watching it run I got to thinking about … why … why Perl (yes, I know habits are hard to break). So while it ran I rewrote the program in PHP and Python.
Performance Numbers (on 5 million lines worth of the file)
$ time ./split.pl p.test # Perl 5.8.8

real 0m38.577s
user 0m33.554s
sys 0m0.848s

$ time ./split.py p.test # Python 2.4.4

real 0m44.895s
user 0m42.975s
sys 0m0.900s

$ time php split.php p.test # PHP 5.2.6RC4

real 1m10.887s
user 0m51.251s
sys 0m18.677s
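To put the three runs on one scale, here is a small sketch that only does arithmetic on the figures above, expressing each time as a multiple of the Perl time. The helper names (`to_seconds`, `slowdown`) are made up for this illustration.

```python
import re

# Figures copied from the `time` output above.
timings = {
    "perl":   {"real": "0m38.577s", "user": "0m33.554s", "sys": "0m0.848s"},
    "python": {"real": "0m44.895s", "user": "0m42.975s", "sys": "0m0.900s"},
    "php":    {"real": "1m10.887s", "user": "0m51.251s", "sys": "0m18.677s"},
}

def to_seconds(stamp):
    """Convert a `time`-style string such as '1m10.887s' to seconds."""
    m = re.match(r'^(\d+)m([\d.]+)s$', stamp)
    if m is None:
        raise ValueError("unexpected time format: %r" % stamp)
    return int(m.group(1)) * 60 + float(m.group(2))

def slowdown(lang, kind, baseline="perl"):
    """Ratio of lang's time to the baseline's time for 'real', 'user' or 'sys'."""
    return to_seconds(timings[lang][kind]) / to_seconds(timings[baseline][kind])

for lang in ("python", "php"):
    print("%-6s real x%.2f  user x%.2f"
          % (lang, slowdown(lang, "real"), slowdown(lang, "user")))
```

Relative to Perl, this works out to roughly 1.16x (real) and 1.28x (user) for Python, and 1.84x (real) and 1.53x (user) for PHP.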
So, it appears that Perl is the right choice for this job. Python is a good second choice (about 16% slower in wall-clock time), but PHP is more than 50% slower: about 53% more user time and 84% more wall-clock time than Perl (most likely due to not having compiled regular expressions). I might also note that I’m not fond of the Python if/else problem with a chained expression match, where I want to “side effect” out the results of the match. Is there better syntax?
Here’s the code for your viewing pleasure and possible commentary.
perl
use strict;
use warnings;

my %first;    # first token => summed count

open(my $full, '>', 'full.txt') or die "Cannot open full.txt: $!";

while (<>) {
    # Sample input lines:
    # __SINGLE_TOKEN__ adrianenamorado 1
    # __MULTI_TOKEN__ a aaron yalow 1
    chomp;
    # (.*?) is non-greedy so that a multi-digit count is captured whole in $3.
    if (/^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$/) {
        $first{$1} += $3;
        print $full $1, " ", $2, "\t", $3, "\n";
    } elsif (/^__SINGLE_TOKEN__\s+(\S+)\s*\t?\s*(\d+)\s*$/) {
        $first{$1} += $2;
    } else {
        print "Unknown: ", $_, "\n";
    }
}

close($full);

open(my $first_fh, '>', 'first.txt') or die "Cannot open first.txt: $!";
while (my ($k, $c) = each %first) {
    print $first_fh $k, "\t", $c, "\n";
}
close($first_fh);
python
import sys, re

first = {}    # first token -> summed count

ofd = open("full.txt", 'w')

# Compile both patterns once, outside the loop. (.*?) is non-greedy so that
# a multi-digit count is captured whole in group 3.
mre = re.compile(r'^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$')
sre = re.compile(r'^__SINGLE_TOKEN__\s+(\S+)\s*\t?\s*(\d+)\s*$')

ifd = open(sys.argv[1], 'r')

for line in ifd:
    line = line.strip()
    m = mre.match(line)
    if m:
        # Accumulate the count, as the Perl and PHP versions do with +=.
        first[m.group(1)] = first.get(m.group(1), 0) + int(m.group(3))
        ofd.write("%s %s\t%s\n" % (m.group(1), m.group(2), m.group(3)))
    else:
        m = sre.match(line)
        if m:
            first[m.group(1)] = first.get(m.group(1), 0) + int(m.group(2))
        else:
            sys.stdout.write("Unknown: %s\n" % line)

ifd.close()
ofd.close()

ofd = open("first.txt", 'w')
for k, c in first.items():
    ofd.write("%s\t%d\n" % (k, c))
ofd.close()
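On the question of nicer syntax for the chained match in Python: one option that works on any Python version is to move the matching into a small function and return as soon as a pattern hits, which removes the nested if/else. This is only a sketch. The function name `parse_line` and the tuple layout are made up here, and the two patterns are simplified stand-ins for the ones in the script.

```python
import re

# Simplified stand-ins for the script's two patterns (an assumption for this sketch).
MULTI = re.compile(r'^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\s+(\d+)\s*$')
SINGLE = re.compile(r'^__SINGLE_TOKEN__\s+(\S+)\s+(\d+)\s*$')

def parse_line(line):
    """Return ('multi', first, rest, count), ('single', first, count) or None."""
    m = MULTI.match(line)
    if m:
        return ('multi', m.group(1), m.group(2), int(m.group(3)))
    m = SINGLE.match(line)
    if m:
        return ('single', m.group(1), int(m.group(2)))
    return None

# Tiny demonstration on the two sample lines plus one bad line.
first = {}
samples = [
    "__SINGLE_TOKEN__ adrianenamorado 1",
    "__MULTI_TOKEN__ a aaron yalow 1",
    "garbage",
]
for line in samples:
    rec = parse_line(line.strip())
    if rec is None:
        print("Unknown: " + line)
    else:
        # rec[1] is the first token, rec[-1] is the count in both record shapes.
        first[rec[1]] = first.get(rec[1], 0) + rec[-1]
```

On Python 3.8 and later, an assignment expression (`if (m := MULTI.match(line)):` followed by `elif (m := SINGLE.match(line)):`) gives a flat if/elif chain inline, but that is not available on the 2.4 interpreter used for the timings.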
php
<?php
$first = array();  // first token => summed count

$fd = fopen("full.txt", 'w');
$in = fopen($argv[1], 'r');

// fgets() returns false at end of file, so compare against false explicitly.
while (($line = fgets($in)) !== false) {
    $line = trim($line);
    // (.*?) is non-greedy so that a multi-digit count is captured whole in $m[3].
    if (preg_match('/^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$/', $line, $m)) {
        $first[$m[1]] = (isset($first[$m[1]]) ? $first[$m[1]] : 0) + (int) $m[3];
        fprintf($fd, "%s %s\t%d\n", $m[1], $m[2], $m[3]);
    } elseif (preg_match('/^__SINGLE_TOKEN__\s+(\S+)\s*\t?\s*(\d+)\s*$/', $line, $m)) {
        $first[$m[1]] = (isset($first[$m[1]]) ? $first[$m[1]] : 0) + (int) $m[2];
    } else {
        print "Unknown: {$line}\n";
    }
}

fclose($in);
fclose($fd);

$fd = fopen("first.txt", 'w');
foreach ($first as $k => $c) {
    fprintf($fd, "%s\t%d\n", $k, $c);
}
fclose($fd);
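In the multi-token pattern, whether the middle group is greedy (`.*`) or non-greedy (`.*?`) decides how much of a multi-digit count ends up in the last group. A quick check with Python's `re`, using a made-up input line with a count of 15 (not taken from the real file):

```python
import re

# Made-up line with a two-digit count; not taken from the real data file.
line = "__MULTI_TOKEN__ a aaron yalow 15"

greedy = re.compile(r'^__MULTI_TOKEN__\s+(\S+)\s+(.*)\t?\s*(\d+)\s*$')
lazy = re.compile(r'^__MULTI_TOKEN__\s+(\S+)\s+(.*?)\t?\s*(\d+)\s*$')

greedy_groups = greedy.match(line).groups()
lazy_groups = lazy.match(line).groups()

# Greedy (.*) swallows everything it can and leaves (\d+) only the last digit.
print(greedy_groups)  # ('a', 'aaron yalow 1', '5')
# Non-greedy (.*?) stops as early as possible, so the whole count is captured.
print(lazy_groups)    # ('a', 'aaron yalow', '15')
```

With single-digit counts, as in the sample lines, both forms capture the same count. The greedy one then differs only in keeping the trailing whitespace in the middle group.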