Re: File Write Operation Slows to a Crawl....

Paul M Foster <paulf@xxxxxxxxxxxxxxxxx> · Mon, 23 Feb 2009 15:27:33 -0500

On Mon, Feb 23, 2009 at 11:59:20AM -0500, fschnittke@xxxxxxxxxxxxx wrote:

> Hi:
> 
> Newbie here. This is my first attempt at PHP scripting. I'm trying to find
> an alternative to Lotus Domino's domlog.nsf for logging web transactions.
> Domino does create an Apache compatible text file of the web transactions,
> and this is what I'm trying to parse. I started off using a code snippet I
> found on the web. I modified it a little bit to suit my needs. It was
> working fine with the small 600k test log file I was using, but since I've
> moved to the larger 18Mb production log file here's what happens:
> 
> I've modified the code and added an echo statement to echo each loop that
> gets processed. Initially it starts off very fast but then performance
> becomes very slow, to a point where I can count each loop as it's being
> processed. It's taking a little over 3 hours to parse the entire file. I
> figured it was a disk cache thing, so I created a ram drive. This has
> improved the performance, but is still taking an hour to parse.
> 
> Here is the PHP script I'm using:
> 
> 
> <?php
> 
> $ac_arr = file('access_log');
> $astring = join("", $ac_arr);

First, don't use file() (which reads a file into an array of strings),
and then join() (which makes them into a single string). Use
file_get_contents(), which does what you want all in one step. The
result is a single string.
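
A minimal sketch of the difference, assuming a small temporary file
stands in for the real access_log (the sample lines are made up):

```php
<?php
// Hypothetical sample data standing in for the real access_log.
$path = tempnam(sys_get_temp_dir(), 'log');
file_put_contents($path, "line one\r\nline two\r\n");

// Original approach: read an array of lines, then glue them back together.
$joined = join("", file($path));

// Single step: read the whole file as one string.
$astring = file_get_contents($path);

unlink($path);

// Both approaches yield the same string.
var_dump($joined === $astring); // bool(true)
```

file() keeps the line endings by default, which is why joining with an
empty string reproduces the file exactly.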

> $astring = preg_replace("/(\r|\t)/", "", $astring);

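Use str_replace() for the above. It accepts an array of search strings,
so a single call can strip both characters (two separate calls work
too). It's likely faster because it avoids running the regular
expression engine over an 18M string.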
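
As a sketch, here is the replacement on a made-up two-line sample,
checked against the original preg_replace() call. It passes both
characters to str_replace() as an array, which is equivalent to calling
it once per character:

```php
<?php
// Hypothetical sample; the real input would be the whole log as a string.
$astring = "10.0.0.1\t- -\r\n10.0.0.2\t- -\r\n";

// Original: a regular expression that removes \r and \t.
$via_regex = preg_replace("/(\r|\t)/", "", $astring);

// Plain string replacement; the array lists every string to remove.
$via_str = str_replace(array("\r", "\t"), "", $astring);

var_dump($via_regex === $via_str); // bool(true)
```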

> $records = preg_split("/(\n)/", $astring, -1, PREG_SPLIT_NO_EMPTY);

It looks like you're trying to split the string at newlines, so you end
up with an array of strings again. If that's the case, forget what I
said about file_get_contents(): just stick with the file() call earlier,
which already gives you one array element per line, and then remove the
\r's and \t's.
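
A minimal sketch of that route. The FILE_IGNORE_NEW_LINES and
FILE_SKIP_EMPTY_LINES flags are my addition, not something in the
original script; they make file() drop the trailing newline and skip
blank lines. The sample data and temporary file are made up:

```php
<?php
// Hypothetical sample data standing in for the real access_log.
$path = tempnam(sys_get_temp_dir(), 'log');
file_put_contents($path, "1.2.3.4\t- -\r\n5.6.7.8\t- -\r\n");

// One array element per line, without the line ending.
$records = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// str_replace() also accepts an array as its subject and cleans
// every element in one call.
$records = str_replace(array("\r", "\t"), "", $records);

unlink($path);
print_r($records);
```

This skips both the join() and the preg_split() steps.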

> 
> $sizerecs = sizeof($records);
> 
> // now split into records
> $i = 1;
> $each_rec = 0;
> 
> while($i<$sizerecs) {
> $all = $records[$i];
> 
> // IP Address ($IP):
> $IP = substr($all, 0, strpos($all, " "));
> $all = str_replace($IP, "", $all);
> 
> //Remote User ($RU):
> $string = substr($all, 0, strpos($all, " [")); // www.vpcl.on.ca T123
> $sstring = substr($string, strpos($string, " ")+1);
> $AUstring = substr($sstring, strpos($sstring, " "));
> $RU = preg_replace("/\"/", "", $AUstring);
> $RU = trim($RU);
> $all = str_replace($string, "", $all);
> 
> //Request Time Stamp ($RTS):
> preg_match("/\[(.+)\]/", $all, $match);
> $RTS = $match[1];
> $all = str_replace(" [$RTS] \"", "", $all);
> 
> //Http Request Line ($HRL):
> $string = substr($all, 0, strpos($all, "\"")+2);
> $HRL = str_replace("\"", "", $string);
> $all = str_replace($string, "", $all);
> 
> //Http Response Status Code (HRSC):
> $HRSC = trim(substr($all, 0, strpos($all, " ")+1));
> $all = str_replace($HRSC, "", $all);
> 
> //Request Content Length (RCL):
> $string = substr($all, 0, strpos($all, "\"")+1);
> $RCL = trim(str_replace("\"", "", $string));
> $all = str_replace($string, "", $all);
> 
> //Referring URL (RefU):
> $string = substr($all, 0, strpos($all, "\"")+3);
> $RefU = substr($all, 0, strpos($all, "\""));
> $all = str_replace($string, "", $all);
> 
> //User Agent (UA):
> $string = substr($all, 0, strpos($all, "\"")+2);
> $UA = substr($all, 0, strpos($all, "\""));
> $all = str_replace($string, "", $all);
> 
> //Time to Process Request:
> 
> #$new_format[$each_rec] = "$UA\n";
> $new_format[$each_rec] =
> "$IP\t$RU\t$RTS\t$HRL\t$HRSC\t$RCL\t$RefU\t$UA\t$all\n";

I would do various cleanups in the above code, but it's the following
code that disturbs me. If I'm not misreading, you're reopening the output
file on every pass through your loop: for each line you process, you
open the file, write to it, and close it again. If you're relatively
certain no one else will be writing to this file, simply open it once
before your while() loop, and write a single line to it at this point.

Also, you're using a foreach loop here, but you only have one new record
to write on each pass, so there's no need for a loop. You're also
storing the results in a progressively larger array ($new_format). Since
you're not using this array for any other purpose, there's no point in
keeping it. Just use a string to hold the current record, and write that
out. So instead of what you're doing below, and considering that
$new_format will be a string instead of an array, just do:

fputs($fhandle, $new_format);

Close the file after the end of your loop.
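
Put together, the shape would be roughly as follows. This is only a
sketch: $records holds made-up strings in place of the parsed fields,
and a temporary file stands in for /ramdrive/import_file.txt:

```php
<?php
// Hypothetical records; in the real script each one is built from the
// parsed fields ($IP, $RU, $RTS, and so on).
$records = array("rec one", "rec two", "rec three");
$out = tempnam(sys_get_temp_dir(), 'imp');

$fhandle = fopen($out, "w");       // open once, before the loop
foreach ($records as $rec) {
    $new_format = "$rec\n";        // a plain string, not a growing array
    fputs($fhandle, $new_format);  // one write per record
}
fclose($fhandle);                  // close once, after the loop

$written = file_get_contents($out);
unlink($out);
echo $written;
```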

I must be reading this wrong, because it looks to me like you're
rewriting the entire output so far on every iteration. I don't think
that's what you had in mind, and it's why you don't need a loop here. If
that isn't clear, consider this: the way this is written, $new_format is
an array that grows by one string each iteration, and each iteration
writes out the *whole* array. The first time through, you write one
line. The second time through, you write the first line again, and then
the second. The third time through, you write the first, the second, and
the third. And so on. Because the file is opened in "w" mode, it is
truncated each time, so the final contents come out right, but after n
records you have written n(n+1)/2 lines in total. That would explain why
the script starts fast and then crawls. If you stay with your existing
code, you could move the end of your loop to *above* this point, and
perform the file-open/loop/file-close operation once, after the loop has
terminated. That's probably faster than stopping inside each iteration
to write a line.
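
To make the arithmetic concrete, here is a small sketch that counts how
many line writes each layout would perform. The four sample records,
the counter variables, and the temporary output file are all made up for
illustration:

```php
<?php
$records = array("a", "b", "c", "d"); // hypothetical parsed records
$new_format = array();

// Count what the write-inside-the-loop layout would do: on each pass
// the whole array built so far gets written out again.
$rewrite_calls = 0;
foreach ($records as $rec) {
    $new_format[] = "$rec\n";
    $rewrite_calls += count($new_format);
}

// Alternative: build the array first, then open, write, and close once.
$out = tempnam(sys_get_temp_dir(), 'imp');
$fhandle = fopen($out, "w");
$single_pass_calls = 0;
foreach ($new_format as $data) {
    fputs($fhandle, $data);
    $single_pass_calls++;
}
fclose($fhandle);

$written = file_get_contents($out);
unlink($out);

echo "$rewrite_calls vs $single_pass_calls\n"; // 10 vs 4
```

With four records the gap is small, but the first count grows with the
square of the number of records and the second only linearly.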

> 
> $fhandle = fopen("/ramdrive/import_file.txt", "w");
>   foreach($new_format as $data) {
>     fputs($fhandle, "$data");
>     }
>   fclose($fhandle);
> 
> // advance to next record
> echo "$i\n";
> $i = $i + 1;
> 
> $each_rec++;
> }
> ?>
> 
> 
> This is running on a Toshiba Tecra A4 Laptop with FreeBSD 7.0 Release.
> Plenty of RAM and HDD space. The PHP Version is:
> 
> PHP 5.2.5 with Suhosin-Patch 0.9.6.2 (cli) (built: Feb 11 2009 09:28:47)
> Copyright (c) 1997-2007 The PHP Group
> Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
> 
> What should I do to get this script to run faster?
> 
> Any help is appreciated.
> 
> Regards,
> 
> 
> 
> Fred Schnittke
> 

Hope I read things correctly and that this helps. Sorry if my analysis
is off.

Paul

-- 
Paul M. Foster
