Re: File Write Operation Slows to a Crawl....

Shawn McKenzie <nospam@xxxxxxxxxxxxx> · Mon, 23 Feb 2009 14:55:43 -0600



Shawn McKenzie wrote:
> fschnittke@xxxxxxxxxxxxx wrote:
>> Hi:
>>
>> Newbie here. This is my first attempt at PHP scripting. I'm trying to find
>> an alternative to Lotus Domino's domlog.nsf for logging web transactions.
>> Domino does create an Apache compatible text file of the web transactions,
>> and this is what I’m trying to parse. I started off using a code snippet I
>> found on the web. I modified it a little bit to suit my needs. It was
>> working fine with the small 600k test log file I was using, but since I’ve
>> moved to the larger 18Mb production log file here’s what happens:
>>
>> I’ve modified the code and added an echo statement to echo each loop that
>> gets processed. Initially it starts off very fast but then performance
>> becomes very slow, to a point where I can count each loop as it’s being
>> processed. It’s taking a little over 3 hours to parse the entire file. I
>> figured it was a disk cache thing, so I created a ram drive. This has
>> improved the performance, but is still taking an hour to parse.
>>
>> Here is the PHP script I’m using:
>>
>>
>> <?php
>>
> Why read the file into an array, implode it to a string, and then split
> it back into an array?  Just use file_get_contents() and split it, or use
> file() and then run preg_replace("/(\r|\t)/", "", ...) on the array.
>> $ac_arr = file('access_log');
>> $astring = join("", $ac_arr);
>> $astring = preg_replace("/(\r|\t)/", "", $astring);
>> $records = preg_split("/(\n)/", $astring, -1, PREG_SPLIT_NO_EMPTY);
>>
>> $sizerecs = sizeof($records);
>>
>> // now split into records
>> $i = 1;
>> $each_rec = 0;
>>
> Why not foreach($records as $all) ?
>> while($i<$sizerecs) {
>> $all = $records[$i];
>>
> All of these $all = str_replace() calls and the other str_replace() calls
> are probably killing you.  Rethink this so that you extract the data
> instead of finding it and then replacing it.
>> // IP Address ($IP):
>> $IP = substr($all, 0, strpos($all, " "));
>> $all = str_replace($IP, "", $all);
>>
>> //Remote User ($RU):
>> $string = substr($all, 0, strpos($all, " [")); // www.vpcl.on.ca T123
>> $sstring = substr($string, strpos($string, " ")+1);
>> $AUstring = substr($sstring, strpos($sstring, " "));
>> $RU = preg_replace("/\"/", "", $AUstring);
>> $RU = trim($RU);
>> $all = str_replace($string, "", $all);
>>
>> //Request Time Stamp ($RTS):
>> preg_match("/\[(.+)\]/", $all, $match);
>> $RTS = $match[1];
>> $all = str_replace(" [$RTS] \"", "", $all);
>>
>> //Http Request Line ($HRL):
>> $string = substr($all, 0, strpos($all, "\"")+2);
>> $HRL = str_replace("\"", "", $string);
>> $all = str_replace($string, "", $all);
>>
>> //Http Response Status Code (HRSC):
>> $HRSC = trim(substr($all, 0, strpos($all, " ")+1));
>> $all = str_replace($HRSC, "", $all);
>>
>> //Request Content Length (RCL):
>> $string = substr($all, 0, strpos($all, "\"")+1);
>> $RCL = trim(str_replace("\"", "", $string));
>> $all = str_replace($string, "", $all);
>>
>> //Referring URL (RefU):
>> $string = substr($all, 0, strpos($all, "\"")+3);
>> $RefU = substr($all, 0, strpos($all, "\""));
>> $all = str_replace($string, "", $all);
>>
>> //User Agent (UA):
>> $string = substr($all, 0, strpos($all, "\"")+2);
>> $UA = substr($all, 0, strpos($all, "\""));
>> $all = str_replace($string, "", $all);
>>
>> //Time to Process Request:
>>
>> #$new_format[$each_rec] = "$UA\n";
>> $new_format[$each_rec] =
>> "$IP\t$RU\t$RTS\t$HRL\t$HRSC\t$RCL\t$RefU\t$UA\t$all\n";
>>
> Each time through the above loop you add one $new_format[$each_rec], and
> then here you reopen the file and loop through every entry added so far,
> rewriting all of them.  I think if you just move this block to after the
> loop it will make a drastic improvement.
>> $fhandle = fopen("/ramdrive/import_file.txt", "w");
>>   foreach($new_format as $data) {
>>     fputs($fhandle, "$data");
>>     }
>>   fclose($fhandle);
>>
>> // advance to next record
>> echo "$i\n";
>> $i = $i + 1;
>>
>> $each_rec++;
>> }
> $fhandle = fopen("/ramdrive/import_file.txt", "w");
>    foreach($new_format as $data) {
>      fputs($fhandle, "$data");
>    }
> fclose($fhandle);
> 
>> ?>
>>
>>
>> This is running on a Toshiba Tecra A4 Laptop with FreeBSD 7.0 Release.
>> Plenty of RAM and HDD space. The PHP Version is:
>>
>> PHP 5.2.5 with Suhosin-Patch 0.9.6.2 (cli) (built: Feb 11 2009 09:28:47)
>> Copyright (c) 1997-2007 The PHP Group
>> Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
>>
>> What should I do to get this script to run faster?
>>
>> Any help is appreciated….
>>
>> Regards,
>>
>>
>>
>> Fred Schnittke
>>
>>
>>
> 
> 
I see that Paul replied and we said the same things, so here are the two
best approaches to speed it up, ignoring all the str_replace() calls, etc.
that need to be gotten rid of:

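// option 1 - I would assume this to be the most efficient / fastest
// FILE_IGNORE_NEW_LINES drops the trailing "\n" that file() otherwise
// keeps on each line; FILE_SKIP_EMPTY_LINES mimics PREG_SPLIT_NO_EMPTY
$ac_arr = file('access_log', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$records = preg_replace("/(\r|\t)/", "", $ac_arr);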

$fhandle = fopen("/ramdrive/import_file.txt", "w");

foreach($records as $all) {
	// manipulate your data
	// if you actually need the array then assign it here
	// $new_format[] =
	// "$IP\t$RU\t$RTS\t$HRL\t$HRSC\t$RCL\t$RefU\t$UA\t$all\n"
	// else just do this
	fputs($fhandle,
        "$IP\t$RU\t$RTS\t$HRL\t$HRSC\t$RCL\t$RefU\t$UA\t$all\n");
}
fclose($fhandle);
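
To put a rough number on why moving the write out of the loop helps: the
original script rewrote every record collected so far on each pass, so the
number of fputs() calls grows with the square of the record count. The
sketch below only counts calls; the function names and the figure of 1000
records are made up for illustration and are not from the original script.

```php
<?php
// Hypothetical helper: number of fputs() calls when the whole $new_format
// array is rewritten inside the parsing loop, as in the original script.
function count_writes_inside_loop($records)
{
    $writes = 0;
    for ($done = 1; $done <= $records; $done++) {
        // after $done records are parsed, the inner foreach writes all of them
        $writes += $done;
    }
    return $writes;
}

// Hypothetical helper: number of fputs() calls when each record is written
// exactly once (option 1), or once after the loop has finished.
function count_writes_single_pass($records)
{
    return $records;
}

echo count_writes_inside_loop(1000), "\n";  // 500500
echo count_writes_single_pass(1000), "\n";  // 1000
```

The in-loop version also reopens and truncates the output file on every
pass, which the count above does not include.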


// option 2 - if you don't need the array, just build one large string
$ac_arr = file('access_log', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$records = preg_replace("/(\r|\t)/", "", $ac_arr);

$new_format = ""; // initialize so the first .= does not raise a notice

foreach($records as $all) {
	// manipulate your data

	$new_format .=
        "$IP\t$RU\t$RTS\t$HRL\t$HRSC\t$RCL\t$RefU\t$UA\t$all\n";
}
file_put_contents("/ramdrive/import_file.txt", $new_format);
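
As for the "manipulate your data" step, one way to extract the fields
instead of finding and replacing them is a single preg_match() per line.
This is only a sketch: the line layout (ip, host, user, [timestamp],
"request", status, length, "referer", "agent", then whatever is left) is my
guess from the original script, and parse_log_line() and the sample line
are made up for illustration. Check the pattern against real lines from
your log before relying on it.

```php
<?php
// Hypothetical helper: pull all fields out of one log line in a single pass.
// Assumed layout (not confirmed against a real Domino log):
// ip host user [timestamp] "request" status length "referer" "agent" rest
function parse_log_line($line)
{
    static $pattern = '/^(\S+) (\S+) (.*?) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"\s*(.*)$/';

    if (!preg_match($pattern, rtrim($line, "\r\n"), $m)) {
        return null; // line does not match the assumed layout
    }

    // $m[2] (the host) is skipped, as in the original output format
    return array(
        'ip'      => $m[1],
        'user'    => trim($m[3], '" '), // strip optional quotes around the user
        'time'    => $m[4],
        'request' => $m[5],
        'status'  => $m[6],
        'length'  => $m[7],
        'referer' => $m[8],
        'agent'   => $m[9],
        'rest'    => $m[10],            // e.g. time to process the request
    );
}

// Made-up sample line for illustration only
$line = '192.168.0.1 www.example.com "Fred S" [23/Feb/2009:14:55:43 -0600] '
      . '"GET /index.html HTTP/1.1" 200 1234 "http://example.com/" "Mozilla/5.0" 15';

$rec = parse_log_line($line);
if ($rec !== null) {
    echo implode("\t", $rec), "\n"; // one tab-separated output record
}
```

Lines that return null can be logged or skipped. Note the pattern assumes
no embedded double quotes inside the request, referer, or agent fields.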


-- 
Thanks!
-Shawn
http://www.spidean.com
