On Mon, Feb 23, 2009 at 11:59:20AM -0500, fschnittke@xxxxxxxxxxxxx wrote:

> Hi:
>
> Newbie here. This is my first attempt at PHP scripting. I'm trying to find
> an alternative to Lotus Domino's domlog.nsf for logging web transactions.
> Domino does create an Apache compatible text file of the web transactions,
> and this is what I'm trying to parse. I started off using a code snippet I
> found on the web. I modified it a little bit to suit my needs. It was
> working fine with the small 600k test log file I was using, but since I've
> moved to the larger 18Mb production log file here's what happens:
>
> I've modified the code and added an echo statement to echo each loop that
> gets processed. Initially it starts off very fast but then performance
> becomes very slow, to a point where I can count each loop as it's being
> processed. It's taking a little over 3 hours to parse the entire file. I
> figured it was a disk cache thing, so I created a ram drive. This has
> improved the performance, but is still taking an hour to parse.
>
> Here is the PHP script I'm using:
>
>
> <?php
>
> $ac_arr = file('access_log');
> $astring = join("", $ac_arr);

First, don't use file() (which reads a file into an array of strings)
and then join() (which makes them into a single string). Use
file_get_contents(), which does what you want all in one step. The
result is a single string.

> $astring = preg_replace("/(\r|\t)/", "", $astring);

Use str_replace() for the above (two calls, or one call with an array of
search strings). It's likely faster because it avoids running a regular
expression over an 18MB string.

> $records = preg_split("/(\n)/", $astring, -1, PREG_SPLIT_NO_EMPTY);

It looks like you're trying to split the string at newlines, so you end
up with an array of strings again. If that's the case, just stick with
the file() call earlier, and then remove the \r's and \t's from each
line.

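
In case it helps, here is a rough sketch of that file()-based approach.
The temporary file and its two sample lines are made up so the snippet
runs on its own; in your script the path would just be 'access_log'.
Treat it as an illustration, not a drop-in replacement.

```php
<?php
// Hypothetical sample data so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'log');
file_put_contents($path, "1.2.3.4 - -\t[a]\r\n\r\n5.6.7.8 - - [b]\r\n");

// file() already returns one array element per line;
// FILE_IGNORE_NEW_LINES leaves the newline off each element.
$records = file($path, FILE_IGNORE_NEW_LINES);

// str_replace() accepts arrays for the search and the subject, so one
// call strips "\r" and "\t" from every line without a regular expression.
$records = str_replace(array("\r", "\t"), '', $records);

// Drop lines that are now empty (what PREG_SPLIT_NO_EMPTY did before)
// and reindex the array from 0.
$records = array_values(array_filter($records, 'strlen'));

unlink($path);
print_r($records);
```

The empty-line filter runs after the replacement on purpose: a line
holding only "\r" is not empty until the "\r" has been removed.
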
>
> $sizerecs = sizeof($records);
>
> // now split into records
> $i = 1;
> $each_rec = 0;
>
> while($i<$sizerecs) {
>   $all = $records[$i];
>
>   // IP Address ($IP):
>   $IP = substr($all, 0, strpos($all, " "));
>   $all = str_replace($IP, "", $all);
>
>   //Remote User ($RU):
>   $string = substr($all, 0, strpos($all, " [")); // www.vpcl.on.ca T123
>   $sstring = substr($string, strpos($string, " ")+1);
>   $AUstring = substr($sstring, strpos($sstring, " "));
>   $RU = preg_replace("/\"/", "", $AUstring);
>   $RU = trim($RU);
>   $all = str_replace($string, "", $all);
>
>   //Request Time Stamp ($RTS):
>   preg_match("/\[(.+)\]/", $all, $match);
>   $RTS = $match[1];
>   $all = str_replace(" [$RTS] \"", "", $all);
>
>   //Http Request Line ($HRL):
>   $string = substr($all, 0, strpos($all, "\"")+2);
>   $HRL = str_replace("\"", "", $string);
>   $all = str_replace($string, "", $all);
>
>   //Http Response Status Code (HRSC):
>   $HRSC = trim(substr($all, 0, strpos($all, " ")+1));
>   $all = str_replace($HRSC, "", $all);
>
>   //Request Content Length (RCL):
>   $string = substr($all, 0, strpos($all, "\"")+1);
>   $RCL = trim(str_replace("\"", "", $string));
>   $all = str_replace($string, "", $all);
>
>   //Referring URL (RefU):
>   $string = substr($all, 0, strpos($all, "\"")+3);
>   $RefU = substr($all, 0, strpos($all, "\""));
>   $all = str_replace($string, "", $all);
>
>   //User Agent (UA):
>   $string = substr($all, 0, strpos($all, "\"")+2);
>   $UA = substr($all, 0, strpos($all, "\""));
>   $all = str_replace($string, "", $all);
>
>   //Time to Process Request:
>
>   #$new_format[$each_rec] = "$UA\n";
>   $new_format[$each_rec] =
>     "$IP\t$RU\t$RTS\t$HRL\t$HRSC\t$RCL\t$RefU\t$UA\t$all\n";

I would do various cleanups in the above code, but it's the file-writing
code further down (the fopen()/foreach/fclose() block) that disturbs me.
If I'm not misreading, you're reopening this output file with every
iteration through your file. That is, for each line you process, you're
opening the file, writing to it, then closing it again.
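
For comparison, a minimal sketch of a loop that opens the output only
once might look like this. The three sample records, the
format_record() helper (standing in for all the field extraction
above) and the temporary output file are invented for the example;
they are not from your script.

```php
<?php
// Hypothetical stand-ins: $records would come from the access log, and
// format_record() represents the field-extraction code in the loop body.
$records = array('line one', 'line two', 'line three');

function format_record($line)
{
    return strtoupper($line) . "\n";
}

// Stand-in for /ramdrive/import_file.txt.
$out = tempnam(sys_get_temp_dir(), 'imp');

// Open the output once, before the loop.
$fhandle = fopen($out, 'w');
if ($fhandle === false) {
    exit("Cannot open $out for writing\n");
}

foreach ($records as $line) {
    // One write per record; nothing accumulates in an array.
    fputs($fhandle, format_record($line));
}

// Close once, after the loop.
fclose($fhandle);

// Read the result back only to show what was written.
$written = file_get_contents($out);
unlink($out);
echo $written;
```

Using foreach here also sidesteps the manual counter. Your while() loop
starts at $i = 1, which looks like it skips $records[0]; I can't tell
whether that is intentional.
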
If you're relatively certain no one else will be writing to this file,
simply open it once before your while() loop, and write a single line to
it at this point.

Also, you're using a foreach loop here, but if I'm reading this
properly, you only have one record you're writing, so there's no need
for a loop. In fact, you're also storing the results in a progressively
larger and larger array ($new_format). Since you're not using this array
for any other purpose, there's no point in using an array. Just use a
string to store it, and then write it back out. So instead of what
you're doing below, and considering that $new_format will be a string
instead of an array, just do:

fputs($fhandle, $new_format);

Close the file after the end of your loop.

I must be reading this wrong, because it looks to me like you're
actually writing n+1 lines to the file for each iteration (n). I don't
think that's what you had in mind. That's why you don't need a loop
here.

If what I'm saying isn't clear, consider this: The way this is written,
$new_format is an array that grows by one value (string) each iteration.
But with each iteration, you're writing out the *whole* array to the
file. The first time through, you write one line. The second time
through, you're writing the first line again, and then the second. Third
time through, you're writing the first line again, then the second
again, then the third. And so on. Because the file is opened with mode
"w", each pass truncates it and rewrites everything so far, so the total
work grows roughly with the square of the number of records. That would
explain why the script starts fast and then slows to a crawl.

Now, if you stay with your existing code, you could move your loop end
to *above* this point, and simply perform your file-open/loop/file-close
operation once, after the loop has terminated. That's probably faster
than stopping inside each iteration and writing a line.

>
>   $fhandle = fopen("/ramdrive/import_file.txt", "w");
>   foreach($new_format as $data) {
>     fputs($fhandle, "$data");
>   }
>   fclose($fhandle);
>
>   // advance to next record
>   echo "$i\n";
>   $i = $i + 1;
>
>   $each_rec++;
> }
> ?>
>
>
> This is running on a Toshiba Tecra A4 Laptop with FreeBSD 7.0 Release.
> Plenty of RAM and HDD space. The PHP Version is:
>
> PHP 5.2.5 with Suhosin-Patch 0.9.6.2 (cli) (built: Feb 11 2009 09:28:47)
> Copyright (c) 1997-2007 The PHP Group
> Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
>
> What should I do to get this script to run faster?
>
> Any help is appreciated.
>
> Regards,
>
>
>
> Fred Schnittke
>

Hope I read things correctly and that this helps. Sorry if my analysis
is off.

Paul

--
Paul M. Foster

--
PHP General Mailing List (http://www.php.net/)