Re: Seemingly weird regex problem

Tim Boring wrote:
Hello!  I'm having an odd regex problem.  Here's a summary of what I'm
trying to accomplish:

I've got a report file generated from our business management system
(Progress 4GL), one fixed-width record per line. I've got a php script
that reads in the raw file one line at a time, and "strips" out any
"unwanted" lines (repeated column headings, mostly).


I'm stripping out unwanted lines by looking at the beginning of each
line and doing the following:
1. If the line begins with a non-word character (\W+), discard it;
2. If the line begins with the word "Vendor", discard it;
3. If the line begins with "Loc", discard it;
4. If the line begins with a dash, discard it;
5. Else keep the line and write it to an output file.

The way I've implemented this in code is via the code snippet below. The problem I'm encountering, however, is that any line that begins with
a word, such as "AKRN", is matching rule #1, thus discarding the line. This is not what I want, but I'm having difficulty spotting my mistake.


To try to help spot the issue, I put in the if(preg_match("/^\W+/",
$line)) logic, and the weird thing is that this logic isn't outputting
the line beginning with things like "AKRN", yet the same line is getting
caught in the switch statement and being discarded.

Any suggestions?

while (!feof($input_handle))
{
    $line = fgets($input_handle);

\W is every NON-word character.

    if (preg_match("/^\W+/", $line))
    {

The manual says: "preg_match() returns the number of times pattern matches. That will be either 0 times (no match) or 1 time because preg_match() will stop searching after the first match."

So the return value is an integer (0 or 1), while $line will be a string.

        echo "$line\n";
    }

If the string is 0 bytes long, the switch expression equates to false and matches the first case expression that is also false.

    switch ($line)
    {

Is $total_counter less than or equal to 5? If yes, then this case runs. Else...

        case ($total_counter <= 5):
        fwrite($output_handle, $line);
        $counter++;
        $total_counter++;
        break;
       // Rule #1: non-word character

If $line is empty, then it typecasts to boolean false.
If the regexp does not find a match (which it won't in the case of the empty string), then preg_match() returns 0, which also equates to false. So this case always runs when the line is empty.


Probably every $line will fire this case.

       case preg_match("/^\W+/", $line):
          array_push($tossed_lines, $line);
          echo "Rule #1 violation\n";
          $tossed_counter++;
          $total_counter++;
          break;
        // Rule #2: "Vendor" at beginning of line

None of the rest will fire if $line is empty.
This case should fire if $line is not empty and starts with 'Vendor' (case insensitive), the idea being that a non-empty string and numeric 1 (the return value from preg_match()) both equate to true.


Yadda yadda yadda... I just ran a little test on PHP5 regarding this little problem. It's all a misunderstanding regarding automatic typecasting of strings, I think: switch compares with ==, and (on this PHP version at least) a string compared to an integer is converted to a number, not to a boolean:

$> php -r '

switch ("") {
   case 0:echo "hello\n";
}
switch ("yes") {
   case 1:echo "hello again\n";
}
switch ("") {
   case 1:echo "huh?\n";
}
switch ("yes") {
   case 0:echo "huh again?\n";
}

assert("" == 0);
assert("yes" == 1);
assert("" == 1);
assert("yes" == 0);

switch ("1yes") {
   case 1: echo "oh?\n";
}
switch ("0yes") {
   case 0: echo "geddit?\n";
}
'
hello
huh again?

Warning: assert(): Assertion failed in Command line code on line 17

Warning: assert(): Assertion failed in Command line code on line 18
oh?
geddit?
PHP 5.0.2 (cli) (built: Oct 21 2004 13:52:27)
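
For what it's worth, here is a small sketch of the usual way around this: switch on true and cast each preg_match() result to bool, so every comparison is boolean against boolean rather than string against integer. The function name classify_line() and the sample lines are made up for illustration. How a non-numeric string compares to 0 with == depends on the PHP version (the test above was on 5.0.2), which is one more reason not to lean on it.

```php
<?php
// Hypothetical helper, not from the original script: returns the number of
// the rule a line violates, or 0 if the line should be kept.
function classify_line($line)
{
    switch (true) {
        // Each case is now a real boolean, compared against true.
        case (bool) preg_match('/^\W/', $line):
            return 1;
        case (bool) preg_match('/^Vendor/i', $line):
            return 2;
        case (bool) preg_match('/^Loc/i', $line):
            return 3;
        default:
            return 0;
    }
}

echo classify_line("AKRN  1234  widget\n"), "\n";   // 0 (kept)
echo classify_line("Vendor  Name\n"), "\n";         // 2
echo classify_line("-----------\n"), "\n";          // 1
```

A line that starts with a word, like "AKRN", now falls through to default instead of being caught by Rule #1.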




        case preg_match("/^Vendor/i", $line):
          array_push($tossed_lines, $line);
          echo "Rule #2 violation\n";
          $tossed_counter++;
          $total_counter++;
          break;
       // Rule #3: "Loc" at beginning of line
        case preg_match("/^Loc/i", $line):
          array_push($tossed_lines, $line);
          echo "Rule #3 violation\n";
          $tossed_counter++;
          $total_counter++;
          break;
       // Rule #4: dash character at beginning of line


I think the /^\W+/ above will always catch this case first, since a dash is itself a non-word character.
Change the order of the case statements?

        case preg_match("/^\-/", $line):
           array_push($tossed_lines, $line);
           echo "Rule #4 violation\n";
           $tossed_counter++;
           $total_counter++;
           break;
        default:
           fwrite($output_handle, $line);
           $counter++;
           $total_counter++;
           break;
       }
     }
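
Alternatively, plain if/else avoids the switch comparison altogether. The following is only a sketch under a few assumptions: should_discard() and filter_report() are illustrative names, the php://memory streams stand in for the real input and output files, and the "$total_counter <= 5" case from the original is left out. Rule #4 is folded into Rule #1 because a dash already matches \W.

```php
<?php
// Sketch only: should_discard() and filter_report() are illustrative names,
// not part of the original script.

// True if the line is one of the unwanted heading/separator lines.
function should_discard($line)
{
    // Rule #1 (and #4): starts with a non-word character; "-" is one of them.
    if (preg_match('/^\W/', $line) === 1) {
        return true;
    }
    // Rules #2 and #3: starts with "Vendor" or "Loc", case-insensitive.
    return preg_match('/^(Vendor|Loc)/i', $line) === 1;
}

// Copies wanted lines from $in to $out; returns the discarded lines.
function filter_report($in, $out)
{
    $tossed = array();
    // fgets() returns false at end of file, so test that instead of feof().
    while (($line = fgets($in)) !== false) {
        if (should_discard($line)) {
            $tossed[] = $line;
        } else {
            fwrite($out, $line);
        }
    }
    return $tossed;
}

// Usage with in-memory streams standing in for the real files.
$in  = fopen('php://memory', 'w+');
$out = fopen('php://memory', 'w+');
fwrite($in, "Vendor  Name\n-------\nAKRN  1234\nLoc  Qty\n");
rewind($in);

$tossed = filter_report($in, $out);
rewind($out);
echo stream_get_contents($out);            // AKRN  1234
echo count($tossed), " lines tossed\n";    // 3 lines tossed
```

Comparing the preg_match() result to 1 with === keeps string-to-number conversion out of the picture entirely.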


-- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php

