Re: filter_var using regex

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Thu, 05 May 2011 22:25:36 +0100

On Thu, 2011-05-05 at 13:39 -0600, Jason Gerfen wrote:

> On 05/04/2011 03:10 PM, Ashley Sheridan wrote:
> > On Wed, 2011-05-04 at 13:46 -0600, Jason Gerfen wrote:
> > 
> >> On 05/04/2011 01:27 PM, Ashley Sheridan wrote:
> >>> On Wed, 2011-05-04 at 13:20 -0600, Jason Gerfen wrote:
> >>>
> >>>> I am running into a problem using the REGEXP option with filter_var().
> >>>>
> >>>> The string I am using: 09VolunteerApplication.doc
> >>>> The PCRE regex I am using:
> >>>> /^[a-z0-9]\.[doc|pdf|txt|jpg|jpeg|png|docx|csv|xls]{1,4}$/Di
> >>>>
> >>>> The function in its entirety:
> >>>> return (!filter_var('09VolunteerApplication.doc',
> >>>> FILTER_VALIDATE_REGEXP,
> >>>> array('options'=>array('regexp'=>'/^[a-z0-9]\.[doc|pdf|txt|jpg|jpeg|png|docx|csv|xls]{1,4}$/Di'))))
> >>>> ? false : true;
> >>>>
> >>>> Anyone have any insight into this?
> >>>>
> >>>
> >>>
> >>> You missed a + in your regex. At the moment you're only checking that
> >>> the filename starts with a single a-z letter or digit, followed by the
> >>> period. Then, oddly, you're checking for one to four extensions from
> >>> the list; are you sure you want to do that? Also, square brackets
> >>> match characters, not strings. Use parentheses to choose between
> >>> alternative strings.
> >>>
> >>> Try this:
> >>>
> >>> '/^[a-z0-9]+\.(doc|pdf|txt|jpg|jpeg|png|docx|csv|xls)$/Di'
> >>>
> >>> One other thing you should be aware of: filenames won't always
> >>> consist of just the letters a-z and digits 0-9. They may contain
> >>> accented or foreign letters, hyphens, spaces and a number of other
> >>> characters, depending on the client machine's OS. Windows, for
> >>> example, allows fewer characters than Unix-like OSes such as Mac OS
> >>> and Linux.
> >>>
> >>
> >> Both are valid PCRE regexes. However, the rule about using
> >> parentheses for an XOR of strings does not explain why a similar regex
> >> works with filter_var() like so:
> >>
> >> return (filter_var('kc-1', FILTER_VALIDATE_REGEXP,
> >> array('options'=>array('regexp'=>'/^[kc\-1|kc\-color|gr\-1|fa\-1|un\-1|un\-color|ben\-1|bencolor|sage\-1|sr\-1|st\-1]{1,8}$/Di'))))
> >> ? true : false;
> >>
> >> The above returns string(4) "kc-1"
> >>
> >> Another test using the following works similarly:
> >>
> >> return (filter_var('u0368839', FILTER_VALIDATE_REGEXP,
> >> array('options'=>array('regexp'=>'/^[gp|u|gx]{1,2}[\d+]{6,15}$/Di')))) ?
> >> true : false;
> >>
> >> The above returns string(8) "u0368839"
> >>
> >> And
> >> return (filter_var('gp123456', FILTER_VALIDATE_REGEXP,
> >> array('options'=>array('regexp'=>'/^[gp|u|gx]{1,2}[\d+]{6,15}$/Di')))) ?
> >> true : false;
> >>
> >> returns string(8) "gp123456"
> >>
> >> As you can see these three examples use the start [] as XOR conditionals
> >> for multiple strings as prefixes.
> >>
> >>
> >>
> > 
> > 
> > Not quite, you think they match correctly because that's all you're
> > testing for, and you're not looking for anything that might disprove
> > that. Using your last example, it will also match these strings:
> > 
> > gu0368839
> > xx0368839
> > p0368839
> > 
> > 
> > I tested your first regex with '09VolunteerApplication.doc' and it
> > doesn't work at all until you add in that plus after the basename match
> > part of the regex:
> > 
> > ^[a-z0-9]+\.[doc|pdf|txt|jpg|jpeg|png|docx|csv|xls]{1,4}$
> > 
> > However, your regex (with the plus) also matches these strings:
> > 
> > 09VolunteerApplication.docp
> > 09VolunteerApplication.docj
> > 09VolunteerApplication.doc|    <-- note it matches the literal bar character
> > 
> > Making the changes I suggested means the regex works as you expect:
> > 
> > ^[a-z0-9]+\.(doc|pdf|txt|jpg|jpeg|png|docx|csv|xls)$
> > 
> > Square brackets in a regex match a set of characters, not a literal
> > string, and without a quantifier they match only a single character
> > from that set. So in your example, you're matching an extension of one
> > to four characters, each of which can be any of '|cdefgjlnopstvx', and
> > a basename containing only 1 character that is either a-z or a digit.
> > 
> 
> You are right, after a few other tests I stand corrected. My apologies.
> However, according to the documentation for filter_var() and its PCRE
> regexp option, a return value of false, which is what I am getting,
> indicates an error with the regex.
> 
> In addition, I would like to point out that the same regex works as it
> should with the older preg_match() function, while with filter_var() the
> character class followed by the quantifier (+) fails the validation
> portion of the regex.
> 
> var_dump(filter_var('09VolunteerApplication.doc',
> FILTER_VALIDATE_REGEXP,
> array('options'=>array('regexp'=>'/^[a-z0-9]+\.(doc|pdf|txt|jpg|jpeg|png|docx|csv|xls){1,4}$/Di'))));
> 
> returns false (invalid regex) when using the character matching class
> [a-z0-9]+ with the filter_var() function with the FILTER_VALIDATE_REGEXP
> option
> 
> var_dump(preg_match('/^[a-z0-9]+\.(doc|pdf|txt|jpg|jpeg|png|docx|csv|xls){1,4}$/i',
> '09VolunteerApplication.doc'));
> 
> returns int(1), indicating a valid regex as well as a valid match.
> 
> I believe this should be reported as a bug but I appreciate your
> assistance and insights.
> 
> 

Remove the {1,4} bit, as with it you're looking for one to four
extensions in a row. It's a valid regex, sure, but not the regex to match
what you're looking for.
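
The difference between the bracketed and parenthesised versions, and the
effect of {1,4}, can be sketched roughly like this. It assumes a reasonably
recent PHP, and matches_pattern() is a made-up helper name for illustration,
not part of filter_var() itself:

```php
<?php
// Hypothetical helper (not from the thread): true when $name passes $pattern.
function matches_pattern(string $pattern, string $name): bool
{
    // filter_var() returns the input string on a match and false otherwise.
    return filter_var($name, FILTER_VALIDATE_REGEXP,
        array('options' => array('regexp' => $pattern))) !== false;
}

// Character class: 1 to 4 characters drawn from the set, including "|".
$class_pattern  = '/^[a-z0-9]+\.[doc|pdf|txt|jpg|jpeg|png|docx|csv|xls]{1,4}$/Di';
// Alternation group repeated 1 to 4 times: also accepts stacked extensions.
$repeat_pattern = '/^[a-z0-9]+\.(doc|pdf|txt|jpg|jpeg|png|docx|csv|xls){1,4}$/Di';
// Alternation group, no quantifier: exactly one of the listed extensions.
$group_pattern  = '/^[a-z0-9]+\.(doc|pdf|txt|jpg|jpeg|png|docx|csv|xls)$/Di';

$names = array(
    '09VolunteerApplication.doc',
    '09VolunteerApplication.docp',
    '09VolunteerApplication.doc|',
    '09VolunteerApplication.docdoc',
);

// Print one row per filename showing which patterns accept it.
foreach ($names as $name) {
    printf("%-32s class=%-5s repeat=%-5s group=%s\n", $name,
        var_export(matches_pattern($class_pattern, $name), true),
        var_export(matches_pattern($repeat_pattern, $name), true),
        var_export(matches_pattern($group_pattern, $name), true));
}
```

Only the last pattern should reject everything except the plain .doc name;
the bracketed one lets through junk like .docp and .doc|, and the repeated
group lets through .docdoc.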

Out of interest, why are you using a regex here? Is this filename coming
from a form upload element?

-- 
Thanks,
Ash
http://www.ashleysheridan.co.uk