Re: Removing UTF-8 from text

On 05/01/07, Richard Lynch <ceo@xxxxxxxxx> wrote:
On Wed, January 3, 2007 2:41 pm, Dotan Cohen wrote:
> On 03/01/07, Richard Lynch <ceo@xxxxxxxxx> wrote:
>> Instead of trying to strip the UTF stuff out, try to capture the
>> part
>> you want:
>>
>> preg_match_all('|<[^>]>|ms', $emails, $output);
>> var_dump($output);
>>
>
> Richard, I do have a working script now, but I'm intrigued by your
> regex. Why do you surround the needle with pipes, and what is the "ms"
> for?

The start/end delimiter can be almost any character you want (anything
that is not alphanumeric, a backslash, or whitespace), so pick
whichever is convenient.

If the "pattern" you are looking for has a '|' in it, then '|' would
be very inconvenient, as you'd have to escape it.

But if it has no '|' in the pattern, '|' is convenient.

It's traditional to use '/' but because / is already used in pathnames
and HTML tags, I find myself using '|' more often, as I seldom have
patterns with '|' in them as a meaningful character that I need to
type.

You can also (in some versions) use "matching" start/end delimiters,
like < with > or { and } and so on.

In this particular case, almost anything except < and > would be
convenient, so I could have chosen any of these:
|(<[^>]*>)|
/(<[^>]*>)/
{(<[^>]*>)}
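
A minimal sketch of the point above, assuming a made-up $emails string
(the addresses are hypothetical, not from the original script). It
runs the same pattern with each of the three delimiters and then shows
the escaping cost when the delimiter also appears in the pattern:

```php
<?php
// Hypothetical sample input; the names and addresses are invented.
$emails = "Alice <alice@example.com>,\nBob <bob@example.com>";

// The same pattern wrapped in three different delimiters.
$patterns = [
    '|(<[^>]*>)|',
    '/(<[^>]*>)/',
    '{(<[^>]*>)}',
];

$results = [];
foreach ($patterns as $pattern) {
    preg_match_all($pattern, $emails, $output);
    $results[] = $output[1]; // contents of the first capture group
}

// All three entries of $results hold the same two matches.
var_dump($results);

// When the pattern itself contains '/', the '/' delimiter forces an escape,
// while '|' does not.
$html = 'some <b>bold</b> text';
$withSlash = preg_match('/<\/b>/', $html);
$withPipe  = preg_match('|</b>|', $html);
```

Both calls at the end find the closing tag; the only difference is how
much escaping you have to type.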

[aside]
Notice how I subtly corrected my obvious mistakes this time around... :-)
[/aside]

The 'm' tacked on at the end makes '^' and '$' match at the start and
end of each line in the content, not just at the very start and end of
the whole string.

Actually, the 'm' is not needed here, as there are no '^' or '$' in
the pattern, and [^>] matches newlines on its own.
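
For what it's worth, here is a small sketch of what 'm' does in PCRE,
using a made-up two-line string. The modifier only changes where '^'
and '$' are allowed to match; a negated class such as [^>] matches
newlines with or without it:

```php
<?php
// Made-up sample text with one embedded newline.
$text = "first line\nsecond line";

// Without 'm', '^' matches only at the very start of the subject.
$plainCount = preg_match_all('|^\w+|', $text, $plain);

// With 'm', '^' also matches immediately after each newline.
$multiCount = preg_match_all('|^\w+|m', $text, $multi);

// A negated character class matches a newline regardless of modifiers.
$tagCount = preg_match('|<[^>]*>|', "<a\nb>", $tag);

var_dump($plain[0], $multi[0], $tag[0]);
```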

The 's' allows the '.' (if I had one, which I don't) to match newlines
within the string as well as other characters.  It is totally
pointless to have included 's' in this case, since I have no '.' in
the pattern in the first place.  Just habit, I guess.
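
A quick sketch of the 's' behaviour, assuming a made-up snippet with a
newline between the tags. It only matters once the pattern actually
contains a '.':

```php
<?php
// Made-up sample: the text between the tags spans two lines.
$text = "<b>bold\ntext</b>";

// Without 's', '.' refuses to match the newline, so the match fails.
$withoutS = preg_match('|<b>(.*)</b>|', $text, $m1);

// With 's', '.' matches the newline too, so the whole span is captured.
$withS = preg_match('|<b>(.*)</b>|s', $text, $m2);

var_dump($withoutS, $withS, $m2[1]);
```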

I generally find that if I have a big ol' chunk of text, and I want to
do PCRE on it, and it might have newlines, I want 'ms' on the end, and
I don't want that if it's just a single line of text.

I'm still definitely more in the Cargo Cult, perhaps graduating to
Voodoo Programming style, of PCRE pattern composing.  Maybe someday
I'll *really* understand regex, and graduate to Competent.  I doubt it
though.


Thanks. This is getting filed under my regex-emergencies label. I'll
definitely be referencing this again. As usual, I do prefer to be
taught to capture fish rather than be handed a fish.

Dotan Cohen

http://lyricslist.com/lyrics/artist_albums/517/yaz.html
http://what-is-what.com/what_is/sitepoint.html

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

