Re: extract Occurrences AFTER ... and before "-30-"

Matijn Woudt <tijnema@xxxxxxxxx> · Sun, 2 Sep 2012 17:45:55 +0200

On Sun, Sep 2, 2012 at 4:36 PM, John Taylor-Johnston
<jt.johnston@xxxxxxxxxxxxxx> wrote:
>
>> On Sun, Sep 2, 2012 at 6:23 AM, John Taylor-Johnston
>> <jt.johnston@xxxxxxxxxxxxxx> wrote:
>>>
>>> See:
>>> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.php
>>> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.phps
>>>
>>> In $mystring, I need to extract everything between "|News Releases|" and
>>> "-30".
>>>
>>> The thing now is $mystring might contain many instances of "|News
>>> Releases|"
>>> and "-30".
>>>
>>> How do I deal with this? My code only catches the first instance.
>>>
>>> Thanks for your help so far.
>>>
>>> John
>>>
>> You could use substr to retrieve the rest of the string and just start
>> over (do it in a while loop to catch all).
>> Though, it's probably not really efficient if you have long strings.
>> You'd be better off with preg_match. You can do it all with a single
>> line of code, albeit that regex takes quite some time to figure out if
>> not experienced.
>>
>> - Matijn
>>
>> PS. Please don't top post on this and probably any mailing list.
>
> Matijn, I'm a habitual top quoter. Horrible :)) But bottom quoting is not
> intuitive. But those are the rules, so I will be a good poster :))

I actually do find it intuitive: when you read a thread back, the answer
comes after the question, which makes sense. The other way around
doesn't.

>
> I will have very, very long strings. It will be a corpus of text, of maybe
> 1-2 megs of text.
>
> I'm not terribly experienced. How would I "while" loop this?
>
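
A rough sketch of the loop idea, assuming the markers are exactly
"|News Releases|" and "-30-" (the sample string and variable names are
made up for illustration). Instead of cutting the string down with
substr on every pass, it uses the offset argument of strpos to continue
searching where the previous match ended, which avoids copying a large
string over and over:

```php
<?php
// Hypothetical sample text; the real $mystring would be the full corpus.
$mystring = "intro |News Releases| first story -30- filler |News Releases| second story -30- tail";

$start_tag = "|News Releases|";
$end_tag   = "-30-";
$releases  = array();
$offset    = 0;

// Keep searching from the position where the previous match ended.
while (($start = strpos($mystring, $start_tag, $offset)) !== false) {
    $start += strlen($start_tag);                // skip past the start marker
    $end = strpos($mystring, $end_tag, $start);  // next end marker after it
    if ($end === false) {
        break;                                   // unterminated block: stop
    }
    $releases[] = trim(substr($mystring, $start, $end - $start));
    $offset = $end + strlen($end_tag);           // resume after the end marker
}

print_r($releases);
```

All extracted blocks end up in the single array $releases.
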
> I am reading preg-match and the examples, but I don't really follow.
> http://www.php.net/manual/en/function.preg-match.php
>
> I admit, I don't know what "/php/i" means.
>

Well, it matches the text "php" anywhere in a string; the i modifier
makes the match case-insensitive, so pHp or PHP match as well. It's not
very useful on its own, but that brings me to Frank's example, which is
in the right direction.
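
To make the "/php/i" pattern from the manual concrete, here is a small
sketch; the test strings are invented for illustration:

```php
<?php
// Sample strings are made up for illustration.
$subjects = array("I like PHP.", "pHp is fine too", "No match here");

foreach ($subjects as $subject) {
    // preg_match returns 1 when the pattern matches, 0 when it does not.
    $found = preg_match("/php/i", $subject);
    echo $subject, " => ", $found, "\n";
}

// Without the i modifier the match is case-sensitive.
$strict = preg_match("/php/", "I like PHP.");
echo "strict => ", $strict, "\n";
```
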

On Sun, Sep 2, 2012 at 4:33 PM, Frank Arensmeier <farensmeier@xxxxxxxxx> wrote:
> My approach would be to split the whole text into smaller chunks (with e.g. explode()) and extract the interesting parts with a regular expression. Maybe this will give you some ideas:
>
> $chunks = explode("-30-", $mystring);
> foreach($chunks as $chunk) {
>         preg_match_all("/News Releases\n(.+)/s", $chunk, $matches);
>         var_dump($matches[1]);
> }
>
> The regex matches all text between "News Releases" and the end of the chunk.

You shouldn't need to explode the string first; a single
preg_match_all can do it. (Sorry, I can't remember exactly how anymore,
it's been a while since I last used PCRE.)
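
One possible way to do it in a single call, offered as a sketch rather
than a tested solution for the real data. It assumes the markers are
literally "|News Releases|" and "-30-", and the sample string is made
up. The key idea is a non-greedy capture, (.*?), which stops at the
first "-30-" it meets instead of running on to the last one:

```php
<?php
// Hypothetical sample text standing in for the real corpus.
$mystring = "intro |News Releases|\nfirst story\n-30-\nfiller |News Releases|\nsecond\nstory\n-30- tail";

// \| escapes the pipe, which is otherwise the alternation operator.
// (.*?) is non-greedy, so each capture ends at the nearest -30-.
// The s modifier lets the dot match newlines.
preg_match_all('/\|News Releases\|(.*?)-30-/s', $mystring, $matches);

// $matches[1] holds the captured text of every occurrence, in one array.
$releases = array_map('trim', $matches[1]);
print_r($releases);
```

On input of a megabyte or more it is probably worth checking
preg_last_error() afterwards, since PCRE has configurable limits that a
long match can run into.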

On Sun, Sep 2, 2012 at 4:52 PM, John Taylor-Johnston
<jt.johnston@xxxxxxxxxxxxxx> wrote:
> I could live with that, I think. Here is the output:
> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test2.php
>
> Here are the newbie questions
>
> Why is there more than one array?

One for each preg_match_all call in the loop, because var_dump runs
once per chunk.
>
> array(1)
>
> What are string(190) and string string(247)? Why are they named like that?

string(190) and array(1) are just var_dump's output format: the type,
followed by the string length or the number of array elements. You won't
see them if you use print_r instead. Have a look at the var_dump manual
page to learn more.

> http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test2.php
> Please explain the / (.+)/s in "/News Releases\n(.+)/s"?

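The explode splits the text at each "-30-", so every chunk ends just
before a "-30-" marker. (.+) captures one or more characters, as many as
possible, which here means everything up to the end of the chunk. The s
modifier makes the dot match newlines too; without it (.+) would stop at
the end of the first line.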
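
The effect of the s modifier is easiest to see side by side. A small
sketch with an invented chunk:

```php
<?php
// Made-up chunk, as it might look after explode("-30-", ...).
$chunk = "News Releases\nline one\nline two\n";

// With s: the dot matches newlines, so (.+) runs to the end of the chunk.
preg_match("/News Releases\n(.+)/s", $chunk, $with_s);

// Without s: the dot stops at the first newline, so only one line is captured.
preg_match("/News Releases\n(.+)/", $chunk, $without_s);

var_dump($with_s[1]);
var_dump($without_s[1]);
```
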
>
> My question is:
> Is one array not better? (My next step will be to parse the array to find
> the frequency of each word ... an array.)
>

Sure, you just need to figure out the PCRE pattern. You can find out
more about PCRE syntax on Google (it's pretty much the same as in Perl
and other languages) and in the PHP manual.
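
Until the pattern is worked out, Frank's loop can also be made to fill
one array. The sketch below assumes his "News Releases\n" pattern and
uses an invented sample string; the word count at the end is only one
possible way to approach the frequency step:

```php
<?php
// Hypothetical sample text with two releases.
$mystring = "News Releases\nthe cat sat\n-30-\nNews Releases\nthe dog sat\n-30-\n";

$releases = array();
foreach (explode("-30-", $mystring) as $chunk) {
    // preg_match is enough here: (.+) with s matches at most once per chunk.
    if (preg_match("/News Releases\n(.+)/s", $chunk, $m)) {
        $releases[] = trim($m[1]);   // append to the single result array
    }
}

// Word frequencies across all releases, most frequent first.
$words  = str_word_count(strtolower(implode(" ", $releases)), 1);
$counts = array_count_values($words);
arsort($counts);
print_r($counts);
```
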

- Matijn

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php