Re: big table / hadoop / map reduce

Hi Artur,
in your IP example, let's suppose you have ten access log files (from
ten different servers);
there you already have the mapping part done.

Then you reduce each log into another new file, indicating each IP
address and the number of times it is repeated.
At this stage you have a reduced version of each log file; then you
need to map them into a single new file,
which will be the merge of all the reduced versions of the log files.
With this single file, you will need to reduce again, and there you
will have one file with all the
IP addresses and the number of times each appears.

There is no limit on the times you can call map and reduce.
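A minimal sketch of those two rounds in Python (the log lines and helper names below are made up for illustration; this shows the idea, not Hadoop's actual API):

```python
from collections import Counter

def map_ips(log_lines):
    """Map: emit an (ip, 1) pair per log line (IP assumed to be the first field)."""
    return [(line.split()[0], 1) for line in log_lines if line.strip()]

def reduce_ips(pairs):
    """Reduce: sum the counts per IP address."""
    totals = Counter()
    for ip, count in pairs:
        totals[ip] += count
    return totals

# Ten servers would give ten logs; two tiny ones stand in for them here.
logs = [
    ["1.1.1.1 GET /", "2.2.2.2 GET /", "1.1.1.1 GET /a"],
    ["2.2.2.2 GET /", "2.2.2.2 GET /b"],
]

# Round 1: map and reduce each log file on its own.
reduced_per_file = [reduce_ips(map_ips(log)) for log in logs]

# Round 2: merge the reduced versions into one list and reduce again.
merged = [pair for counts in reduced_per_file for pair in counts.items()]
final = reduce_ips(merged)  # one record per IP with its total count
```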

Cheers.

On 30 October 2010 15:51, Artur Ejsmont <ejsmont.artur@xxxxxxxxx> wrote:
> sure that was a bit more helpful, thanks :)
>
> I was still wondering what other use cases that would apply to. This
> is a good article (the best so far, I guess):
> http://code.google.com/edu/parallel/mapreduce-tutorial.html
>
> The thing is that reduce has to aggregate data or it would be
> impractical. So I am trying to see more examples to fully understand
> the limitations of the method.
>
> Let's say I want to find the top 10 IP addresses in an access log:
> - split the log into small files
> - I take one fragment (one file)
> - a worker maps it to a list of <ip, 1>
> - before reduce is called, the data is sorted by ip
> - reduce makes <ip, totalCountPerLogFileSample>
>
> So I have a bunch of files with aggregated lists of <IP,
> totalCountPerFile>. But then would it not have to be merged across all
> the results again, with another sort/reduce call? Or, to avoid that, do I
> need the initial data to be clustered already, so one IP appears only in
> one chunk file?
>
> Does it make sense?
>
> As I said, I am still trying to figure out how it should be applied and
> when ... also how to transform problems so it still works : )
>
> I want to write a simple map reduce like the one above, just to see
> it working and play around a bit :)
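On the merge question above: in Hadoop the input does not need to be pre-clustered; the framework's shuffle step does that second merge, grouping every mapper's output by key so that each reduce call sees all the counts for one IP, whichever chunk files it came from. A rough sketch of that grouping idea in Python, with made-up data:

```python
from collections import defaultdict

# Output of the map phase, one list per chunk file (made-up data);
# note the same IP appears in more than one chunk.
mapper_outputs = [
    [("10.0.0.1", 1), ("10.0.0.2", 1)],  # from chunk file 1
    [("10.0.0.1", 1), ("10.0.0.1", 1)],  # from chunk file 2
]

# "Shuffle": group values by key across all chunks, so the input
# does NOT need to be clustered by IP beforehand.
grouped = defaultdict(list)
for chunk in mapper_outputs:
    for ip, count in chunk:
        grouped[ip].append(count)

# Reduce: one total per IP, then take the top 10.
totals = {ip: sum(counts) for ip, counts in grouped.items()}
top10 = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
```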
>
> cheers
>
> Art
>
> On 22 October 2010 16:49, Andrés G. Montañez <andresmontanez@xxxxxxxxx> wrote:
>> Imagine you have to keep track of some kind of traffic, for example,
>> "ad impressions";
>> let's suppose that you have millions of those hits; you will need
>> a few servers to
>> receive the notifications of the impressions of an ad.
>>
>> At the end of the day, you will have that info spread across a bunch of
>> servers; mostly you will have
>> a record of each impression indicating the identifier (id) of the ad.
>>
>> For this info to become useful, you will have to aggregate it; for
>> example, to know which ad has the most impressions.
>> You will have to iterate over all the servers and MAP the info into one
>> place; now that you have all the info,
>> you will have to REDUCE it, so you will have one record per ad
>> identifier indicating the TOTAL impressions of that day.
>>
>> That's the basic idea. It's a variant of "Divide and Conquer".
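That map-then-reduce flow for ad impressions can be sketched like this (server names and ad ids are invented):

```python
from collections import Counter

# Each server's records for the day: one ad id per impression (invented data).
server_records = {
    "server-a": ["ad-1", "ad-2", "ad-1"],
    "server-b": ["ad-2", "ad-2", "ad-3"],
}

# MAP: pull every server's records into one place as (ad_id, 1) pairs.
mapped = [(ad_id, 1)
          for records in server_records.values()
          for ad_id in records]

# REDUCE: one record per ad id with the day's TOTAL impressions.
totals = Counter()
for ad_id, count in mapped:
    totals[ad_id] += count

best_ad, best_count = totals.most_common(1)[0]  # the ad with most impressions
```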
>>
>> Hope this will be useful.
>>
>> Cheers.
>>
>> On 22 October 2010 13:27, Artur Ejsmont <ejsmont.artur@xxxxxxxxx> wrote:
>>> Hehe ... sorry, but this does not help :-) I can google for Wikipedia
>>> definitions.
>>>
>>> I was hoping for some really good articles/examples that would put it
>>> into enough context. I would like to have a good idea of when it could be
>>> useful.
>>>
>>> So far I've had no luck with that. It's like with design patterns ... people
>>> who don't understand them should not write articles trying to explain
>>> them to others :P
>>>
>>> Art
>>>
>>> On 22 October 2010 15:29, Andrés G. Montañez <andresmontanez@xxxxxxxxx> wrote:
>>>> Hi Artur,
>>>>
>>>> Here is an article on wikipedia: http://en.wikipedia.org/wiki/MapReduce
>>>>
>>>> And here are the native implementations in php:
>>>> http://www.php.net/manual/en/function.array-map.php
>>>> http://www.php.net/manual/en/function.array-reduce.php
>>>>
>>>> The basic idea is to gather a lot of data, from several nodes, and
>>>> "map" it together;
>>>> then, assuming a lot of this data is repeated across the dataset, we
>>>> "reduce" it.
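The same gather-then-collapse idea in miniature, using Python's built-in map and functools.reduce as stand-ins for the PHP array functions linked above (the hit data is invented):

```python
from functools import reduce

hits = ["a", "b", "a", "a", "b"]  # invented raw events gathered from several nodes

# "map": turn each raw hit into a (key, 1) pair.
pairs = list(map(lambda key: (key, 1), hits))

# "reduce": collapse the repeated keys into one total per key.
def combine(acc, pair):
    key, count = pair
    acc[key] = acc.get(key, 0) + count
    return acc

totals = reduce(combine, pairs, {})
```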
>>>>
>>>>
>>>> Cheers.
>>>>
>>>> On 22 October 2010 12:14, Artur Ejsmont <ejsmont.artur@xxxxxxxxx> wrote:
>>>>> Hi there guys and girls
>>>>>
>>>>> Has anyone come across any reasonable explanation / articles on how
>>>>> Hadoop and map reduce work in practice?
>>>>>
>>>>> I have read a few articles now and then, and I must say I am puzzled
>>>>> ... am I stupid, or can they just not find an easy way to explain it? :P
>>>>>
>>>>> What I would hope for is an explanation of a simple example application,
>>>>> preferably with some code samples.
>>>>>
>>>>> anyone good at it here?
>>>>>
>>>>> cheers
>>>>>
>>>>> --
>>>>> PHP Database Mailing List (http://www.php.net/)
>>>>> To unsubscribe, visit: http://www.php.net/unsub.php
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Andrés G. Montañez
>>>> Zend Certified Engineer
>>>> Montevideo - Uruguay
>>>>
>>>
>>>
>>>
>>> --
>>> Visit me at:
>>> http://artur.ejsmont.org/blog/
>>>
>>
>>
>>
>> --
>> Andrés G. Montañez
>> Zend Certified Engineer
>> Montevideo - Uruguay
>>
>
>
>
> --
> Visit me at:
> http://artur.ejsmont.org/blog/
>



-- 
Andrés G. Montañez
Zend Certified Engineer
Montevideo - Uruguay




