Re: big table / hadoop / map reduce

Yeah, I think that would make sense.

If you find more good examples from different areas, let me know ... I
think I get the basic idea ... will try to apply it some time :)

cheers :)

art

On 30 October 2010 18:58, Andrés G. Montañez <andresmontanez@xxxxxxxxx> wrote:
> Hi Artur,
> in your IP example, let's suppose you have ten access log files (from
> ten different servers);
> there you already have the mapping part done.
>
> Then you reduce each log into another new file, indicating each IP
> address and the number of times it appears.
> At this stage you have a reduced version of each log file; then you
> need to map them into a new unique file,
> which will be the merge of all the reduced versions of the log files.
> With this unique file, you will need to reduce it again, and there you
> will have a single file with all the
> IP addresses and the number of times they appear.
>
> There is no limit on the number of times you can call map and reduce.
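The two stages described above (reduce each log, then merge the partial results and reduce again) could be sketched like this. This is a minimal Python sketch, not PHP, and the sample logs are made up for illustration:

```python
from collections import Counter

# Pretend these are the per-server access logs (one list of IPs per file).
logs = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.1"],   # server 1
    ["10.0.0.2", "10.0.0.3"],               # server 2
]

# First reduce: one counter per log file, IP -> count within that file.
reduced_per_file = [Counter(log) for log in logs]

# Map the reduced versions into one place, then reduce again:
# merge all partial counts into a single global count per IP.
total = Counter()
for partial in reduced_per_file:
    total.update(partial)

print(dict(total))
```

With real files you would write each `Counter` out to disk between the two passes, which is exactly the intermediate "reduced version of each log file" step above.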
>
> Cheers.
>
> On 30 October 2010 15:51, Artur Ejsmont <ejsmont.artur@xxxxxxxxx> wrote:
>> sure that was a bit more helpful, thanks :)
>>
>> I was still wondering what other use cases this would apply to. This
>> is a good article (the best so far, I guess):
>> http://code.google.com/edu/parallel/mapreduce-tutorial.html
>>
>> The thing is that reduce has to aggregate data, or it would be
>> impractical. So I am trying to see more examples to fully understand
>> the limitations of the method.
>>
>> Let's say I want to find the top 10 IP addresses in an access log:
>> - split the log into small files
>> - take one fragment (one file)
>> - a worker maps it to a list of <ip, 1>
>> - before reduce is called, the data is sorted by ip
>> - reduce produces <ip, totalCountPerLogFileSample>
>>
>> So I have a bunch of files with aggregated lists of <ip,
>> totalCountPerFile>. But then wouldn't they have to be merged across all
>> the results again, with another sort/reduce call? Or, to avoid that,
>> would I need the initial data to be clustered already, so that one ip
>> appears only in one chunk file?
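The steps above, including the second sort/reduce pass over the merged partial results, might look roughly like this in plain Python (hypothetical log lines; in Hadoop the framework does the sorting and shuffling between map and reduce for you, so the input does not need to be pre-clustered by IP):

```python
import heapq
from itertools import groupby

chunks = [
    ["1.1.1.1 GET /", "2.2.2.2 GET /a", "1.1.1.1 GET /b"],  # fragment 1
    ["2.2.2.2 GET /", "3.3.3.3 GET /c"],                     # fragment 2
]

def map_chunk(lines):
    # worker maps each log line to <ip, 1>
    return [(line.split()[0], 1) for line in lines]

def reduce_sorted(pairs):
    # pairs must already be sorted by ip; emit <ip, totalCount>
    return [(ip, sum(c for _, c in group))
            for ip, group in groupby(pairs, key=lambda kv: kv[0])]

# first pass: per-chunk map + sort + reduce
partials = [reduce_sorted(sorted(map_chunk(c))) for c in chunks]

# second pass: merge all partial results, sort by ip, reduce again
merged = reduce_sorted(sorted(p for part in partials for p in part))

# top 10 by total count
top10 = heapq.nlargest(10, merged, key=lambda kv: kv[1])
```

So yes, one more merge + sort + reduce pass over the per-file results gives the global counts, even when the same IP appears in several chunks.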
>>
>> Does that make sense?
>>
>> As I said, I am still trying to figure out how and when it should be
>> applied ... and also how to transform problems to make it work :)
>>
>> I want to write some simple map/reduce like the one above, just to see
>> it working and play around a bit :)
>>
>> cheers
>>
>> Art
>>
>> On 22 October 2010 16:49, Andrés G. Montañez <andresmontanez@xxxxxxxxx> wrote:
>>> Imagine you have to keep track of some kind of traffic, for example
>>> "ad impressions";
>>> let's suppose you have millions of those hits; you will need a few
>>> servers to
>>> receive the notifications of the impressions of an ad.
>>>
>>> At the end of the day, you will have that info spread across a bunch of
>>> servers; mostly you will have
>>> a record of each impression indicating the identifier (id) of the ad.
>>>
>>> For this info to become useful, you will have to aggregate it; for
>>> example, to know which ad has the most impressions.
>>> You will have to iterate over all the servers and MAP the info into one
>>> place; now that you have all the info,
>>> you will have to REDUCE it, so you end up with one record per ad
>>> identifier indicating the TOTAL impressions for that day.
>>>
>>> That's the basic idea. It's like the aftermath of "divide and conquer".
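A tiny Python sketch of that map-then-reduce flow, with made-up per-server impression logs (each record is just an ad id):

```python
from collections import Counter

# Hypothetical per-server impression logs.
servers = [
    ["ad42", "ad7", "ad42"],        # server A
    ["ad7", "ad42", "ad99"],        # server B
]

# MAP: iterate over all servers and bring every record into one place.
all_hits = [ad for server in servers for ad in server]

# REDUCE: one record per ad id with the total impressions of the day.
totals = Counter(all_hits)

best_ad, best_count = totals.most_common(1)[0]
print(best_ad, best_count)
```

In production the "map" step would be distributed across machines rather than a list comprehension, but the shape of the computation is the same.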
>>>
>>> Hope this will be useful.
>>>
>>> Cheers.
>>>
>>> On 22 October 2010 13:27, Artur Ejsmont <ejsmont.artur@xxxxxxxxx> wrote:
>>>> Hehe ... sorry, but this does not help :-) I can google for Wikipedia
>>>> definitions myself.
>>>>
>>>> I was hoping for some really good articles/examples that would put it
>>>> into enough context. I would like to have a good idea of when it could
>>>> be useful.
>>>>
>>>> So far I've had no luck with that. It's like with design patterns ...
>>>> people who don't understand them should not write articles trying to
>>>> explain them to others :P
>>>>
>>>> Art
>>>>
>>>> On 22 October 2010 15:29, Andrés G. Montañez <andresmontanez@xxxxxxxxx> wrote:
>>>>> Hi Artur,
>>>>>
>>>>> Here is an article on wikipedia: http://en.wikipedia.org/wiki/MapReduce
>>>>>
>>>>> And here are the native implementations in php:
>>>>> http://www.php.net/manual/en/function.array-map.php
>>>>> http://www.php.net/manual/en/function.array-reduce.php
>>>>>
>>>>> The basic idea is to gather a lot of data, from several nodes, and
>>>>> "map" it together;
>>>>> then, assuming a lot of this data is repeated across the dataset, we
>>>>> "reduce" it.
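Python has direct counterparts to the PHP functions linked above: the built-in map() and functools.reduce(). A tiny single-node illustration of the "map then reduce" idea (the sample IPs are made up):

```python
from functools import reduce

ips = ["1.1.1.1", "2.2.2.2", "1.1.1.1"]

# map: turn each record into a (key, 1) pair
pairs = list(map(lambda ip: (ip, 1), ips))

# reduce: fold the pairs into a dict of counts per key
def fold(acc, pair):
    ip, n = pair
    acc[ip] = acc.get(ip, 0) + n
    return acc

counts = reduce(fold, pairs, {})
# counts == {'1.1.1.1': 2, '2.2.2.2': 1}
```

Hadoop's contribution is running the map over many machines and grouping the pairs by key before the reduce, but the per-key fold is the same shape as this one-liner pipeline.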
>>>>>
>>>>>
>>>>> Cheers.
>>>>>
>>>>> On 22 October 2010 12:14, Artur Ejsmont <ejsmont.artur@xxxxxxxxx> wrote:
>>>>>> Hi there, guys and girls.
>>>>>>
>>>>>> Has anyone come across any reasonable explanation / articles on how
>>>>>> Hadoop and MapReduce work in practice?
>>>>>>
>>>>>> I have read a few articles now and then, and I must say I am puzzled
>>>>>> ... am I stupid, or can they just not find an easy way to explain it? :P
>>>>>>
>>>>>> What I would hope for is an explanation based on a simple example
>>>>>> application, preferably with some code samples.
>>>>>>
>>>>>> Anyone good at it here?
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>> --
>>>>>> PHP Database Mailing List (http://www.php.net/)
>>>>>> To unsubscribe, visit: http://www.php.net/unsub.php
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Andrés G. Montañez
>>>>> Zend Certified Engineer
>>>>> Montevideo - Uruguay
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Visit me at:
>>>> http://artur.ejsmont.org/blog/
>>>>
>>>
>>>
>>>
>>> --
>>> Andrés G. Montañez
>>> Zend Certified Engineer
>>> Montevideo - Uruguay
>>>
>>
>>
>>
>> --
>> Visit me at:
>> http://artur.ejsmont.org/blog/
>>
>
>
>
> --
> Andrés G. Montañez
> Zend Certified Engineer
> Montevideo - Uruguay
>



-- 
Visit me at:
http://artur.ejsmont.org/blog/




