Re: Greenplum MapReduce

Suvankar Roy <suvankar.roy@xxxxxxx> · Mon, 3 Aug 2009 15:06:49 +0530

Hi Richard,

I sincerely regret the inconvenience caused.

%YAML 1.1

---

VERSION: 1.0.0.1 

DATABASE: test_db1

USER: gpadmin

DEFINE: 

        - INPUT: #****** This is the line which is causing the error ******#

                NAME: doc

                TABLE: documents

        - INPUT:

                NAME: kw

                TABLE: keywords

        - MAP: 

                NAME: doc_map

                LANGUAGE: python

                FUNCTION: |

   i = 0 

   terms = {}

   for term in data.lower().split(): 

           i = i + 1

           if term in terms: 

                   terms[term] += ','+str(i) 

           else: 

                   terms[term] = str(i) 

   for term in terms: 

           yield([doc_id, term, terms[term]])

                OPTIMIZE: STRICT IMMUTABLE 

                PARAMETERS: 

   - doc_id integer 

   - data text 

                RETURNS: 

   - doc_id integer 

   - term text 

   - positions text 

        - MAP: 

                NAME:   kw_map

                LANGUAGE: python

                FUNCTION: |

   i = 0 

   terms = {} 

   for term in keyword.lower().split(): 

           i = i + 1 

           if term in terms: 

                   terms[term] += ','+str(i) 

           else: 

                   terms[term] = str(i) 

   for term in terms: 

           yield([keyword_id, i, term, terms[term]])

                OPTIMIZE: STRICT IMMUTABLE 

                PARAMETERS: 

   - keyword_id integer 

   - keyword text 

                RETURNS: 

   - keyword_id integer 

   - nterms integer 

   - term text 

   - positions text 

        - TASK: 

                NAME: doc_prep

                SOURCE: doc 

                MAP: doc_map

        - TASK: 

                NAME: kw_prep 

                SOURCE: kw 

                MAP: kw_map 

        - INPUT: 

                NAME: term_join

                QUERY: | 

   SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms, 

           doc.positions as doc_positions,

           kw.positions as kw_positions 

    FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)

        - REDUCE: 

                NAME: term_reducer

                TRANSITION: term_transition

                FINALIZE: term_finalizer

        - TRANSITION: 

                NAME: term_transition

                LANGUAGE: python

                PARAMETERS: 

   - state text 

   - term text 

   - nterms integer 

   - doc_positions text 

   - kw_positions text 

                FUNCTION: | 

   if state: 

           kw_split = state.split(':') 

   else: 

           kw_split = [] 

           for i in range(0,nterms): 

                   kw_split.append('')

   for kw_p in kw_positions.split(','): 

           kw_split[int(kw_p)-1] = doc_positions

   outstate = kw_split[0] 

   for s in kw_split[1:]: 

           outstate = outstate + ':' + s

   return outstate 

        - FINALIZE: 

                NAME: term_finalizer

                LANGUAGE: python

                RETURNS: 

   - count integer 

                MODE: MULTI 

                FUNCTION: | 

   if not state: 

           return 0 

   kw_split = state.split(':') 

   previous = None 

   for i in range(0,len(kw_split)): 

           isplit = kw_split[i].split(',')

           if any(map(lambda(x): x == '', isplit)): 

                   return 0 

           adjusted = set(map(lambda(x): int(x)-i, isplit)) 

           if (previous): 

                   previous = adjusted.intersection(previous) 

           else: 

                   previous = adjusted 

   if previous: 

           return len(previous) 

   return 0

        - TASK: 

                NAME: term_match

                SOURCE: term_join

                REDUCE: term_reducer

        - INPUT: 

                NAME: final_output

                QUERY: | 

   SELECT doc.*, kw.*, tm.count 

   FROM documents doc, keywords kw, term_match tm 

   WHERE doc.doc_id = tm.doc_id 

     AND kw.keyword_id = tm.keyword_id 

     AND tm.count > 0 

        EXECUTE: 

                - RUN: 

   SOURCE: final_output 

   TARGET: STDOUT
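
Separately from the YAML syntax error, the body of doc_map can be checked on its own outside Greenplum. The following is only a minimal sketch, assuming Python 3 and a plain function wrapper that I have added for illustration (in the program above Greenplum supplies doc_id and data itself); the sample sentence is made up.

```python
def doc_map(doc_id, data):
    """Yield [doc_id, term, positions] for each distinct term in data.

    positions is a comma-separated string of 1-based word positions,
    mirroring the FUNCTION body of doc_map in the YAML program.
    """
    i = 0
    terms = {}
    for term in data.lower().split():
        i = i + 1
        if term in terms:
            terms[term] += ',' + str(i)
        else:
            terms[term] = str(i)
    for term in terms:
        yield [doc_id, term, terms[term]]


# Hypothetical sample document, used only to exercise the function.
rows = list(doc_map(1, "the quick brown fox jumps over the lazy dog"))
for row in rows:
    print(row)
```

If this behaves as expected, the Python logic is fine and the failure is confined to how the YAML file itself is laid out.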

I have learnt that unnecessary TABs can be the cause of this, so I am
trying to remove them; hopefully the problem will go away then.
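
One way to locate stray tabs is to scan the file text and report every tab character with its line and column. This is just a rough sketch; the function name find_tabs and the sample text are my own assumptions, and in practice the text would be read from the actual YAML file.

```python
def find_tabs(text):
    """Return (line_number, column, in_indent) for every tab in text.

    line_number and column are 1-based; in_indent is True when the tab
    sits in the leading whitespace of its line.
    """
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Width of the leading run of spaces and tabs on this line.
        indent_width = len(line) - len(line.lstrip(" \t"))
        for column, char in enumerate(line, start=1):
            if char == "\t":
                hits.append((line_number, column, column <= indent_width))
    return hits


# Hypothetical sample: a tab used as indentation and a tab after a key.
sample = "DEFINE:\n\t- INPUT:\n        NAME:\tdoc\n"
for line_number, column, in_indent in find_tabs(sample):
    where = "indentation" if in_indent else "content"
    print(f"line {line_number}, column {column}: tab in {where}")
```

Each reported tab can then be replaced by spaces in the editor.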

Regards,

Suvankar Roy

Richard Huxton <dev@xxxxxxxxxxxx>

08/03/2009 02:55 PM

To
Suvankar Roy <suvankar.roy@xxxxxxx>

cc
pgsql-performance@xxxxxxxxxxxxxx

Subject
Re:  Greenplum MapReduce

Suvankar Roy wrote:

> Hi all,

> 

> Has anybody worked on Greenplum MapReduce programming ?

> 

> I am facing a problem while trying to execute the below Greenplum

> Mapreduce program written in YAML (in blue). 

The other poster suggested contacting Greenplum and I can only agree.

> The error is thrown in the 7th line as:

> Error: YAML syntax error - found character that cannot start any token

> while scanning for the next token, at line 7 (in red)

There is no red, particularly if viewing messages as plain text (which

most people do on mailing lists). Consider indicating a line some other

way next time (commonly below the line you put something like "this is

line 7 ^^^^^")

The most common problem I get with YAML files though is when a tab is 

accidentally inserted instead of spaces at the start of a line.

-- 

   Richard Huxton

   Archonet Ltd
