Re: Out of Memory and Configuration Problems (Big Computer)

Tom Wilcox <hungrytom@xxxxxxxxxxxxxx> · Mon, 31 May 2010 20:14:34 +0100

I am having difficulties. I have rerun my update that uses the python 
functions..

(1) UPDATE nlpg.match_data SET org = normalise(org);

And some other similar queries on neighbouring fields in the table. They 
have all now worked. Without any changes to the configuration. I have 
done one thing in an attempt to minimise the risk of memory leak 
normalise() I added "toks = None" to the end of the normalise() 
function. However this was done after query (1) succeeded on the rerun.

Why would I get inconsistent behaviour? Would it have anything to do 
with SQL Server running on the same machine (although not actually doing 
anything at the moment - just idle server running in background).

Tangent: Is there any way to increase the memory allocated to postgres 
by Windows using Job Objects?

Cheers,
Tom

On 29/05/2010 18:55, Bill Moran wrote:
On 5/28/10 8:43:48 PM, Tom Wilcox wrote:
I ran this query:

EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org;

And I got this result:

"Seq Scan on match_data  (cost=0.00..9762191.68 rows=32205168 width=206)
(actual time=76873.592..357450.519 rows=27777961 loops=1)"
"Total runtime: 8028212.367 ms"

That would seem to indicate that the problem is in your Python
functions.

Some ideas for next steps:
* Perhaps it's just a few rows that have data in them that the
  function has difficulty with.  Add some debugging/logging to
  the function and see if the row it bombs on has anything unusual
  in it (such as a very large text field)
* While large, that function is fairly simplistic.  You may want
  to consider rewriting it as an SQL function, which should be
  more efficient in any event.

On 28 May 2010 19:39, Tom Wilcox <hungrytom@xxxxxxxxxxxxxx
<mailto:hungrytom@xxxxxxxxxxxxxx>> wrote:

    Oops. Sorry about that.

    I am having this problem with multiple queries however I am
    confident that a fair number may involve the custom plpython
    "normalise" function which I have made myself. I didn't think it
    would be complicated enough to produce a memory problem.. here it 
is:

    -- Normalises common address words (i.e. 'Ground' maps to 'grd')
    CREATE OR REPLACE FUNCTION normalise(s text) RETURNS text AS $$
    ADDR_FIELD_DELIM = ' '

    # Returns distinct list without null or empty elements
    def distinct_str(list):
         seen = set()
         return [x for x in list if x not in seen and not seen.add(x)
    and x!=None and len(x)>0]

    # normalise common words in given address string
    def normalise(match_data):
         if match_data==None: return ''
         import re
         # Tokenise
         toks = distinct_str(re.split(r'\s', match_data.lower()))
         out = ''
         for tok in toks:
             ## full word replace
             if tok == 'house' : out += 'hse'+ADDR_FIELD_DELIM
             elif tok == 'ground' : out += 'grd'+ADDR_FIELD_DELIM
             elif tok == 'gnd' : out += 'grd'+ADDR_FIELD_DELIM
             elif tok == 'front' : out += 'fnt'+ADDR_FIELD_DELIM
             elif tok == 'floor' : out += 'flr'+ADDR_FIELD_DELIM
             elif tok == 'floors' : out += 'flr'+ADDR_FIELD_DELIM
             elif tok == 'flrs' : out += 'flr'+ADDR_FIELD_DELIM
             elif tok == 'fl' : out += 'flr'+ADDR_FIELD_DELIM
             elif tok == 'basement' : out += 'bst'+ADDR_FIELD_DELIM
             elif tok == 'subbasement' : out += 'sbst'+ADDR_FIELD_DELIM
             elif tok == 'bsmt' : out += 'bst'+ADDR_FIELD_DELIM
             elif tok == 'lbst' : out += 'lower bst'+ADDR_FIELD_DELIM
             elif tok == 'street' : out += 'st'+ADDR_FIELD_DELIM
             elif tok == 'road' : out += 'rd'+ADDR_FIELD_DELIM
             elif tok == 'lane' : out += 'ln'+ADDR_FIELD_DELIM
             elif tok == 'rooms' : out += 'rm'+ADDR_FIELD_DELIM
             elif tok == 'room' : out += 'rm'+ADDR_FIELD_DELIM
             elif tok == 'no' : pass
             elif tok == 'number' : pass
             elif tok == 'and' : out += '&'+ADDR_FIELD_DELIM
             elif tok == 'rear' : out += 'rr'+ADDR_FIELD_DELIM
             elif tok == 'part' : out += 'pt'+ADDR_FIELD_DELIM
             elif tok == 'south' : out += 's'+ADDR_FIELD_DELIM
             elif tok == 'sth' : out += 's'+ADDR_FIELD_DELIM
             elif tok == 'north' : out += 'n'+ADDR_FIELD_DELIM
             elif tok == 'nth' : out += 'n'+ADDR_FIELD_DELIM
             elif tok == 'west' : out += 'w'+ADDR_FIELD_DELIM
             elif tok == 'wst' : out += 'w'+ADDR_FIELD_DELIM
             elif tok == 'east' : out += 'e'+ADDR_FIELD_DELIM
             elif tok == 'est' : out += 'e'+ADDR_FIELD_DELIM
             elif tok == 'first' : out += '1st'+ADDR_FIELD_DELIM
             elif tok == 'second' : out += '2nd'+ADDR_FIELD_DELIM
             elif tok == 'third' : out += '3rd'+ADDR_FIELD_DELIM
             elif tok == 'fourth' : out += '4th'+ADDR_FIELD_DELIM
             elif tok == 'fifth' : out += '5th'+ADDR_FIELD_DELIM
             elif tok == 'sixth' : out += '6th'+ADDR_FIELD_DELIM
             elif tok == 'seventh' : out += '7th'+ADDR_FIELD_DELIM
             elif tok == 'eighth' : out += '8th'+ADDR_FIELD_DELIM
             elif tok == 'ninth' : out += '9th'+ADDR_FIELD_DELIM
             elif tok == 'tenth' : out += '10th'+ADDR_FIELD_DELIM
             elif tok == 'eleventh' : out += '11th'+ADDR_FIELD_DELIM
             elif tok == 'twelfth' : out += '12th'+ADDR_FIELD_DELIM
             elif tok == 'thirteenth' : out += '13th'+ADDR_FIELD_DELIM
             elif tok == 'fourteenth' : out += '14th'+ADDR_FIELD_DELIM
             elif tok == 'fifteenth' : out += '15th'+ADDR_FIELD_DELIM
             elif tok == 'sixteenth' : out += '16th'+ADDR_FIELD_DELIM
             elif tok == 'seventeenth' : out += '17th'+ADDR_FIELD_DELIM
             elif tok == 'eighteenth' : out += '18th'+ADDR_FIELD_DELIM
             elif tok == 'nineteenth' : out += '19th'+ADDR_FIELD_DELIM
             elif tok == 'twentieth' : out += '20th'+ADDR_FIELD_DELIM
             # numbers 0 - 20
             elif tok == 'one' : out += '1'+ADDR_FIELD_DELIM
             elif tok == 'two' : out += '2'+ADDR_FIELD_DELIM
             elif tok == 'three' : out += '3'+ADDR_FIELD_DELIM
             elif tok == 'four' : out += '4'+ADDR_FIELD_DELIM
             elif tok == 'five' : out += '5'+ADDR_FIELD_DELIM
             elif tok == 'six' : out += '6'+ADDR_FIELD_DELIM
             elif tok == 'seven' : out += '7'+ADDR_FIELD_DELIM
             elif tok == 'eight' : out += '8'+ADDR_FIELD_DELIM
             elif tok == 'nine' : out += '9'+ADDR_FIELD_DELIM
             elif tok == 'ten' : out += '10'+ADDR_FIELD_DELIM
             elif tok == 'eleven' : out += '11'+ADDR_FIELD_DELIM
             elif tok == 'twelve' : out += '12'+ADDR_FIELD_DELIM
             elif tok == 'thirteen' : out += '13'+ADDR_FIELD_DELIM
             elif tok == 'fourteen' : out += '14'+ADDR_FIELD_DELIM
             elif tok == 'fifteen' : out += '15'+ADDR_FIELD_DELIM
             elif tok == 'sixteen' : out += '16'+ADDR_FIELD_DELIM
             elif tok == 'seventeen' : out += '17'+ADDR_FIELD_DELIM
             elif tok == 'eighteen' : out += '18'+ADDR_FIELD_DELIM
             elif tok == 'nineteen' : out += '19'+ADDR_FIELD_DELIM
             elif tok == 'twenty' : out += '20'+ADDR_FIELD_DELIM
             # town dictionary items
             elif tok == 'borough' : pass
             elif tok == 'city' : pass
             elif tok == 'of' : pass
             elif tok == 'the' : pass
             # a few extras (from looking at voa)
             elif tok == 'at' : pass
             elif tok == 'incl' : pass
             elif tok == 'inc' : pass
             else: out += tok+ADDR_FIELD_DELIM
         return out

    return normalise(s)
    $$ LANGUAGE plpythonu;

    Here's the create script for the table from pgAdmin (I hope that
    will be good enough instead of \d as I can't do that right now)..

    -- Table: nlpg.match_data

    -- DROP TABLE nlpg.match_data;

    CREATE TABLE nlpg.match_data
    (
       premise_id integer,
       usrn bigint,
       org text,
       sao text,
    "level" text,
       pao text,
    "name" text,
       street text,
       town text,
       pc postcode,
       postcode text,
       match_data_id integer NOT NULL DEFAULT
    nextval('nlpg.match_data_match_data_id_seq1'::regclass),
       addr_str text,
       tssearch_name tsvector,

       CONSTRAINT match_data_pkey1 PRIMARY KEY (match_data_id)
    )
    WITH (
       OIDS=FALSE
    );
    ALTER TABLE nlpg.match_data OWNER TO postgres;
    ALTER TABLE nlpg.match_data ALTER COLUMN "name" SET STATISTICS 
10000;

    -- Index: nlpg.index_match_data_mid

    -- DROP INDEX nlpg.index_match_data_mid;

    CREATE INDEX index_match_data_mid
       ON nlpg.match_data
       USING btree
       (match_data_id);

    -- Index: nlpg.index_match_data_pc

    -- DROP INDEX nlpg.index_match_data_pc;

    CREATE INDEX index_match_data_pc
       ON nlpg.match_data
       USING btree
       (pc);

    -- Index: nlpg.index_match_data_pid

    -- DROP INDEX nlpg.index_match_data_pid;

    CREATE INDEX index_match_data_pid
       ON nlpg.match_data
       USING btree
       (premise_id);

    -- Index: nlpg.index_match_data_tssearch_name

    -- DROP INDEX nlpg.index_match_data_tssearch_name;

    CREATE INDEX index_match_data_tssearch_name
       ON nlpg.match_data
       USING gin
       (tssearch_name);

    -- Index: nlpg.index_match_data_usrn

    -- DROP INDEX nlpg.index_match_data_usrn;

    CREATE INDEX index_match_data_usrn
       ON nlpg.match_data
       USING btree
       (usrn);

    As you can see, no FKs or triggers..

    I am running: EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org;

    However, as it should take around 90mins (if it is linear) then I
    thought I would send this now and follow up with the results once it
    finishes. (Has taken 2hours so far..)

    Thanks very much for your help.

    Tom

    On 28 May 2010 17:54, "Bill Moran" <wmoran@xxxxxxxxxxxxxxxxx
<mailto:wmoran@xxxxxxxxxxxxxxxxx>> wrote:

    In response to Tom Wilcox <hungrytom@xxxxxxxxxxxxxx
<mailto:hungrytom@xxxxxxxxxxxxxx>>:

> In addition, I have discovered that the update query that runs
    on each row
> of a 27million row ta...

    You're not liable to get shit for answers if you omit the mailing
    list from
    the conversation, especially since I know almost nothing about 
tuning
    PostgreSQL installed on Windows.

    Are there multiple queries having this problem?  The original
    query didn't
    have normalise() in it, and I would be highly suspicious that a 
custom
    function may have a memory leak or other memory-intensive
    side-effects.
    What is the code for that function?

    For example, does:
    UPDATE nlpg.match_data SET org = org WHERE match_data_id;
    finish in a reasonable amount of time or exhibit the same out of
    memory
    problem?

    It'd be nice to see a \d on that table ... does it have any
    triggers or
    cascading foreign keys?

    And stop

    --

    Bill Moran
    http://www.potentialtech.com
    http://people.collaborativefusion.com/~wmoran/
<http://people.collaborativefusion.com/%7Ewmoran/>

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general