Re: problem with lost connection while running long PL/R query

"David M. Kaplan" <david.kaplan@xxxxxx> · Fri, 17 May 2013 10:18:23 +0200

Hi Joe,

Thanks for responding as you would clearly be the expert on this sort of 
problem.  My current function does page through data using a cursor 
precisely to avoid out of memory problems, which is why I am somewhat 
surprised and stumped as to how this can be happening.  It does return 
all the data at once, but one call to the function would seem to work, 
so I can't see why 4 wouldn't.  I am currently planning to do some test 
runs using memory.profile() to see if each successive call to the PL/R 
function is somehow accumulating memory usage somewhere.  Perhaps I am 
not properly closing a query or something like that?

I am attaching my code.  Perhaps you will have some insight.  To give 
you a basic idea of what I am trying to do, I have separately developed 
a classification model for the state of a "system" based on the data in 
the postgresql table.  I want to apply that model to each line of the 
table.  I loop over the cursor and predict the state for batches of 
10,000 lines at a time.

Thanks again for the help.

Cheers,
David

On 05/16/2013 11:40 PM, Joe Conway wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/16/2013 08:40 AM, Tom Lane wrote:
"David M. Kaplan" <david.kaplan@xxxxxx> writes:
Thanks for the help.  You have definitely identified the problem,
but I am still looking for a solution that works for me.  I tried
setting vm.overcommit_memory=2, but this just made the query
crash quicker than before, though without killing the entire
connection to the database.  I imagine that this means that I
really am trying to use more memory than the system can handle?
I am wondering if there is a way to tell postgresql to flush a
set of table lines out to disk so that the memory they are using
can be liberated.
Assuming you don't have work_mem set to something unreasonably
large, it seems likely that the excessive memory consumption is
inside your PL/R function, and not the fault of Postgres per se.
You might try asking in some R-related forums about how to reduce
the code's memory usage.
The two "classic" approaches to this with PL/R are either create a
custom aggregate with the PL/R as the final function (i.e. work on one
group at a time) or use the SPI cursor functionality within the PL/R
function and page your data using a cursor. Not all forms of analysis
lend themselves to these approaches, but perhaps yours does.

Ultimately I would like to implement a form of R data.frame that does
the paging with a cursor transparently for you, but I have not been
able to find the time so far.

Joe

- -- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJRlVJdAAoJEDfy90M199hle8gP+wU+GSJ44g26VBBAy3po/E/Y
9+pwxBhJe0x6v5PXtuM8Bzyy4yjlKCgzDj4XdZpEU7SYR+IKj7tWCihqc+Fuk1t1
EjR2VUJwpSMztRvEIqWW8rX/DFGaVYCt89n0neKfKL/XJ5rbqMqQAUPbxMaBtW/p
7EXo8RjVBMYibkvKrjpYJjLTuOTWkQCiXx5hc4HVFN53DYOF46rdFxMYUe5KLYTL
mZOnSoV0yrsaPGnxRIY0uzRv7ZTTBmB2o4TIWpTySx2rHNLqAJIT22wl0pfkjksH
JYvko3rWhSg7vSf+8RDN6X1eMAXcUO7H2NR5IdOoXEX2bzqTmDBQUjOcb5WR1yUd
L5XuT5WYiTpyzU8qAtPEVirwFnEwUN1tR6wDoVsseIWwXUYqSuXtg9qjFNAXZ1Hr
05yxuzexOEzLQNwSXWhsCrLdnndEHrJ6pDlLaUCPVybxwwwW9BfS2fJUz+X63M8x
l5DYbyl6q6o2J2bs4UGCTk4r/1Qq/R9pApkWzsckTtF6zl49mzwzPnh5b/JcB+4x
u17Te+s3cRGcX09lt7qf9cWkv1uUF/Qw0ntBhW8TY2HYhbWVIEmiZV1HIksXf+nw
EBFshWs2/H75OPnhN9YNq3tjCuiR7o/eaZeINfGs2LzGIJvHpcjMDBgFFTES7CYV
Y20XukH07h9XcJGTsf0o
=TwfD
-----END PGP SIGNATURE-----

--
**********************************
David M. Kaplan
Charge de Recherche 1

Institut de Recherche pour le Developpement
Centre de Recherche Halieutique Mediterraneenne et Tropicale
av. Jean Monnet
B.P. 171
34203 Sete cedex
France

Phone: +33 (0)4 99 57 32 27
Fax: +33 (0)4 99 57 32 95

http://www.umr-eme.org/team/dkaplan/
http://www.amped.ird.fr/
**********************************

CREATE TYPE predict_classification_model_type
       AS (clean_pt_id int, class varchar(20));

--- NOTE: Must be superuser to create PLR functions
CREATE OR REPLACE FUNCTION predict_classification_model
       (filename varchar(256), modelname varchar(256),
        schemaname varchar(256), tablename varchar(256),
	wherecondition varchar(256) )
RETURNS SETOF predict_classification_model_type AS 
$BODY$

pg.thrownotice('Starting predict_classification_model')

mp=capture.output(memory.profile())
for (m in mp) pg.thrownotice(m)

tablename <- pg.quoteident(tablename)
if (nchar(schemaname)>0) {
  schemaname <- pg.quoteident(schemaname)
  tablename <- paste(schemaname,tablename,sep=".")
}

if (nchar(wherecondition) == 0) {
   wherecondition = 'TRUE'
}
wherecondition = pg.quoteliteral(wherecondition)

# Load in file with model.  This can be fairly large.
load(filename)
themodel <- get(modelname)

# Count number of lines in table
s <- paste('SELECT count(*) FROM',tablename,'WHERE',wherecondition)
q <- pg.spi.exec(s)
n <- q$count

# Create empty data.frame for storing result
resdf <- data.frame( clean_pt_id=rep(NA,n), class=rep(NA,n) )

# Open cursor for loading in data a bunch at a time
s <- paste('SELECT * FROM',tablename,'WHERE',wherecondition)
p <- pg.spi.prepare(s)
c <- pg.spi.cursor_open('my_cursor',p);

# Load in chunks and put data in result matrix
M <- 10000 # This being large helps avoid poorly factorizing columns
N <- ceiling(n/M)
if (N>0) {
  for (i in 1:N) {
    d <- pg.spi.factor(pg.spi.cursor_fetch(c,TRUE,as.integer(M)))

    # Fix NA in certain columns
    l <- d[,na.cols]
    l[ is.na(l) ] <- na.replace.val
    d[,na.cols] <- l

    # Predict classes
    pr <- as.character(predict(object = themodel, newdata= d, type="response"))

    # Index of first new element
    ii <- (i-1)*M+1

    pg.thrownotice(paste(ii,"of",n))

    # New stuff to place in data frame
    resdf[ii:(ii+dim(d)[1]-1),] <- cbind( d$clean_pt_id, pr )
  }
}

pg.spi.cursor_close(c)

mp=capture.output(memory.profile())
for (m in mp) pg.thrownotice(m)

return(resdf)

$BODY$ 
LANGUAGE 'plr';

CREATE OR REPLACE FUNCTION predict_classification_model
       (filename varchar(256), modelname varchar(256),
        schemaname varchar(256), tablename varchar(256) )
RETURNS SETOF predict_classification_model_type AS 
$BODY$
SELECT predict_classification_model($1,$2,$3,$4,'');
$BODY$
LANGUAGE 'SQL';

--- This is how to use the function
-- SELECT predict_classification_model('/tmp/model.randomForest.mtry_4.ntree_1500.RData','themodel','fads_stats','all_stats');
-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general