I am working with very large sets of time-series data. Imagine a table with a timestamp as the primary key. One question I need to ask of my data is: Are there gaps of time greater than some interval between consecutive rows? I wrote a function in plpgsql to answer this question and it worked great. Being a python zealot I decided to rewrite the function in plpythonu to compare performance. While initial comparisons seemed inconclusive, after testing on large queries (over a million records) I discovered ever-increasing time to complete the exact same query and massive memory growth in my postgres process to the point of memory starvation in under 15 queries. I've reduced my my schema to one table with one timestamp column, one type and two functions in a schema named plpythonu_bug and saved with: `pg_dump -n plpythonu_bug -s -O > bug.sql`. It is attached. Here are some statistics on two separate psql sessions, one where I ran this plpgsql function several times: EXPLAIN ANALYZE SELECT count(*) from gaps('2008-01-01', '2010-01-01', '1 min'); Then a second session running the exact same query but with the plpythonu function, pygaps. Note: I had over 273,000 rows in my table. The function returned 5103 rows each run. Memory usage is from `top` output. Milliseconds, from output of explain analyze. This is on an Ubuntu 10.04 system w/ 2GB RAM, postgres 8.4.6, python 2.6.5. plpgsql function ---------------- Run # Virt Res ms before 101m 3500 n/a 1 103m 17m 584 2 104m 17m 561 3 104m 18m 579 ...etc... (virtually no movement over several runs) plpythonu function ------------------ Run # Virt Res ms before 101m 3492 n/a 1 213m 122m 1836 2 339m 246m 1784 3 440m 346m 2178 ...and so on, about 100m or so increase with each run such that in a dozen or so runs I had 1.5g in resident memory and single calls to the function taking over 45 seconds. My schema is attached. Thanks for any help and insight, Dan Popowich
-- -- PostgreSQL database dump -- SET statement_timeout = 0; SET client_encoding = 'UTF8'; SET standard_conforming_strings = off; SET check_function_bodies = false; SET client_min_messages = warning; SET escape_string_warning = off; -- -- Name: plpythonu_bug; Type: SCHEMA; Schema: -; Owner: - -- CREATE SCHEMA plpythonu_bug; SET search_path = plpythonu_bug, pg_catalog; -- -- Name: timerange; Type: TYPE; Schema: plpythonu_bug; Owner: - -- CREATE TYPE timerange AS ( begin_ts timestamp without time zone, end_ts timestamp without time zone ); -- -- Name: gaps(timestamp without time zone, timestamp without time zone, interval); Type: FUNCTION; Schema: plpythonu_bug; Owner: - -- CREATE FUNCTION gaps(start_ts timestamp without time zone, end_ts timestamp without time zone, gap_length interval) RETURNS SETOF timerange LANGUAGE plpgsql AS $$ DECLARE prev timestamp; curr timestamp; tr timerange; BEGIN FOR curr IN SELECT ts FROM timeseries WHERE ts BETWEEN start_ts AND end_ts ORDER BY ts LOOP IF curr - prev > gap_length THEN tr.begin_ts := prev; tr.end_ts := curr; RETURN NEXT tr; END IF; prev := curr; END LOOP; RETURN; END; $$; -- -- Name: pygaps(timestamp without time zone, timestamp without time zone, interval); Type: FUNCTION; Schema: plpythonu_bug; Owner: - -- CREATE FUNCTION pygaps(start_ts timestamp without time zone, end_ts timestamp without time zone, gap_length interval) RETURNS SETOF timerange LANGUAGE plpythonu AS $$ # because pg passes date/time to python as strings I'm using pg to # recompute values as seconds so I have numbers to do math gap = plpy.execute("select extract(epoch from '%s'::interval) as sec" % gap_length)[0]['sec'] results = plpy.execute("""select ts, extract(epoch from ts) as epoch from timeseries where ts between '%s' and '%s'""" % (start_ts, end_ts)) if results.nrows() < 2: return # prime the well by setting prev(ious) to the first tic and # iterate starting with the second... prev = results[0] for curr in results[1:]: # yield timestamp pairs for gaps of timestamps greater than gap if curr['epoch'] - prev['epoch'] > gap: yield dict(begin_ts=prev['ts'], end_ts=curr['ts']) prev = curr return $$; SET default_tablespace = ''; SET default_with_oids = false; -- -- Name: timeseries; Type: TABLE; Schema: plpythonu_bug; Owner: -; Tablespace: -- CREATE TABLE timeseries ( ts timestamp without time zone ); -- -- PostgreSQL database dump complete --
-- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general