Hello
It is strange. I tried to simulate it, without success, on a 2M row table where every id appears twice:

postgres=# create table foo(a int, b text);
CREATE TABLE
postgres=# insert into foo select i, md5(i::text) from generate_series(1,1000000) g(i);
INSERT 0 1000000
postgres=# insert into foo select i, md5(i::text) from generate_series(1,1000000) g(i);
INSERT 0 1000000
postgres=# CREATE INDEX on foo(b);
CREATE INDEX
postgres=# ANALYZE foo;
ANALYZE
postgres=# explain analyze select string_agg(a::text,',') from foo group by b;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=410045.19..447831.37 rows=1022895 width=37) (actual time=10195.972..14993.493 rows=1000000 loops=1)
   ->  Sort  (cost=410045.19..415045.19 rows=2000000 width=37) (actual time=10195.944..13659.985 rows=2000000 loops=1)
         Sort Key: b
         Sort Method: external merge  Disk: 97768kB
         ->  Seq Scan on foo  (cost=0.00..36667.00 rows=2000000 width=37) (actual time=0.018..321.197 rows=2000000 loops=1)
Total runtime: 15066.397 ms
(6 rows)
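(An aside: both plans in this mail spill the sort to disk because work_mem is at its default here. A hedged note: running the same test after something like

  set work_mem = '256MB';  -- session-local; the value is only illustrative

should keep the sort in memory; I have not applied it to the runs shown in this mail.)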
postgres=# explain analyze select array_agg(a::text) from foo group by b;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=410045.19..447831.37 rows=1022895 width=37) (actual time=10062.095..15697.755 rows=1000000 loops=1)
   ->  Sort  (cost=410045.19..415045.19 rows=2000000 width=37) (actual time=10062.059..13613.300 rows=2000000 loops=1)
         Sort Key: b
         Sort Method: external merge  Disk: 97768kB
         ->  Seq Scan on foo  (cost=0.00..36667.00 rows=2000000 width=37) (actual time=0.029..311.423 rows=2000000 loops=1)
Total runtime: 15799.226 ms
(6 rows)
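One hedged guess at why I cannot reproduce the crash: my planner picks GroupAggregate, which holds only one group's array_agg state at a time. If your plan used HashAggregate instead, the states for all groups are kept in memory at once, and as far as I know each array_agg state in 9.2 carries its own allocation overhead, so a couple of million distinct guids could plausibly add up to the gigabytes you saw. A plain EXPLAIN will show which plan you get; for testing you can nudge the planner toward the hash path (enable_sort and work_mem are stock settings; this is only a probe, not a fix):

postgres=# set work_mem = '2GB';     -- let the planner consider a hash table this large; illustrative
postgres=# set enable_sort = off;    -- discourage the GroupAggregate/sort path, for testing only
postgres=# explain select array_agg(a::text) from foo group by b;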
2013/8/19 Robert Sosinski <rsosinski@xxxxxxxxxxxxxxxxxxx>
At the moment, all guids are distinct; however, before I zapped the duplicates, there were 280 duplicates. Currently, there are over 2 million distinct guids.

-Robert

On Mon, Aug 19, 2013 at 11:12 AM, Pavel Stehule <pavel.stehule@xxxxxxxxx> wrote:
2013/8/19 Robert Sosinski <rsosinski@xxxxxxxxxxxxxxxxxxx>
Hi Pavel,

What kind of example do you need? I can't give you the actual data I have in the table, but I can give you an example query and the schema attached below. From there, I would just put in 2 million rows worth 1.2 GB of data. Average sizes of the extended columns (using the pg_column_size function), in bytes, are:

guid: 33
name: 2.41
currency: 4
fields: 120.32

example query:

-- find duplicate records using a guid
select guid, array_agg(id) from orders group by guid;
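(If the goal is only to spot duplicates, a lighter variant that never builds arrays should sidestep array_agg entirely; a sketch against the same example query, untested here:

-- find just the guids that occur more than once
select guid, count(*) as dupes
from orders
group by guid
having count(*) > 1;
)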
How many distinct guids are there, and how many duplicates?

Regards

Pavel
example schema:

                                       Table "public.things"
   Column   |            Type             |                      Modifiers                       | Storage  | Stats target | Description
------------+-----------------------------+------------------------------------------------------+----------+--------------+-------------
 id         | integer                     | not null default nextval('things_id_seq'::regclass)  | plain    |              |
 version    | integer                     | not null                                              | plain    |              |
 created_at | timestamp without time zone | not null                                              | plain    |              |
 updated_at | timestamp without time zone | not null                                              | plain    |              |
 foo_id     | integer                     | not null                                              | plain    |              |
 bar_id     | integer                     | not null                                              | plain    |              |
 baz_id     | integer                     | not null                                              | plain    |              |
 guid       | character varying           | not null                                              | extended |              |
 name       | character varying           | not null                                              | extended |              |
 price      | numeric(12,2)               | not null                                              | main     |              |
 currency   | character varying           | not null                                              | extended |              |
 amount     | integer                     | not null                                              | plain    |              |
 the_date   | date                        | not null                                              | plain    |              |
 fields     | hstore                      |                                                       | extended |              |
Indexes:
    "things_pkey" PRIMARY KEY, btree (id)
    "things_foo_id_idx" btree (foo_id)
    "things_bar_id_idx" btree (bar_id)
    "things_baz_id_idx" btree (baz_id)
    "things_guid_uidx" UNIQUE, btree (guid)
    "things_lpad_lower_name_eidx" btree (lpad(lower(name::text), 10, '0'::text))
    "things_price_idx" btree (price)
Foreign-key constraints:
    "things_foo_id_fkey" FOREIGN KEY (foo_id) REFERENCES foos(id)
    "things_bar_id_fkey" FOREIGN KEY (bar_id) REFERENCES bars(id)
    "things_baz_id_fkey" FOREIGN KEY (baz_id) REFERENCES bazs(id)
Triggers:
    timestamps_trig BEFORE INSERT OR UPDATE ON things FOR EACH ROW EXECUTE PROCEDURE timestamps_tfun()

Let me know if you need anything else.

Thanks,

On Mon, Aug 19, 2013 at 3:29 AM, Pavel Stehule <pavel.stehule@xxxxxxxxx> wrote:
Hello

please, can you send some example or test?

Regards

Pavel Stehule

2013/8/19 Robert Sosinski <rsosinski@xxxxxxxxxxxxxxxxxxx>
When using array_agg on a large table, memory usage seems to spike until Postgres crashes with the following error:

2013-08-17 18:41:02 UTC [2716]: [2] WARNING: terminating connection because of crash of another server process
2013-08-17 18:41:02 UTC [2716]: [3] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-08-17 18:41:02 UTC [2716]: [4] HINT: In a moment you should be able to reconnect to the database and repeat your command.
I've definitely isolated it to array_agg: when I changed the query to use string_agg, it worked fine. I also tried array_agg in a few different queries, all of which hit the same issue, and swapping in string_agg fixed it each time.
This particular table has over 2 million rows and is 1.2 GB, and when I ran the query while watching htop, the virtual size of the Postgres process ballooned to 13.9 GB before the crash.

The version of Postgres I am using is: PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit

Any help would be much appreciated, thanks!
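For what it is worth, one way to watch a single backend rather than the whole box (pg_backend_pid() is a stock function; the htop flag is just one option): run

  select pg_backend_pid();

in the session that will execute the query, then from a shell watch that PID with htop -p <pid> before firing off the array_agg query.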
-Robert