> At first, thanks for your fast and comprehensive help.
>
> The structure of my cache table is
>
> a text, b text NOT NULL, c text, d text, e timestamp without time zone
> DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0,
> s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0
> additional constraints: primary key (b), Unique(b), Unique(a)
> Indexes: Index on a, Index on b

This looks redundant, e.g. you don't need a unique index on b if you
already have a primary key on it. Can you post the complete table
definition?

...

> One remark which might help: overall 90 - 95 % of the s1-s512 columns
> are 0. I am only interested in columns not equal to 0. Perhaps it
> would make sense to use an array or json and enumerate only the values
> not equal to 0.

Could you change that to replace the 0 values with NULLs? This would
greatly reduce your table space, because Postgres is very efficient
about NULL storage: all the NULLs of a row are marked in a bitmap within
the row header, so a NULL needs only about one bit instead of the
4 bytes a zero takes, and that should get rid of your I/O issue.
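As an untested sketch of what that could look like - the statements
below assume the table is really called cachetable, spell out only the
first two of the 512 columns, and guess at the name of the unique
constraint, so adjust them to your schema:

-- the primary key on b already gives you a unique index, so the extra
-- UNIQUE(b) constraint can go (the constraint name here is a guess):
ALTER TABLE cachetable DROP CONSTRAINT cachetable_b_key;

-- let the s-columns default to NULL instead of 0 ...
ALTER TABLE cachetable
    ALTER COLUMN s1 DROP DEFAULT,
    ALTER COLUMN s2 DROP DEFAULT;           -- ... and so on up to s512

-- ... and turn the zeros that are already stored into NULLs:
UPDATE cachetable
   SET s1 = NULLIF(s1, 0),
       s2 = NULLIF(s2, 0);                  -- ... and so on up to s512

-- if the application still wants to see 0, read the columns back with
-- COALESCE:
SELECT COALESCE(s1, 0) AS s1, COALESCE(s2, 0) AS s2, g, d
  FROM worktable
  JOIN cachetable USING (b)
 ORDER BY g;

Keep in mind that such a full-table UPDATE rewrites every row of the
80 GB table, so it is probably easier to load the data into a new table
with the NULLs from the start, or at least to run VACUUM FULL afterwards
to reclaim the space.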
regards,

Marc Mamin

________________________________________
From: pgsql-performance-owner@xxxxxxxxxxxxxx
[pgsql-performance-owner@xxxxxxxxxxxxxx] on behalf of Björn Wittich
[Bjoern_Wittich@xxxxxx]
Sent: Saturday, 20 September 2014 09:19
To: Josh Berkus; pgsql-performance@xxxxxxxxxxxxxx
Subject: Re: query a table with lots of columns

At first, thanks for your fast and comprehensive help.

The structure of my cache table is

a text, b text NOT NULL, c text, d text, e timestamp without time zone
DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0,
s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0
additional constraints: primary key (b), Unique(b), Unique(a)
Indexes: Index on a, Index on b

This table has 30 mio rows (and will grow to 50 mio in the future).

My working table is

b text, g integer
Indexes on b and g

This table has 5 mio rows.

Scenario - what I want to achieve:

SELECT s1, s2, s3, ..., s512, g, d
FROM <worktable>
INNER JOIN <cachetable> USING (b)
ORDER BY g

The inner join will match at least 95 % of the rows of the smaller
worktable, in this example 4.75 mio rows.

Running this query takes several hours until I receive the first
results. Query analysis shows that the execution plan does two
sequential scans, on the cache table and on the work table.

When I divide this huge statement into

SELECT s1, s2, s3, ..., s512, g, d FROM <worktable>
INNER JOIN <cachetable> USING (b) WHERE g BETWEEN 1 AND 10000 ORDER BY g,
SELECT s1, s2, s3, ..., s512, g, d FROM <worktable>
INNER JOIN <cachetable> USING (b) WHERE g BETWEEN 10001 AND 20000 ORDER BY g,
....

(I can do this because g is a unique and continuous id from 1 to N),
the result is fast, but firing the requests in parallel (4-8 at a time)
slows down the retrieval. The execution plan changes to use the indexes
when the "BETWEEN 1 AND 10000" is added.

One remark which might help: overall 90 - 95 % of the s1-s512 columns
are 0. I am only interested in columns not equal to 0. Perhaps it would
make sense to use an array or json and enumerate only the values not
equal to 0.

Statistics on the large table:
table size: 80 GB
toast table size: 37 GB
size of indexes: 17 GB

Thanks for your help and ideas

Björn

On 19.09.2014 23:40, Josh Berkus wrote:
> On 09/19/2014 04:51 AM, Björn Wittich wrote:
>> I am relatively new to postgres. I have a table with 500 columns and
>> about 40 mio rows. I call this a cache table, where one column is a
>> unique key (indexed) and the other 499 columns (type integer) are
>> some values belonging to this key.
>>
>> Now I have a second (temporary) table (only 2 columns, one of them
>> the key of my cache table) and I want to do an inner join between my
>> temporary table and the large cache table and export all matching
>> rows. I found out that the performance increases when I limit the
>> join to lots of small parts.
>> But it seems that the database needs a lot of disk I/O to gather all
>> 499 data columns.
>> Is there a possibility to tell the database that all these columns
>> are always treated as tuples and that I always want to get the whole
>> row? Perhaps the disk organization could then be optimized?
> PostgreSQL is already a row store, which means by default you're
> getting all of the columns, and the columns are stored physically
> adjacent to each other.
>
> If requesting only one or two columns is faster than requesting all
> of them, that's pretty much certainly due to transmission time, not
> disk IO. Otherwise, please post your schema (well, a truncated
> version) and your queries.
>
> BTW, in cases like yours I've used an INT array instead of 500
> columns to good effect; it works slightly better with PostgreSQL's
> compression.
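For reference, a minimal sketch of the INT array layout Josh describes;
the table name cachetable_arr and the fill query are only an
illustration, not something from the thread, and whether it pays off
depends on how the values are accessed:

-- one integer[] column instead of s1 .. s512:
CREATE TABLE cachetable_arr (
    b text PRIMARY KEY,
    s integer[]                       -- holds s[1] .. s[512]
);

-- populate it from the existing wide table (column list abbreviated):
INSERT INTO cachetable_arr (b, s)
SELECT b, ARRAY[s1, s2, s3 /* ..., s512 */]
  FROM cachetable;

-- individual values are then addressed by position:
SELECT s[1], s[17]
  FROM cachetable_arr
 WHERE b = 'some key';

A row then stores the 512 values as a single array datum, which TOAST
can compress as a whole - presumably the compression benefit Josh
mentions.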