First of all, thanks for your quick and comprehensive help.
The structure of my cache table is:

a text, b text NOT NULL, c text, d text, e timestamp without time zone
DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0,
s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0

Additional constraints: PRIMARY KEY (b), UNIQUE (b), UNIQUE (a)
Indexes: index on a, index on b

This table has 30 million rows (and will grow to 50 million in the future).
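For reference, a condensed DDL sketch of the above (column list
abbreviated; this is a transcription of the description, not a schema dump):

    CREATE TABLE cachetable (
        a  text,
        b  text NOT NULL,
        c  text,
        d  text,
        e  timestamp without time zone DEFAULT now(),
        f  text,
        s1 integer DEFAULT 0,
        s2 integer DEFAULT 0,
        -- ... s3 through s512, all "integer DEFAULT 0"
        PRIMARY KEY (b),
        UNIQUE (a)
    );
    -- Note: the separate UNIQUE (b) and the extra index on b should be
    -- redundant, because PRIMARY KEY (b) already creates a unique index
    -- on b.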
My work table is:

b text, g integer

Indexes on b and g

This table has 5 million rows.
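In the same notation:

    CREATE TABLE worktable (
        b text,
        g integer
    );
    CREATE INDEX ON worktable (b);
    CREATE INDEX ON worktable (g);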
Scenario:
What I want to achieve:
SELECT s1, s2, s3, ..., s512, g, d
FROM <worktable> INNER JOIN <cachetable> USING (b)
ORDER BY g
The inner join will match at least 95% of the rows of the smaller work
table, in this example 4.75 million rows.
Running this query takes several hours before I receive the first
results. EXPLAIN shows that the execution plan does two sequential
scans, one on the cache table and one on the work table.
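(A server-side cursor might help here: it can hand back the first rows
before the whole result set has been transferred. A minimal sketch,
assuming the placeholder table names from above:

    BEGIN;
    DECLARE big_join NO SCROLL CURSOR FOR
        SELECT s1, s2, s3, /* ... */ s512, g, d
        FROM worktable INNER JOIN cachetable USING (b)
        ORDER BY g;
    FETCH 10000 FROM big_join;  -- first batch; repeat until empty
    CLOSE big_join;
    COMMIT;

Caveat: if the plan sorts the entire join result before returning
anything, the first FETCH still waits for the sort; only a plan that
walks an index on g can stream rows immediately.)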
When I divide this huge statement into

SELECT s1, s2, s3, ..., s512, g, d
FROM <worktable> INNER JOIN <cachetable> USING (b)
WHERE g BETWEEN 1 AND 10000 ORDER BY g;

SELECT s1, s2, s3, ..., s512, g, d
FROM <worktable> INNER JOIN <cachetable> USING (b)
WHERE g BETWEEN 10001 AND 20000 ORDER BY g;

...

(I can do this because g is a unique and continuous id running from 1 to N)
The result is fast, but firing the requests in parallel (4-8 at a time)
slows down retrieval.
The execution plan changes to use the indexes as soon as the "BETWEEN 1
AND 10000" condition is added.
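(The slice statements do not have to be written by hand; they can be
generated. A hypothetical sketch using generate_series, assuming a
10,000-wide slice, N = 5,000,000, and the real table names; the column
list is abbreviated:

    SELECT format(
        'SELECT s1, s2, s3, /* ... */ s512, g, d
         FROM worktable JOIN cachetable USING (b)
         WHERE g BETWEEN %s AND %s ORDER BY g;',
        lo, lo + 9999)
    FROM generate_series(1, 5000000, 10000) AS lo;

In psql 9.6 or later, the trailing ';' can be replaced with \gexec to
run each generated statement in turn; with older versions, write the
output to a script file and execute it with \i.)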
One remark which might help: overall, 90-95% of the s1-s512 values are
0, and I am only interested in the values that are not 0. Perhaps it
would make sense to use an array or JSON and enumerate only the nonzero
values.
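(A hypothetical sketch of that idea, storing only the nonzero values as
jsonb; it needs PostgreSQL 9.5 or later for to_jsonb, the jsonb - text
operator, and jsonb_object_agg, and it assumes the column names from
above:

    ALTER TABLE cachetable ADD COLUMN s_sparse jsonb;
    UPDATE cachetable c
    SET s_sparse = (
        SELECT jsonb_object_agg(kv.key, kv.value::integer)
        FROM jsonb_each_text(
                 to_jsonb(c) - 'a' - 'b' - 'c' - 'd' - 'e' - 'f'
                             - 's_sparse') AS kv
        WHERE kv.value <> '0'   -- keep only the nonzero s-columns
    );

Rows whose s-columns are all 0 end up with s_sparse NULL; wrap the
subquery in COALESCE(..., '{}'::jsonb) if an empty object is preferred.)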
Statistics on the large table:
table size: 80 GB
TOAST table size: 37 GB
size of indexes: 17 GB
Thanks for your help and ideas
Björn
On 19.09.2014 23:40, Josh Berkus wrote:
On 09/19/2014 04:51 AM, Björn Wittich wrote:
I am relatively new to postgres. I have a table with 500 columns and
about 40 million rows. I call this the cache table; one column is a
unique key (indexed) and the other 499 columns (type integer) are
values belonging to this key.
Now I have a second (temporary) table (only 2 columns, one of which is
the key of my cache table), and I want to do an inner join between my
temporary table and the large cache table and export all matching rows.
I found out that performance increases when I split the join into lots
of small parts.
But it seems that the database needs a lot of disk I/O to gather all
499 data columns.
Is there a possibility to tell the database that all these columns are
always treated as tuples and that I always want to get the whole row?
Perhaps the disk organization could then be optimized?
PostgreSQL is already a row store, which means by default you're getting
all of the columns, and the columns are stored physically adjacent to
each other.
If requesting only one or two columns is faster than requesting all of
them, that's pretty much certainly due to transmission time, not disk
I/O. Otherwise, please post your schema (well, a truncated version) and
your queries.
BTW, in cases like yours I've used an INT array instead of 500 columns
to good effect; it works slightly better with PostgreSQL's compression.
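(A minimal sketch of that suggestion, with hypothetical table and
column names; s[17] takes the place of s17:

    CREATE TABLE cachetable_arr (
        b text PRIMARY KEY,
        d text,
        s integer[] NOT NULL  -- 512 elements, s[1] .. s[512]
    );

    SELECT w.g, c.d, c.s
    FROM worktable w
    JOIN cachetable_arr c USING (b)
    ORDER BY w.g;

The wide array value is compressed and, if large enough, moved out of
line by TOAST, which is where the compression benefit comes from.)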