Re: Using array instead of sub table (storage and speed)

Hi Stephen,


Thanks for your reply. The data in the sub table [sp] are only ever read as a block, meaning I always read all entries in [sp] that belong to one entry in [s]. So I would lose little in terms of what I can do with the data in [sp], and I could save around 2.8K per entry in [s] just by eliminating the per-tuple overhead of the [sp] rows.


One thing I still wonder about is how much this would affect the performance of retrieving data from [s].

I often need some data from [s] in cases where I don't care about [sp]. How much would having these arrays as part of [s] slow those queries down? Or would it be better to store the array data in a separate table, e.g. keep [s] as it is now but turn [sp] into an array-aggregated table, along the lines of the sketch below.
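To make that concrete, here is roughly what I have in mind (table and column names are invented for illustration; [s] of course has more columns than this):

CREATE TABLE s (
   id      bigint PRIMARY KEY,
   payload text              -- stands in for the existing columns of [s]
);

-- [sp] collapsed to one row per [s] entry, its values held in arrays
CREATE TABLE sp_agg (
   s_id   bigint PRIMARY KEY REFERENCES s(id),
   vals_a float8[],          -- formerly one [sp] row per element
   vals_b float8[]
);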


Thanks,

Lutz



On 15/06/17 15:37, Stephen Frost wrote:
Greetings,

* Lutz Fischer (l.fischer@xxxxxxxx) wrote:
> Data in [sp] are never changed. I can probably reduce the size by
> changing datatypes from numeric to float but I was wondering if it
> would be more efficient - primarily in terms of storage - to change
> the structure to have two arrays in [s]. E.g.

The short answer is 'probably', but it really depends on how wide your
rows are.

> I haven't really found anything yet on how much space (e.g. how many
> bytes) an array will use compared to a table row in postgresql.

There's a 24-byte overhead for each tuple.  If the width of the tuple's
columns ends up being less than 24 bytes then half (or more) of the
space used is for the tuple header.  Arrays have a bit of overhead
themselves but are then densely packed.
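Both overheads are easy to see with the built-in pg_column_size()
function, for example:

-- A bare int is just 4 bytes of data:
SELECT pg_column_size(1::int);               -- 4

-- A small int[] carries a 24-byte array header plus its elements:
SELECT pg_column_size(ARRAY[1,2,3]::int[]);  -- 36 = 24 header + 3 * 4

The array header is paid once per array, while the 24-byte tuple
header is paid again for every row.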

In testing that I've done, a table which looks like:

CREATE TABLE t1 (
   c1 int
);

Will end up with a couple hundred rows per 8k page (perhaps 250 or so),
meaning that you get ~1k of actual data for 8k of storage.  Changing
this to an array, like so:

CREATE TABLE t1 (
   c1 int[]
);

And then storing 3-4 tuples per 8k page (keeping each tuple under the
2k TOAST limit) led to being able to store something like 450 ints per
tuple, with a subsequent result of 1800 ints per page and ~7.2k worth
of actual data for 8k of storage, which was much more efficient for
storage.
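That sort of comparison can be reproduced with something along these
lines (a sketch, not necessarily the exact test that was run; table
names are made up):

CREATE TABLE t_plain (c1 int);
INSERT INTO t_plain SELECT g FROM generate_series(1, 100000) g;

CREATE TABLE t_array (c1 int[]);
-- Batch ~450 ints per row, mirroring the figure above:
INSERT INTO t_array
SELECT array_agg(g) FROM generate_series(1, 100000) g
GROUP BY (g - 1) / 450;

-- Compare the on-disk size of the two layouts:
SELECT pg_relation_size('t_plain') AS plain_bytes,
       pg_relation_size('t_array') AS array_bytes;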

Of course, the tuple header is actually useful data in many
environments.  If you go with this approach then you have to work out
how to deal with the fact that a given tuple is either visible or not,
that all the ints in the array for that tuple share that visibility,
and that an update to that tuple locks the entire tuple and that whole
set of ints, etc.
If the data isn't changed after being loaded and you're able to load an
entire tuple all at once then this could work.
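To illustrate that load-once pattern with the array version of t1
above (again just a sketch):

-- Load one entire array tuple in a single statement:
INSERT INTO t1 (c1)
SELECT array_agg(g) FROM generate_series(1, 450) g;

-- And read the whole block back as individual values:
SELECT unnest(c1) FROM t1;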

Note that arrays aren't inherently more efficient than just using
individual columns; this technique is only going to be helpful if the
tuple overhead in your situation is a large portion of the data and
batching values into arrays lets you reduce the number of tuples
stored.

Thanks!

Stephen


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.





