Re: database 1.2G, pg_dump 73M?!

"Joris Dobbelsteen" <Joris@xxxxxxxxxxxxxxxxxxxxx> · Sun, 30 Mar 2008 23:44:19 +0100

>-----Original Message-----
>From: Ross Boylan [mailto:RossBoylan@xxxxxxxxxxxxxxxxxx] 
>Sent: Monday, 31 March 2008 0:23
>To: Joris Dobbelsteen
>Cc: RossBoylan@xxxxxxxxxxxxxxxxxx; pgsql-general@xxxxxxxxxxxxxx
>Subject: RE:  database 1.2G, pg_dump 73M?!
>
>
>On Sun, 2008-03-30 at 22:59 +0100, Joris Dobbelsteen wrote:
>> >-----Original Message-----
>> >From: Ross Boylan [mailto:RossBoylan@xxxxxxxxxxxxxxxxxx]
>> >Sent: Sunday, 30 March 2008 23:43
>> >To: Joris Dobbelsteen
>> >Cc: RossBoylan@xxxxxxxxxxxxxxxxxx; pgsql-general@xxxxxxxxxxxxxx
>> >Subject: Re:  database 1.2G, pg_dump 73M?!
>> >
>> >On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
>> >>  From the top contenders, about half are indexes, so you 
>are stuck 
>> >> with ~200 MB of data in the tables.
>> >> Postgresql has some wasted space due to placement of the 
>tuples in 
>> >> a block and overhead for each block and row. I don't know
>> >those values,
>> >> but they are in the range of 24 bytes per tuple, I believe. 
>> >Secondly a
>> >> block is 8 KB by default and tuples cannot be stored into 
>multiple 
>> >> blocks (thats what toast to work around).
>> >> 
>> >> All in all: Lookup tuple sizes, if they are small than 
>the overhead 
>> >> from postgresql can be a big factor. If you are huge you loose on 
>> >> portions of unoccupied space in blocks. I believe pg_statistics 
>> >> will provide this information.
>> >There is a pg_statistic (no "s")  table, but I don't know 
>how to get 
>> >tuple size from it--the documentation refers to the source code to 
>> >figure out the codes.  Backing up a step, I don't know what a tuple 
>> >is in Postgres, and don't see an entry for it in the index.
>> 
>> It was pg_stats.
>> You get avg_width. It gives this per column.
>> 
>> So probably you want
>> SELECT tablename, SUM(avg_width)
>> FROM pg_stats
>> WHERE schemaname = 'public'
>[Thanks; I didn't know about using schemaname to limit it to 
>interesting tables]
>> GROUP BY tablename [RB added] ORDER BY tablename;
> tablename | sum 
>-----------+-----
> client    |  62
> counters  |  25
> file      | 109
> filename  |  18
> fileset   |  53
> job       | 149
> jobmedia  |  52
> media     | 226
> mediatype |  16
> path      |  82
> pool      | 179
> status    |  29
> storage   |  23
> version   |   4
>
>So, for example, if each tuple has 24 bytes of overhead, the 
>overhead more than doubles the size of the file table (18 
>bytes), which has a big record count.  Am I following correctly?

Yes.
(Note not to pin down on the 24 bytes, it varies between versions. I
think this is close enough however).

>Between the space taken up by indices and the other overhead, 
>the size difference between the sql dump and the db disk 
>useage is starting to seem more reasonable.
>
>The database uses SQL-ASCII encoding, so I'd expect the 
>textual parts (filenames and paths) to take up the same space 
>(just for the basic storage, not counting overhead/indices) in 
>Postgres as on the dump, 1 byte/character.

That's the idea.

>I'm not sure what "If you are huge you loose on portions of 
>unoccupied space in blocks" means.

A tuple has to be in exactly 1 page (or block). It cannot span multiple
pages. Hence if your tuples happens to be a half page size or slightly
larger, than only a single one will fit in a single page. So you waste
half the page. You don't have to worry about this, as your tuples are
significantly smaller than that.

(As a reference only: if a tuple is larger than a blocksize some
attributes will be evicted to the toast table or compressed. So that is
to overcome the limitations of 8000 bytes per tuple.)

[snip]
>> 
>> ># select distinct relname, reltuples, relpages from pg_class where 
>> >relkind='r' and substring(relname, 1, 3) != 'pg_';
>> >
>> >         relname         | reltuples | relpages 
>> >-------------------------+-----------+----------
>> > basefiles               |         0 |        0
>> > cdimages                |         0 |        0
>> > client                  |         2 |        1
>> > counters                |         1 |        1
>> > device                  |         0 |        0
>> > file                    |    650659 |    11558
>> > filename                |    623012 |     3958
>> > fileset                 |        22 |        1
>> > job                     |       384 |       10
>> > jobmedia                |       596 |        7
>> > location                |         0 |        0
>> > locationlog             |         0 |        0
>> > log                     |         0 |        0
>> > media                   |       245 |        9
>> > mediatype               |         2 |        1
>> > path                    |     67908 |      966
>> > pool                    |         5 |        1
>> > sql_features            |       439 |        6
>> > sql_implementation_info |        12 |        1
>> > sql_languages           |         4 |        1
>> > sql_packages            |        10 |        1
>> > sql_parts               |         9 |        1
>> > sql_sizing              |        23 |        1
>> > sql_sizing_profiles     |         0 |        0
>> > status                  |        19 |        1
>> > storage                 |         2 |        1
>> > unsavedfiles            |         0 |        0
>> > version                 |         1 |        1
>> 
>> Do relpages * 8096 / reltuples.
>> Default installs have 8 KB pages/blocks by default.
>> 
>> For file it should be ~144 bytes/tuple.
>> For filename it should be ~51 bytes/tuple.
>
>Those tuple size estimates seem consistent with the tuple 
>sizes reported earlier when the overhead is added in.  I'm impressed!
>
[snip]

- Joris

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general