On Mon, Mar 23, 2020 at 02:35:48PM +0100, Laurenz Albe wrote:
> On Fri, 2020-03-20 at 22:50 +0100, Kurt Roeckx wrote:
> > I have a few tables that are append only. There are only insert
> > and select queries, never update or delete.
> >
> > What I see is that every file is still being updated. It's
> > currently about 500 GB big, and each of those almost 500 files has
> > been touched in the past 24 hours.
> >
> > I assume that the free space map is being used, and that it
> > still finds places where it can insert a row in one of the files.
> >
> > (auto) vacuum is not happening on the table.
>
> This is probably the first reader setting hint bits on the table rows.
>
> To determine whether a row is visible or not, the first reader has
> to consult the commit log to see if the xmin and xmax special columns
> of the row belong to committed transactions or not.
>
> To make life easier for future readers, it will then set special
> flags on the row that provide that information without the requirement
> to consult the commit log.
>
> This modifies the row, even if the data don't change, and the row
> has to be written again.
>
> > Is there a way I can turn off this behaviour, so that it really
> > only writes to the last few pages?
>
> You can explicitly read or vacuum the new rows; that will set the
> hint bits.
>
> But, as has been explained, at some point the table will have to receive
> an anti-wraparound vacuum that will freeze old rows.
>
> So the best you can do is to VACUUM (FREEZE) the table after you load
> data. Then the table should not be modified any more.

I did a normal vacuum, and it seems to be behaving better; it's not
writing all over the old files anymore. I think I'll set
autovacuum_freeze_max_age a lot lower than the default 200 M. Note
that this is not a static table, it will always be adding more rows.

The behaviour I'm still seeing is that it reads the table all over
during the import of new data. I assume that is also what caused the
writes before. I would really like to avoid all the random reads, but
I'm not sure it can use the data from the index alone to avoid reading
the data file itself.
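Concretely, for the freeze settings I have something like the following
in mind. The threshold is only a first guess and I haven't tried it
yet; the idea, as I understand Laurenz's suggestion, is that rows that
get frozen right after a load should not need to be rewritten later by
an anti-wraparound vacuum:

ALTER TABLE raw_certificates SET (autovacuum_freeze_max_age = 10000000);

-- and after each bulk load, freeze the newly inserted rows explicitly
VACUUM (FREEZE, VERBOSE) raw_certificates;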
My table looks like this:

                                 Table "public.raw_certificates"
       Column       |  Type   | Collation | Nullable |                   Default
--------------------+---------+-----------+----------+---------------------------------------------
 certificate        | bytea   |           | not null |
 id                 | bigint  |           | not null | nextval('raw_certificate_id_seq'::regclass)
 sha256_fingerprint | bytea   |           | not null |
 pre_certificate    | boolean |           | not null |
Indexes:
    "raw_certificates_pkey" PRIMARY KEY, btree (id)
    "raw_certificates_sha256_fingerprint_key" UNIQUE CONSTRAINT, btree (sha256_fingerprint)
Referenced by:
    TABLE "certificates" CONSTRAINT "certificates_raw_certificate_id_fkey" FOREIGN KEY (raw_certificate_id) REFERENCES raw_certificates(id)
    TABLE "ct_entry" CONSTRAINT "ct_entry_raw_certificate_id_fkey" FOREIGN KEY (raw_certificate_id) REFERENCES raw_certificates(id)

To import data into it, I currently do:

CREATE TEMP TABLE import_certs (certificate bytea NOT NULL, sha256_fingerprint bytea);

COPY import_certs (certificate) FROM stdin;

UPDATE import_certs SET sha256_fingerprint = digest(certificate, 'sha256');

INSERT INTO raw_certificates (sha256_fingerprint, certificate, pre_certificate)
  SELECT i.sha256_fingerprint, i.certificate, false
  FROM import_certs AS i
  ON CONFLICT DO NOTHING;

The behaviour I currently see is:
- It does a read from a raw_certificates_sha256_fingerprint_key file,
  then from a raw_certificates file, then again from
  raw_certificates_sha256_fingerprint_key, repeating this about 5 times.
- Then it does a write and a read on the import_certs table.

I guess that after reading from the index, it also needs to check the
table to see whether the row is still visible, or something like that.
Is there a way to avoid this? The write to import_certs is very
confusing to me (but see the P.S. below).

Anyway, the major reason for the sha256_fingerprint field is just to
remove duplicates; I only want one copy of each certificate in that
table.

Does anybody have a suggestion on how to improve the performance? Once
I catch up with all the old data again, I expect this table alone to
be in the order of 10 TB, and to grow at around 250 GB / month. And I
think I need to start considering moving it to SSDs to keep up.

Kurt
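P.S.: One thing I still want to try, in case the write to import_certs
really does come from the separate UPDATE, is to compute the digest
directly in the INSERT and drop the extra column from the temp table.
Untested sketch:

CREATE TEMP TABLE import_certs (certificate bytea NOT NULL);

COPY import_certs (certificate) FROM stdin;

-- compute the fingerprint on the fly instead of rewriting import_certs first
INSERT INTO raw_certificates (sha256_fingerprint, certificate, pre_certificate)
  SELECT digest(i.certificate, 'sha256'), i.certificate, false
  FROM import_certs AS i
  ON CONFLICT DO NOTHING;

That should at least avoid rewriting every row of the temp table once
per batch; whether it changes anything about the reads on
raw_certificates, I don't know.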