Hi pgsql-performance!
So I have a PostgreSQL database -- version "PostgreSQL 9.4.8 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit", specifically. In it, I have essentially two categories of tables:
- small tables that are updated frequently, and tend to be often queried in their entirety (seq scan)
- large tables that are updated infrequently, and tend to be often queried using an index
Let us assume that I have a table "A" that falls in the small category, and a table "B" that falls in the large category.
The problem I'm having is that it is very difficult to do any amount of maintenance on tables like B without DoSing any queries that reference table A. The reason, as far as I can tell, is that running any long statement against a table like B causes every update to table A to be kept around as a dead row version until the statement on B completes (as per the READ COMMITTED transaction isolation level -- the statement against B must only see rows committed before it started, so VACUUM cannot reclaim anything newer). This makes sense -- it's required to keep ACID.
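(As an aside, this is easy to confirm while the long statement is running -- a sketch, using pg_stat_activity's backend_xmin column, which exists as of 9.4:)
-- find backends whose open snapshots are holding back the xmin horizon
-- (and thus preventing VACUUM from reclaiming dead rows in a)
select pid, state,
       now() - xact_start as xact_age,
       backend_xmin,
       left(query, 60) as query
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc;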
However, there are times where I need to do large operations to B -- and these operations can be literally anything, but I'll focus on my most recent need: running a "pg_dump" against table B.
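(For concreteness, the dump is invoked along these lines -- the database and file names here are placeholders:)
pg_dump database --table=b --format=custom --file=b.dump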
I should add that table B is never involved with any query that touches table A -- in this case, it is an append-only table that records changes to a table C that is equivalently never involved with table A.
So, on to the data on which I base the above claims:
Let table A have 43 thousand rows:
database=> select count(*) from a;
-[ RECORD 1 ]
count | 43717
Time: 10447.681 ms
Let table B have 21 million rows:
database=> select count(id) from b;
-[ RECORD 1 ]---
count | 21845610
Time: 116873.051 ms
Assume a pg_dump operation is copying table B, i.e. there's a currently running query that looks like "COPY public.b (id, ...) TO STDOUT"
Then this is what I get for running a verbose vacuum against A:
database=> vacuum verbose a;
INFO: vacuuming "public.a"
INFO: index "a_pkey" now contains 2119583 row versions in 9424 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.04s/0.03u sec elapsed 0.49 sec.
INFO: "a": found 0 removable, 2112776 nonremovable row versions in 185345 out of 186312 pages
DETAIL: 2069676 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 1.28s/1.15u sec elapsed 22.93 sec.
INFO: vacuuming "pg_toast.pg_toast_18889"
INFO: index "pg_toast_18889_index" now contains 31 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_18889": found 0 removable, 31 nonremovable row versions in 7 out of 7 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 23035.282 ms
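(You can watch the dead row count climb without paying for a full VACUUM VERBOSE each time -- a sketch against the stats views:)
-- dead vs. live tuples for a, as tracked by the statistics collector
select relname, n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname = 'a';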
... and here's how long it takes to read all of the rows:
database=> select max(an unindexed bigint column) from a;
-[ RECORD 1 ]--------
max | <some number>
Time: 10624.368 ms
Running this another time immediately afterward (to show the cached speed) returns:
Time: 13782.363 ms
If I go to a separate database cluster that has an equivalent schema and a roughly equivalent table a (within ±2% of the row count), the same queries look more like this:
meraki_shard_production=> vacuum verbose a;
INFO: vacuuming "public.a"
INFO: index "a_pkey" now contains 42171 row versions in 162 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "a": found 487 removable, 42286 nonremovable row versions in 7809 out of 7853 pages
DETAIL: 373 dead row versions cannot be removed yet.
There were 42436 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.02u sec elapsed 0.01 sec.
INFO: vacuuming "pg_toast.pg_toast_19037"
INFO: index "pg_toast_19037_index" now contains 57 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_19037": found 0 removable, 57 nonremovable row versions in 12 out of 12 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 32.890 ms
meraki_shard_production=> select max(the same unindexed bigint column) from a;
max
-----------------
<some number>
(1 row)
Time: 16.696 ms
(The second iteration takes 15.320 ms)
So, the way I see it, my problem boils down to table "A" getting roughly 100-1000x slower when it gets roughly 20-50x bigger (depending on whether you measure in pages or tuples). Unfortunately, in my use case, table "A" acts as a join table for a lot of aspects of our company's webapp. Every 10 minutes, the table is queried for 35 million rows via sequential scan (~800 seq scans every 10 minutes, i.e. ~1.3 per second on average) and 6.5 million rows via index lookup. When a sequential scan over 40k rows takes less than 1 second, everything is fine -- when it takes 10+ seconds, the database starts to slow down significantly. Thankfully, queries can share sequential scans, but you can imagine how the responsiveness of the webapp suffers as a consequence.
There's also a secondary effect: once the query on B finally completes, there are many in-flight queries against A (and other related tables) slow enough to potentially increase the size of A even further. It is not uncommon, after some maintenance query against B has completed, for queries involving A that usually complete in roughly 300ms to start taking upwards of 30 minutes.
Our go-to solution has been to detect and stop these maintenance queries if they take too long, and then to CLUSTER table A. This puts a cap on how long any maintenance query can take -- down to somewhere around 1 hour.
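(Roughly, that procedure looks like the following sketch -- the pid is a placeholder, found via pg_stat_activity as above:)
-- cancel the offending maintenance query...
select pg_cancel_backend(12345);
-- ...then rewrite a in primary-key order to shed the accumulated bloat
-- (note that this takes an exclusive lock on a while it runs)
cluster a using a_pkey;
analyze a;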
And thus my query to you guys:
What can I do to keep running long maintenance operations on large tables (SELECTing significant fractions of B, DELETEing significant fractions of B, running VACUUM FULL on B) without denying other PostgreSQL backends their ability to efficiently query table A? Or, in other words, how do I avoid incurring the cost of transaction isolation for queries against B, on a case-by-case basis?
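(One idea I've toyed with, sketched here with a placeholder batch size: replace the single long-lived COPY with keyset-paginated slices, each in its own short transaction, so that no one snapshot pins A's dead rows for hours:)
-- export b in slices keyed on its primary key; each statement gets a
-- fresh snapshot, so the xmin horizon keeps advancing
copy (select * from b where id > 0      and id <= 100000) to stdout;
copy (select * from b where id > 100000 and id <= 200000) to stdout;
-- ...continuing from the last id seen, driven by a script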
Anything is on the table for implementation:
- moving tables to a different database / cluster / completely different DBMS (see the sketch below)
- designing an extension to tune either set of queries
- partitioning tables
- etc
... although the simpler the better. If you were in this position, what would you do?
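(On the first of those options: as I understand it, the xmin horizon for ordinary, non-shared tables is computed per-database, so a long snapshot in one database shouldn't pin dead rows in another database in the same cluster. A sketch, with placeholder names:)
-- give b its own database so that its long-running reads no longer
-- affect vacuum on a ("archive_db" is a placeholder name)
create database archive_db;
-- then copy the data over, e.g.:
--   pg_dump database --table=b | psql archive_db
-- ...and drop b from the original database once the copy is verified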
Regards,
James