In a research project involving text repetition analysis (on top of public documents)
I have a VirtualMachine (CPU AMD Epyc 7502P, 128GB RAM, 12TB HDD, 2TB SSD),
running postgres 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1)
and some tables with many rows:
nsoamt=> ANALYSE VERBOSE SentenceSource;
INFO: analyzing "public.sentencesource"
INFO: "sentencesource": scanned 30000 of 9028500 pages, containing 3811990 live rows and 268323 dead rows; 30000 rows in sample, 1147218391 estimated total rows
ANALYZE
nsoamt=> ANALYSE VERBOSE SentenceToolCheck;
INFO: analyzing "public.sentencetoolcheck"
INFO: "sentencetoolcheck": scanned 30000 of 33536425 pages, containing 498508 live rows and 25143 dead rows; 30000 rows in sample, 557272538 estimated total rows
ANALYZE
nsoamt=> ANALYZE VERBOSE Document;
INFO: analyzing "public.document"
INFO: "document": scanned 30000 of 34570 pages, containing 1371662 live rows and 30366 dead rows; 30000 rows in sample, 1580612 estimated total rows
ANALYZE
The estimates for the number of rows above are accurate.
I am running this query
SELECT COUNT(*), COUNT(NULLIF(Stchk.haserrors,'f'))
FROM SentenceToolCheck Stchk
WHERE EXISTS (SELECT SSrc.sentence
FROM SentenceSource SSrc, Document Doc
WHERE SSrc.sentence = Stchk.id
AND Doc.id = SSrc.document
AND Doc.source ILIKE '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%');
and I have 2 (related?) problems
1 - the query is making a postgresql project have 76.7 GB resident RAM usage.
Having a WORK_MEM setting of 2GB (and "simple" COUNT() results),
that was not expected.
(I risk oom-killer killing my postgres as soon as I run another concurrent
query.)
The memory settings are:
work_mem = 2GB
shared_buffers = 16GB
maintenance_work_mem = 1GB
2 - the query never finishes... (it is over 3x24hours execution by now,
and I have no ideia how far from finishing it is).
The EXPLAIN plan is:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Aggregate (cost=28630195.79..28630195.80 rows=1 width=16)
-> Nested Loop (cost=26397220.49..28628236.23 rows=261275 width=1)
-> HashAggregate (cost=26397219.92..26399832.67 rows=261275 width=8)
Group Key: ssrc.sentence
-> Hash Join (cost=73253.21..23635527.52 rows=1104676957 width=8)
Hash Cond: (ssrc.document = doc.id)
-> Seq Scan on sentencesource ssrc (cost=0.00..20540394.02 rows=1151189402 width=16)
-> Hash (cost=54310.40..54310.40 rows=1515425 width=4)
-> Seq Scan on document doc (cost=0.00..54310.40 rows=1515425 width=4)
Filter: (source ~~* '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%'::text)
-> Index Scan using pk_sentencetoolcheck on sentencetoolcheck stchk (cost=0.57..8.53 rows=1 width=9)
Index Cond: (id = ssrc.sentence)
JIT:
Functions: 20
Options: Inlining true, Optimization true, Expressions true, Deforming true
(15 rows)
The rows=1515425 estimate on Seq Scan on document doc (cost=0.00..54310.40 rows=1515425 width=4) seems right.
The rows=1104676957 estimate on Hash Join (cost=73253.21..23635527.52 rows=1104676957 width=8) also seems right.
The rows=261275 on HashAggregate (cost=26397219.92..26399832.67 rows=261275 width=8) seems VERY WRONG!
I was expecting something like rows=1.0E+09 instead.
On a laptop (with just 80% of the rows, 32GB RAM, but all SSD disks),
I finish the query in a few hours (+/- 2 hours).
The EXPLAIN plan is different on the laptop:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate (cost=216688374.89..216688374.90 rows=1 width=16)
-> Nested Loop (cost=211388557.47..216686210.27 rows=288616 width=1)
-> Unique (cost=211388556.90..215889838.75 rows=288616 width=8)
-> Sort (cost=211388556.90..213639197.82 rows=900256370 width=8)
Sort Key: ssrc.sentence
-> Hash Join (cost=56351.51..28261726.31 rows=900256370 width=8)
Hash Cond: (ssrc.document = doc.id)
-> Seq Scan on sentencesource ssrc (cost=0.00..16453055.44 rows=948142144 width=16)
-> Hash (cost=38565.65..38565.65 rows=1084069 width=4)
-> Seq Scan on document doc (cost=0.00..38565.65 rows=1084069 width=4)
Filter: (source ~~* '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%'::text)
-> Index Scan using pk_sentencetoolcheck on sentencetoolcheck stchk (cost=0.57..2.76 rows=1 width=9)
Index Cond: (id = ssrc.sentence)
JIT:
Functions: 18
Options: Inlining true, Optimization true, Expressions true, Deforming true
(The Unique rows estimation is also very wrong, but at least the query finishes).
I would guess that HashAggregate is behaving very badly (using to much RAM beyond WORK_MEM, amd also badly estimating the #rows and taking forever...)
Any suggestions ?