Search Postgresql Archives

SELECT creates millions of temp files in a single directory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



In modern versions of postgres a simple SELECT writes a couple
of millions of individual temp files into a single directory under
pgsql_tmp.
I know of no filesystem that would take such lightly, and even
ZFS gets some problems with such extremely long directories.

What is the rationale in this behaviour and how is it supposed to
be handled?

The specific case is a database that is in use for almost 20 years.
It happened to run on a pentium-2 with 768 MB memory, there it was
slow, but did work reliably.
Now it runs on a 2660v3 with 32 GB memory, and tends to exhaust that
memory.

Database size has not increased, postgres memory configuration has
not been changed, only postgres versions were gradually upgraded
from 8 to 12.

This is the memory configuration:

shared_buffers = 40MB
temp_buffers = 20MB
work_mem = 50MB
max_stack_depth = 40MB
max_files_per_process = 200

But the actual memory consumption is 30 GB (per query!), i.e. all
of the installed memory:

 UID   PID  PPID  C PRI NI     VSZ     RSS MWCHAN   STAT  TT     TIME COMMAND
 770 53143 10252 16  20  0 9359944 7796128 zio->io_ DsJ    -  3:11.29 postgres: bareos bareos fd00::118(53471) SELECT (postgres)
 770 54334 10252 17  20  0 9279780   24388 zio->io_ DsJ    -  2:58.19 postgres: parallel worker for PID 53143    (postgres)
 770 54335 10252 17  20  0 9279780   22168 zfs      DLsJ   -  2:51.30 postgres: parallel worker for PID 53143    (postgres)

This is the situation on the filesystem:
$ data12/base # du -k 16387/
9093312 16387/

$ data12/base/pgsql_tmp # du -k *
19979644        pgsql_tmp53143.0.sharedfileset

$ data12/base/pgsql_tmp/pgsql_tmp53143.0.sharedfileset # ls -la | wc
 1264755 11382788 96271672

More than a million files in a single directory, this is
inacceptable.


This is the query:
  SELECT DISTINCT Path.PathId, File.PathId, Path.Path
  FROM Path LEFT JOIN File USING (PathId)
     LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId)
  WHERE File.PathId IS NULL AND PathHierarchy.PPathId IS NULL LIMIT 300000

These are the stats:
  Path: 464229 live tuples, 42 MB, 49 MB index
  File: 43779725 live tuples, 1 dead tuple, 7160 MB, 2971 MB index
  PathHierarchy: 380879 live tuples, 13 MB, 17 MB index

The pathhierarchy table is irrelevant to the behaviour and can be left
out.

Vacuum and Analyze has been run right before the query.

This is the structure:
CREATE TABLE IF NOT EXISTS public.path
(
    pathid integer NOT NULL DEFAULT nextval('path_pathid_seq'::regclass),
    path text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT path_pkey PRIMARY KEY (pathid)
)
CREATE TABLE IF NOT EXISTS public.file
(
    fileid bigint NOT NULL DEFAULT nextval('file_fileid_seq'::regclass),
    fileindex integer NOT NULL DEFAULT 0,
    jobid integer NOT NULL,
    pathid integer NOT NULL,
    deltaseq smallint NOT NULL DEFAULT 0,
    markid integer NOT NULL DEFAULT 0,
    fhinfo numeric(20,0) NOT NULL DEFAULT 0,
    fhnode numeric(20,0) NOT NULL DEFAULT 0,
    lstat text COLLATE pg_catalog."default" NOT NULL,
    md5 text COLLATE pg_catalog."default" NOT NULL,
    name text COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT file_pkey PRIMARY KEY (fileid)
)
CREATE INDEX IF NOT EXISTS file_jpfid_idx
    ON public.file USING btree
    (jobid ASC NULLS LAST, pathid ASC NULLS LAST, name COLLATE pg_catalog."default" ASC NULLS LAST)
    TABLESPACE pg_default;

And this does not give the explanation:
bareos=# explain SELECT DISTINCT Path.PathId, File.PathId, Path.Path  FROM Path LEFT JOIN File USING (PathId)
     LEFT JOIN PathHierarchy ON (Path.Pathid = PathHierarchy.PPathId)
  WHERE File.PathId IS NULL AND PathHierarchy.PPathId IS NULL LIMIT 300000;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1367204.20..1370204.20 rows=300000 width=67)
   ->  HashAggregate  (cost=1367204.20..1371346.45 rows=414225 width=67)
         Group Key: path.pathid, file.pathid, path.path
         ->  Gather  (cost=1225693.97..1364097.51 rows=414225 width=67)
               Workers Planned: 2
               ->  Parallel Hash Anti Join  (cost=1224693.97..1321675.01 rows=172594 width=67)
                     Hash Cond: (path.pathid = file.pathid)
                     ->  Parallel Hash Anti Join  (cost=6727.04..19953.72 rows=181359 width=63)
                           Hash Cond: (path.pathid = pathhierarchy.ppathid)
                           ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
                           ->  Parallel Hash  (cost=3926.46..3926.46 rows=224046 width=4)
                                 ->  Parallel Seq Scan on pathhierarchy  (cost=0.00..3926.46 rows=224046 width=4)
                     ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                           ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(14 rows)


Tracking this down: the first one works well, the second one bloats
memory and into millions of temp files:

bareos=# explain SELECT *  from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 Limit  (cost=9073765.86..9376924.49 rows=300000 width=193)
   ->  Gather  (cost=9073765.86..9520210.40 rows=441793 width=193)
         Workers Planned: 2
         ->  Merge Anti Join  (cost=9072765.86..9475031.10 rows=184080 width=193)
               Merge Cond: (path.pathid = file.pathid)
               ->  Sort  (cost=24345.75..24829.32 rows=193429 width=63)
                     Sort Key: path.pathid
                     ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Sort  (cost=9048403.94..9157853.22 rows=43779712 width=134)
                     Sort Key: file.pathid
                     ->  Seq Scan on file  (cost=0.00..1354253.12 rows=43779712 width=134)
(11 rows)

bareos=# explain SELECT Path.PathId, File.PathId, Path.Path  from path LEFT JOIN file USING (pathid) WHERE File.PathId IS NULL LIMIT 300000;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1218966.93..1311402.77 rows=300000 width=67)
   ->  Gather  (cost=1218966.93..1355091.95 rows=441793 width=67)
         Workers Planned: 2
         ->  Parallel Hash Anti Join  (cost=1217966.93..1309912.65 rows=184080 width=67)
               Hash Cond: (path.pathid = file.pathid)
               ->  Parallel Seq Scan on path  (cost=0.00..7361.29 rows=193429 width=63)
               ->  Parallel Hash  (cost=918690.59..918690.59 rows=18241547 width=4)
                     ->  Parallel Index Only Scan using file_jpfid_idx on file  (cost=0.56..918690.59 rows=18241547 width=4)
(8 rows)


Using these "parallel workers" was not my idea, they came creeping
along unsolicited with some version upgrade.
Switching them OFF deliberately, makes the first query five times
faster (5 minutes instead of 25 minutes), and makes the second
query use only 25'000 temp files and successfully deliver 25'000
result rows (instead of getting stuck with a million temp files),
apparently one temp file per result row now.

So,
1. it appears that these "parallel workers" are utterly
   counter-efficient whenever the working set does not fit into
   fast storage, and they need be switched off.
2. something with this anti-hash-whatever must be wrong. 25'000
   temp files does still not appear to be a good thing. But,
   delivering code that, by default, allows in excess of a million
   files be written in a single directory, that is just wrong.


Checking web ressources:

 * It seems now the normal behaviour to write millions of files,
   and people seem to just accept this:
   https://stackoverflow.com/q/61696008

 * Tables with some 50 mio rows seem now to be considered a "high
   row count":
   https://www.postgresql.org/message-id/
       38E9456A-7841-4F13-B72B-FD3137591972%40gmail.com
   They were not considered a "high row count" back in 2007; they were
   just normal then, and did run fine on machines with 1/50 of the
   memory. :(
   People seem to have been brainwashed by Web-Services and OLTP,
   and now think the working set must always fit in memory. But this
   is only one possible usecase, it is not the exclusive only one.





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux