Wondering if anyone had any thoughts on how to tweak my setup to get it
to read many files at once instead of one at a time when using file fdw
and partitions. We have a bunch of data tied up in files (each file > 4M
rows, 5,000+ files per year) that I would like to be able to query
directly using FDW. The files are genomic VCF format and I find that
vcf_fdw ( https://github.com/ergo70/vcf_fdw ) works really well to read
the data. We only want to be able to search the data as quickly as
possible, no updates / deletes / ...
I gave an example below of the basic setup and the output of explain
analyze. I get the same performance if I setup the table such that the
thousands of files end up in one non-partitioned table or setup each
file as it's own partition of the table.
I have tried increasing ( / decreasing ) the worker threads and workers,
but don't see any change in the number of files open at any given time.
I tried reducing the cost of parallel queries to force them to run, but
can't get them to kick in.
Any ideas or anything I can try?
Thanks!
Pat
PostgreSQL: PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
Multicorn: 1.3.5
VCF_FDW ( https://github.com/ergo70/vcf_fdw ) : 1.0.0
CREATE DATABASE variants;
CREATE EXTENSION multicorn;
CREATE SERVER multicorn_vcf FOREIGN DATA WRAPPER multicorn OPTIONS (wrapper 'vcf_fdw.VCFForeignDataWrapper');
CREATE SCHEMA vcf;
CREATE TABLE vcf.variants ( ..., species text, ... ) PARTITION BY LIST ( species );
CREATE FOREIGN TABLE vcf.human ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'human', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.human FOR VALUES IN ( 'human' );
CREATE FOREIGN TABLE vcf.dog ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'dog', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.dog FOR VALUES IN ( 'dog' );
CREATE FOREIGN TABLE vcf.cat ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'cat', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.cat FOR VALUES IN ( 'cat' );
* My real data repeats this 1000+ more times
EXPLAIN ( ANALYZE, BUFFERS ) SELECT * FROM vcf.variants WHERE chrom = '1' AND pos = 10120 LIMIT 1000;
On my real data I get the following results:
--------------------------
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=20.00..352020.00 rows=1000 width=347) (actual time=445.548..101709.307 rows=20 loops=1)
-> Append (cost=20.00..3555200000.00 rows=10100000 width=347) (actual time=445.547..101709.285 rows=20 loops=1)
-> Foreign Scan on dog (cost=20.00..3520000.00 rows=10000 width=352) (actual time=198.653..198.654 rows=0 loops=1)
Filter: ((chrom = '1'::text) AND (pos = 10120))
-> Foreign Scan on cat (cost=20.00..3520000.00 rows=10000 width=352) (actual time=111.840..111.840 rows=0 loops=1)
Filter: ((chrom = '1'::text) AND (pos = 10120))
-> Foreign Scan on human (cost=20.00..3520000.00 rows=10000 width=352) (actual time=135.050..138.534 rows=1 loops=1)
Filter: ((chrom = '1'::text) AND (pos = 10120))
... repeats many more times for each partition
Planning time: 613.815 ms
Execution time: 101873.880 ms
(2024 rows)