Increasing parallelism of queries while using file fdw and partitions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Wondering if anyone had any thoughts on how to tweak my setup to get it to read many files at once instead of one at a time when using file fdw and partitions. We have a bunch of data tied up in files (each file > 4M rows, 5,000+ files per year) that I would like to be able to query directly using FDW. The files are genomic VCF format and I find that vcf_fdw ( https://github.com/ergo70/vcf_fdw ) works really well to read the data. We only want to be able to search the data as quickly as possible, no updates / deletes / ...

I gave an example below of the basic setup and the output of explain analyze. I get the same performance if I setup the table such that the thousands of files end up in one non-partitioned table or setup each file as it's own partition of the table.

I have tried increasing ( / decreasing ) the worker threads and workers, but don't see any change in the number of files open at any given time. I tried reducing the cost of parallel queries to force them to run, but can't get them to kick in.

Any ideas or anything I can try?

Thanks!

Pat

PostgreSQL:  PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
Multicorn: 1.3.5
VCF_FDW ( https://github.com/ergo70/vcf_fdw ) : 1.0.0


CREATE DATABASE variants;

CREATE EXTENSION multicorn;

CREATE SERVER multicorn_vcf FOREIGN DATA WRAPPER multicorn OPTIONS (wrapper 'vcf_fdw.VCFForeignDataWrapper');

CREATE SCHEMA vcf;

CREATE TABLE vcf.variants ( ..., species text, ... ) PARTITION BY LIST ( species );

CREATE FOREIGN TABLE vcf.human ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'human', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.human FOR VALUES IN ( 'human' );

CREATE FOREIGN TABLE vcf.dog ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'dog', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.dog FOR VALUES IN ( 'dog' );

CREATE FOREIGN TABLE vcf.cat ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'cat', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.cat FOR VALUES IN ( 'cat' );

* My real data repeats this 1000+ more times

EXPLAIN ( ANALYZE, BUFFERS ) SELECT * FROM vcf.variants WHERE chrom = '1' AND pos = 10120 LIMIT 1000;

On my real data I get the following results:
--------------------------
                                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=20.00..352020.00 rows=1000 width=347) (actual time=445.548..101709.307 rows=20 loops=1)
   ->  Append  (cost=20.00..3555200000.00 rows=10100000 width=347) (actual time=445.547..101709.285 rows=20 loops=1)
         ->  Foreign Scan on dog  (cost=20.00..3520000.00 rows=10000 width=352) (actual time=198.653..198.654 rows=0 loops=1)
               Filter: ((chrom = '1'::text) AND (pos = 10120))
         ->  Foreign Scan on cat  (cost=20.00..3520000.00 rows=10000 width=352) (actual time=111.840..111.840 rows=0 loops=1)
               Filter: ((chrom = '1'::text) AND (pos = 10120))
         ->  Foreign Scan on human  (cost=20.00..3520000.00 rows=10000 width=352) (actual time=135.050..138.534 rows=1 loops=1)
               Filter: ((chrom = '1'::text) AND (pos = 10120))
         ... repeats many more times for each partition
 Planning time: 613.815 ms
 Execution time: 101873.880 ms
(2024 rows)





[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux