Jeff,
Here are the two EXPLAINs for one of my simplest queries:
flows=# SET track_io_timing = on;
LOG: duration: 24.101 ms statement: SET track_io_timing = on;
SET
flows=# explain (analyze, timing off) SELECT DISTINCT
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport,
flows-#     COUNT(*) AS conversation,
flows-#     SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-#     flowscompact,
flows-#     mynetworks
flows-# WHERE
flows-#     mynetworks.ipaddr >>= flowscompact.srcaddr
flows-#     AND dstaddr IN
flows-#     (
flows(#         SELECT
flows(#             dstaddr
flows(#         FROM
flows(#             dstexterne
flows(#     )
flows-# GROUP BY
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport
flows-# ORDER BY
flows-#     mbytes DESC LIMIT 50;
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.3", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.4", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.5", size 639696896
LOG: duration: 2632108.352 ms statement: explain (analyze, timing off) SELECT DISTINCT srcaddr, dstaddr, dstport, COUNT(*) AS conversation, SUM(doctets) / 1024 / 1024 AS mbytes FROM flowscompact, mynetworks WHERE mynetworks.ipaddr >>= flowscompact.srcaddr AND dstaddr IN (SELECT dstaddr FROM dstexterne) GROUP BY srcaddr, dstaddr, dstport ORDER BY mbytes DESC LIMIT 50;
                                                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual rows=50 loops=1)
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual rows=50 loops=1)
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual rows=45499045 loops=10)
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual rows=54155970 loops=10)
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 48679396
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual rows=2 loops=5415597040)
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual rows=4 loops=10)
 Planning time: 62.066 ms
 Execution time: 2631923.716 ms
(33 rows)

flows=# explain (analyze, buffers) SELECT DISTINCT
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport,
flows-#     COUNT(*) AS conversation,
flows-#     SUM(doctets) / 1024 / 1024 AS mbytes
flows-# FROM
flows-#     flowscompact,
flows-#     mynetworks
flows-# WHERE
flows-#     mynetworks.ipaddr >>= flowscompact.srcaddr
flows-#     AND dstaddr IN
flows-#     (
flows(#         SELECT
flows(#             dstaddr
flows(#         FROM
flows(#             dstexterne
flows(#     )
flows-# GROUP BY
flows-#     srcaddr,
flows-#     dstaddr,
flows-#     dstport
flows-# ORDER BY
flows-#     mbytes DESC LIMIT 50;
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824
LOG: temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896
LOG: duration: 2765020.327 ms statement: explain (analyze, buffers) SELECT DISTINCT srcaddr, dstaddr, dstport, COUNT(*) AS conversation, SUM(doctets) / 1024 / 1024 AS mbytes FROM flowscompact, mynetworks WHERE mynetworks.ipaddr >>= flowscompact.srcaddr AND dstaddr IN (SELECT dstaddr FROM dstexterne) GROUP BY srcaddr, dstaddr, dstport ORDER BY mbytes DESC LIMIT 50;
                                                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)
   Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
   I/O Timings: read=5323746.860
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)
         Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
         I/O Timings: read=5323746.860
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
               I/O Timings: read=5323746.860
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                     I/O Timings: read=5323746.860
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                           I/O Timings: read=5323746.860
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 Buffers: shared hit=1116590559 read=15851133
                                 I/O Timings: read=5323746.860
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=1116588309 read=15851133
                                       I/O Timings: read=5323746.860
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)
                                             Buffers: shared hit=996551813 read=15851133
                                             I/O Timings: read=5323746.860
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)
                                                   Buffers: shared hit=1634 read=15851133
                                                   I/O Timings: read=5323746.860
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 59971474
                                                   Buffers: shared hit=996550152
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             Buffers: shared hit=120036496
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=120036496
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         Buffers: shared hit=120036496
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                               Buffers: shared hit=120036459
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)
                                                               Buffers: shared hit=10
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)
                                                                     Buffers: shared hit=10
 Planning time: 6.689 ms
 Execution time: 2764860.853 ms
(58 rows)
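For reference, here is my back-of-envelope reading of the buffers numbers in the second plan (assuming the default 8 kB block size, and keeping in mind that I/O Timings are summed across the 10 backend processes):

15,851,133 blocks read from disk x 8 kB          ~ 121 GiB
121 GiB / 2,765 s of execution time              ~ 45 MiB/s average read throughput
5,323,747 ms of read time / 15,851,133 reads     ~ 0.34 ms per block read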
Regarding "Also using dstat I can see that iowait time is at about 25%", I don't think the server was doing anything else. If it is important, I can repeat the benchmarks.
Thanks!
Charles
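P.S. Regarding the kernel readahead suggestion: if I understand it correctly, I would check and raise it along these lines (untested on my side so far, and /dev/sdb is only a placeholder for the device backing the flows tablespace):

sudo blockdev --getra /dev/sdb        # show current readahead, in 512-byte sectors
sudo blockdev --setra 4096 /dev/sdb   # raise it to 2 MiB for a test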
On Wed, Jul 12, 2017 at 2:39 AM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles.nadeau@xxxxxxxxx> wrote:

> Jeff,
>
> I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/ :
>
> #!/bin/bash
> #Random throughput
> echo "Random throughput"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
> #Random IOPS
> echo "Random IOPS"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
> #Sequential throughput
> echo "Sequential throughput"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
> #Sequential IOPS
> echo "Sequential IOPS"
> sync
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

I don't think any of those are directly relevant to PostgreSQL, as it doesn't use direct IO, doesn't use libaio, and is rarely going to get anywhere near an iodepth of 256. So the best they can do is put a theoretical ceiling on the performance. Also, random IO with a 4MB stride doesn't make any sense from a PostgreSQL perspective.

> Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%.

I would have expected it to do better than that. Maybe you could increase the kernel readahead setting; I've found the default to be much too small. But it doesn't make much difference to you, as you appear to be doing random IO in your queries, not sequential.

> Could you suggest another way to benchmark random reads?

Your 1100 IOPS times the 8kB block size gives about 8MB/s of throughput, which is close to what you report. So I would instead focus on tuning your actual queries. You say the problem is not query-dependent, but I think that just means all the queries you looked at are similar. If you looked at a query that can't use indexes, like count(unindexed_column) from biggest_table, you would find it doing much more IO than 4MB/s.

Can you pick the simplest query you actually care about, and post both an "explain (analyze, timing off)" and an "explain (analyze, buffers)" for it? (Preferably turning "track_io_timing" on first.)

One other question I had: you said you had "2x Intel Xeon E5550", which should be 8 CPUs (or 16, if the hyperthreads are reported as separate CPUs). But you also said "Also using dstat I can see that iowait time is at about 25%". Usually, if there is only one thing going on on the server, iowait won't be more than the reciprocal of the number of CPUs. Is the server busy doing other stuff at the same time you are benchmarking it?

Cheers,

Jeff
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/