Re: Very poor read performance, query independent

Charles Nadeau <charles.nadeau@xxxxxxxxx> · Sun, 16 Jul 2017 11:20:57 +0200

Igor,

I set the work_mem to 12GB, restarted postrgresql, repeat the same "explain (analyze, buffers)..." as above and the read throughput was very low, at most 10MB/s. All the sorting operation are now done in memory.
I lowered the work_mem back to 6GB, restarted postrgresql, repeat the same "explain (analyze, buffers)..." as above and the read throughput was very low, at most 10MB/s. The 1st sorting operation is still done in memory, the second one to disk. I think I need about 4GB to do all sort in memory.
One thing I remember from Friday's "explain (analyze, buffers)...". I set temp_buffer and work_mem to 6GB as I read both your message one after the other. So I decided to try again: I then set work_mem=6GB, temp_buffers=6GB, restarted postrgresql, repeat the same "explain (analyze, buffers)..." as above and the read throughput was very low again, at most 10MB/s. The 1st sorting operation is still done in memory, the second one to disk.
For the last test, I brought back work_mem and temp_buffer to their original value. The read throughput is still low.
In all cases, about 20 minutes after the query starts, it start writing to disk furiously. Here are the peak values as reported by dstat:
work_mem=12GB, temp_buffers=8MB: peak of 393MB/s
work_mem=6GB, temp_buffers=8MB: peak of 579MB/s
work_mem=6GB, temp_buffers=6GB: peak of 418MB/s
work_mem=1468006kB and temp_buffers=8MB, peak of 61MB/s
Also, at peak write, the server alomost ran of memory: the cache almost goes to 0 and it starts swapping.

This query is a bit extreme in terms of sorting. Maybe I should try to benchmark while counting all the records of my biggest table like Mark Kirkwood suggested. I'll do some more tests and post the results back to the mailing list.
Thanks!

Charles

On Fri, Jul 14, 2017 at 9:13 PM, Igor Neyman <ineyman@xxxxxxxxxxxxxx> wrote:

From: Charles Nadeau [mailto:charles.nadeau@gmail.com]

Sent: Friday, July 14, 2017 11:35 AM

To: Igor Neyman <ineyman@xxxxxxxxxxxxxx>

Cc: Jeff Janes <jeff.janes@xxxxxxxxx>; pgsql-performance@postgresql.org

Subject: Re:  Very poor read performance, query independent

Igor,

Initially temp_buffer was left to its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary files space. Should I set temp_buffer to 21GB?

Here is the explain you requested with work_mem set to 6GB:

flows=# set work_mem='6GB';

SET

flows=# explain (analyze, buffers) SELECT DISTINCT

   srcaddr,

   dstaddr,

   dstport,

   COUNT(*) AS conversation,

   SUM(doctets) / 1024 / 1024 AS mbytes

FROM

   flowscompact,

   mynetworks

WHERE

   mynetworks.ipaddr >>= flowscompact.srcaddr

   AND dstaddr IN

   (

      SELECT

         dstaddr

      FROM

         dstexterne

   )

GROUP BY

   srcaddr,

   dstaddr,

   dstport

ORDER BY

   mbytes DESC LIMIT 50;

                                                                                           QUERY PLAN                                                                                           

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)

   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)

         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)

               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))

               Sort Method: quicksort  Memory: 654395kB

               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)

                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport

                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)

                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport

                           Sort Method: external merge  Disk: 3049216kB

                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)

                                 Workers Planned: 12

                                 Workers Launched: 12

                                 Buffers: shared hit=728798037 read=82974833

                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)

                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)

                                       Buffers: shared hit=728795193 read=82974833

                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)

                                             Buffers: shared hit=590692229 read=14991777

                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)

                                                   Buffers: shared hit=860990 read=14991777

                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)

                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)

                                                   Heap Fetches: 0

                                                   Buffers: shared hit=589831203

                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)

                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB

                                             Buffers: shared hit=138102964 read=67983056

                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)

                                                   Group Key: flows_1.dstaddr

                                                   Buffers: shared hit=138102964 read=67983056

                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)

                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)

                                                         Rows Removed by Join Filter: 503353617

                                                         Buffers: shared hit=138102964 read=67983056

                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)

                                                               Buffers: shared hit=138102915 read=67983056

                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)

                                                               Buffers: shared hit=13

                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)

                                                                     Buffers: shared hit=13

 Planning time: 0.941 ms

 Execution time: 2228345.171 ms

(48 rows)

With a work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the i/o wait was much lower, hovering aroun 3% then it jumped 45% until almost the end of the query. 

flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are use often in other queries. flowscompact is a view built on table flows (having about 590 million rows), it only keeps
 the most often used fields.

flows=# \d+ flowscompact;

                        View "public.flowscompact"

  Column   |           Type           | Modifiers | Storage | Description 

-----------+--------------------------+-----------+---------+-------------

 flow_id   | bigint                   |           | plain   | 

 sysuptime | bigint                   |           | plain   | 

 exaddr    | ip4                      |           | plain   | 

 dpkts     | integer                  |           | plain   | 

 doctets   | bigint                   |           | plain   | 

 first     | bigint                   |           | plain   | 

 last      | bigint                   |           | plain   | 

 srcaddr   | ip4                      |           | plain   | 

 dstaddr   | ip4                      |           | plain   | 

 srcport   | integer                  |           | plain   | 

 dstport   | integer                  |           | plain   | 

 prot      | smallint                 |           | plain   | 

 tos       | smallint                 |           | plain   | 

 tcp_flags | smallint                 |           | plain   | 

 timestamp | timestamp with time zone |           | plain   | 

View definition:

 SELECT flowstimestamp.flow_id,

    flowstimestamp.sysuptime,

    flowstimestamp.exaddr,

    flowstimestamp.dpkts,

    flowstimestamp.doctets,

    flowstimestamp.first,

    flowstimestamp.last,

    flowstimestamp.srcaddr,

    flowstimestamp.dstaddr,

    flowstimestamp.srcport,

    flowstimestamp.dstport,

    flowstimestamp.prot,

    flowstimestamp.tos,

    flowstimestamp.tcp_flags,

    flowstimestamp."timestamp"

   FROM flowstimestamp;

mynetworks is a table having one column and 4 rows; it contains a list of our network networks:

flows=# select * from mynetworks;

     ipaddr     

----------------

 192.168.0.0/24

 10.112.12.0/30

 10.112.12.4/30

 10.112.12.8/30

(4 row)

flows=# \d+ mynetworks;

                    Table "public.mynetworks"

 Column | Type | Modifiers | Storage | Stats target | Description 

--------+------+-----------+---------+--------------+-------------

 ipaddr | ip4r |           | plain   |              | 

Indexes:

    "mynetworks_ipaddr_idx" gist (ipaddr)

dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.

flows=# \d+ dstexterne;

              View "public.dstexterne"

 Column  | Type | Modifiers | Storage | Description 

---------+------+-----------+---------+-------------

 dstaddr | ip4  |           | plain   | 

View definition:

 SELECT DISTINCT flowscompact.dstaddr

   FROM flowscompact

     LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r

  WHERE mynetworks.ipaddr IS NULL;

Thanks!

Charles

Charles,

Don’t change temp_buffers.
Try to increase work_mem even more, say work_mem=’12GB’, because it’s still using disk for sorting (starting around 20^th minute as you noticed).
See if this:
“Sort Method: external merge  Disk: 3049216kB”
goes away.

Igor

-- 
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/