Re: Very big insert/join performance problem (bacula)

This mail contains the requested plans:

I've run them with the different configurations, since I had already gone to the effort of setting the whole thing up :)
Stats were updated between all runs. Each timing is for the first run of the query (that's what we get in production with Bacula).
And I added the executor stats, just in case ...
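
In case it helps to reproduce this, refreshing the stats between runs was just a plain ANALYZE on the three tables involved (only a sketch, from memory):

ANALYZE path;
ANALYZE filename;
ANALYZE temp_mc;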

By the way, I should mention that the whole thing runs over DRBD, but with two gigabit links between the master and the slave.
I tried deactivating replication when things got really slow (despooling took 24 hours); it changed nothing (sorts were a bit faster,
around 20%). The server has 12 GB of RAM and one quad-core Xeon E5335.

PostgreSQL starts hashing filename a bit later than I said in the first mail, because the table has grown (it was around 30-40 million rows the last time I ran the tests).

This is the query (temp_mc is the table I created for these tests ...):

explain ANALYZE SELECT batch.FileIndex,
              batch.JobId,
              Path.PathId,
              Filename.FilenameId,
              batch.LStat,
              batch.MD5
  FROM temp_mc AS batch
  JOIN Path ON (batch.Path = Path.Path)
  JOIN Filename ON (batch.Name = Filename.Name);
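
For completeness, the settings for each run were changed at the session level just before the EXPLAIN, along these lines (only a sketch; the actual values for each plan are given in its header):

SET seq_page_cost = 0.01;    -- 1 for plans 1, 3 and 4
SET random_page_cost = 0.02; -- 4 for plans 1 and 4, 2 for plan 3
SET work_mem = '512MB';      -- only raised for plan 4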

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 1
around 1 million records to insert, seq_page_cost 1, random_page_cost 4

LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!       380.143452 elapsed 79.000938 user 44.386774 system sec
!       [415.785985 user 155.733732 sys total]
!       15848728/12934936 [24352752/50913184] filesystem blocks in/out
!       0/44188 [86/987512] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       93812/40706 [405069/184511] voluntary/involuntary context switches
! buffer usage stats:
!       Shared blocks:     877336 read,          0 written, buffer hit rate = 6.75%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written

 Hash Join  (cost=3923929.71..5131377.91 rows=1286440 width=91) (actual time=234021.194..380018.709 rows=1286440 loops=1)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880140.87..1286265.62 rows=1286440 width=102) (actual time=23184.959..102400.782 rows=1286440 loops=1)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189) (actual time=0.007..342.396 rows=1286440 loops=1)
         ->  Hash  (cost=425486.72..425486.72 rows=16746972 width=92) (actual time=23184.196..23184.196 rows=16732049 loops=1)
               ->  Seq Scan on path  (cost=0.00..425486.72 rows=16746972 width=92) (actual time=0.004..7318.850 rows=16732049 loops=1)
   ->  Hash  (cost=1436976.15..1436976.15 rows=79104615 width=35) (actual time=210831.840..210831.840 rows=79094418 loops=1)
         ->  Seq Scan on filename  (cost=0.00..1436976.15 rows=79104615 width=35) (actual time=46.324..148887.662 rows=79094418 loops=1)
 Total runtime: 380136.601 ms

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 2
the same insert, with seq_page_cost set to 0.01 and random_page_cost to 0.02

DETAIL:  ! system usage stats:
!       42.378039 elapsed 28.277767 user 12.192762 system sec
!       [471.865489 user 180.499280 sys total]
!       0/4072368 [24792848/59059032] filesystem blocks in/out
!       0/0 [86/989858] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       1061/9131 [429738/200320] voluntary/involuntary context switches
! buffer usage stats:
!       Shared blocks:     251574 read,          0 written, buffer hit rate = 96.27%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written
LOG:  duration: 42378.373 ms  statement: 
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=381840.21..1012047.92 rows=1286440 width=91) (actual time=20284.387..42242.955 rows=1286440 loops=1)
   Hash Cond: (batch.path = path.path)
   ->  Nested Loop  (cost=0.00..583683.91 rows=1286440 width=178) (actual time=0.026..10333.636 rows=1286440 loops=1)
         ->  Seq Scan on temp_mc batch  (cost=0.00..13231.26 rows=1286440 width=189) (actual time=0.008..380.361 rows=1286440 loops=1)
         ->  Index Scan using filename_name_idx on filename  (cost=0.00..0.43 rows=1 width=35) (actual time=0.006..0.007 rows=1 loops=1286440)
               Index Cond: (filename.name = batch.name)
   ->  Hash  (cost=170049.89..170049.89 rows=16746972 width=92) (actual time=20280.729..20280.729 rows=16732049 loops=1)
         ->  Seq Scan on path  (cost=0.00..170049.89 rows=16746972 width=92) (actual time=0.005..4560.872 rows=16732049 loops=1)
 Total runtime: 42371.362 ms


The thing is that this query is ten times faster, but that's not the main point: it stays reasonably fast even when there are
20 instances of it running simultaneously. Of course, as it's faster, it also has less tendency to pile up than the other one does.

When I get 10-20 of the first one running at the same time, the queries get extremely slow (I guess they are fighting
for access to the sort disk, because I see a lot of smaller IOs instead of the big, nice IOs I see when only one of
these queries runs). The IO subsystem seems to degrade badly when there is a lot of concurrent activity on this server.
For instance, last weekend we had two 8 million file backups running simultaneously, with the hash join plan. It took 24 hours for them to complete.
Had they been alone on the server, each would have taken around 1 hour.


Of course, with these smaller cost values, there is still a batch size at which the plan goes back to the first one.
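
If nothing else works, I suppose I could try to pin the second plan by disabling hash joins for the session (just an idea, not tested; and it would also affect the path join, which does benefit from the hash, so it may well not be a win):

SET enable_hashjoin = off;
-- run the despooling query here
RESET enable_hashjoin;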

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 3
seq_page_cost to 1, random_page_cost to 2. Plan is the same as Plan 1.

-------------------------------------------------------------------------------------
 Hash Join  (cost=3923961.69..5131416.88 rows=1286440 width=91)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880144.31..1286270.06 rows=1286440 width=102)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189)
         ->  Hash  (cost=425488.36..425488.36 rows=16747036 width=92)
               ->  Seq Scan on path  (cost=0.00..425488.36 rows=16747036 width=92)
   ->  Hash  (cost=1436989.50..1436989.50 rows=79105350 width=35)
         ->  Seq Scan on filename  (cost=0.00..1436989.50 rows=79105350 width=35)
(9 rows)

++++++++++++++++++++++++++++++++++++++++++++++++++++
Plan 4
seq_page_cost at 1, random_page_cost back to 4, work_mem raised to 512MB. Same plan as Plan 1.
The estimated cost hasn't changed. Is this normal?

-------------------------------------------------------------------------------------
 Hash Join  (cost=3923961.69..5131416.88 rows=1286440 width=91)
   Hash Cond: (batch.name = filename.name)
   ->  Hash Join  (cost=880144.31..1286270.06 rows=1286440 width=102)
         Hash Cond: (batch.path = path.path)
         ->  Seq Scan on temp_mc batch  (cost=0.00..49550.40 rows=1286440 width=189)
         ->  Hash  (cost=425488.36..425488.36 rows=16747036 width=92)
               ->  Seq Scan on path  (cost=0.00..425488.36 rows=16747036 width=92)
   ->  Hash  (cost=1436989.50..1436989.50 rows=79105350 width=35)
         ->  Seq Scan on filename  (cost=0.00..1436989.50 rows=79105350 width=35)
(9 rows)

Maybe this one would scale a bit better, as there would be fewer sort files? I couldn't execute it and get reliable times (sorry, the production period has started).
If necessary, I can run it again tomorrow. I had to cancel the query after more than 15 minutes, to let the server do its regular work.



There are other things I am thinking of: maybe it would be better to have the sort space on another (non-DRBD'ed) raid set? We have a fairly
cheap setup for the database right now, and I think this might help it scale better. I can get a filesystem in another volume group, which is not used much for now.
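
If I go that route, I suppose it is just a matter of creating a tablespace on that volume and pointing temporary files at it, something like this (the path is made up):

CREATE TABLESPACE temp_space LOCATION '/mnt/vg1/pgsql_tmp';
SET temp_tablespaces = 'temp_space';  -- or set it globally in postgresql.conf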

Anyway, thanks for any ideas you may have on this.

Marc.

