Re: Switching to MQ by default may generate some bug reports

On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
> >> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
> >> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
> >> ext4 as a filesystem. The same is not true for XFS so the filesystem
> >> matters.
> >> 
> > 
> > Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
> > soon as I can, thanks.
> > 
> > 
> 
> I've run this test and tried to further investigate this regression.
> For the moment, the gist seems to be that blk-mq plays an important
> role, not only with bfq (unless I'm considering the wrong numbers).
> Even if your main purpose in this thread was just to give a heads-up,
> I guess it may be useful to share what I have found out.  In addition,
> I want to ask for some help, to try to get closer to the possible
> causes of at least this regression.  If you think it would be better
> to open a new thread on this stuff, I'll do it.
> 

I don't think it's necessary unless Christoph or Jens object and I doubt
they will.

> First, I got mixed results on my system. 

For what it's worth, this is standard. In my experience, IO benchmarks
are always multi-modal, particularly on rotary storage. Cases of universal
win or universal loss for a scheduler or set of tuning are rare.

> I'll focus only on the
> case where mq-bfq-tput achieves its worst relative performance w.r.t.
> cfq, which happens with 64 clients.  Still, also in this case
> mq-bfq is better than cfq in all average values, but Flush.  I don't
> know which are the best/right values to look at, so, here's the final
> report for both schedulers:
> 

For what it's worth, it has often been observed that dbench overall
performance was dominated by flush costs. This is also true for the
standard reported throughput figures rather than the modified load file
elapsed time that mmtests reports. In dbench3 it was even worse where the
"performance" was dominated by whether the temporary files were deleted
before writeback started.

> CFQ
> 
>  Operation                Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Flush                    13120    20.069   348.594
>  Close                   133696     0.008    14.642
>  LockX                      512     0.009     0.059
>  Rename                    7552     1.857   415.418
>  ReadX                   270720     0.141   535.632
>  WriteX                   89591   421.961  6363.271
>  Unlink                   34048     1.281   662.467
>  UnlockX                    512     0.007     0.057
>  FIND_FIRST               62016     0.086    25.060
>  SET_FILE_INFORMATION     15616     0.995   176.621
>  QUERY_FILE_INFORMATION   28734     0.004     1.372
>  QUERY_PATH_INFORMATION  170240     0.163   820.292
>  QUERY_FS_INFORMATION     28736     0.017     4.110
>  NTCreateX               178688     0.437   905.567
> 
> MQ-BFQ-TPUT
> 
>  Operation                Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Flush                    13504    75.828 11196.035
>  Close                   136896     0.004     3.855
>  LockX                      640     0.005     0.031
>  Rename                    8064     1.020   288.989
>  ReadX                   297600     0.081   685.850
>  WriteX                   93515   391.637 12681.517
>  Unlink                   34880     0.500   146.928
>  UnlockX                    640     0.004     0.032
>  FIND_FIRST               63680     0.045   222.491
>  SET_FILE_INFORMATION     16000     0.436   686.115
>  QUERY_FILE_INFORMATION   30464     0.003     0.773
>  QUERY_PATH_INFORMATION  175552     0.044   148.449
>  QUERY_FS_INFORMATION     29888     0.009     1.984
>  NTCreateX               183152     0.289   300.867
> 
> Are these results in line with yours for this test?
> 

Very broadly speaking yes, but it varies. On a small machine with only a
few CPUs, the differences in flush latency are visible but not as
dramatic. On a machine that tops out with 32 CPUs, it is more noticeable. On
the one machine I have that topped out with CFQ/BFQ at 64 threads, the
latency of flush is vaguely similar:

                              CFQ                 BFQ            BFQ-TPUT
latency avg-Flush-64       287.05 (   0.00%)   389.14 ( -35.57%)   349.90 ( -21.90%)
latency avg-Close-64         0.00 (   0.00%)     0.00 ( -33.33%)     0.00 (   0.00%)
latency avg-LockX-64         0.01 (   0.00%)     0.01 ( -16.67%)     0.01 (   0.00%)
latency avg-Rename-64        0.18 (   0.00%)     0.21 ( -16.39%)     0.18 (   3.28%)
latency avg-ReadX-64         0.10 (   0.00%)     0.15 ( -40.95%)     0.15 ( -40.95%)
latency avg-WriteX-64        0.86 (   0.00%)     0.81 (   6.18%)     0.74 (  13.75%)
latency avg-Unlink-64        1.49 (   0.00%)     1.52 (  -2.28%)     1.14 (  23.69%)
latency avg-UnlockX-64       0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)
latency avg-NTCreateX-64     0.26 (   0.00%)     0.30 ( -16.15%)     0.21 (  19.62%)

So, different figures to yours but the general observation that flush
latency is higher holds.
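
For anyone reading the percentages in the table above: they appear to be
(baseline - value) / baseline with CFQ as the baseline, so a negative number
means higher (worse) latency than CFQ. That formula is inferred from the
figures shown here, not taken from the mmtests source, and the function name
is made up for illustration. A minimal sketch in Python:

```python
def latency_delta_pct(baseline: float, value: float) -> float:
    """Percentage change versus baseline; negative means higher latency."""
    return (baseline - value) / baseline * 100.0


# avg-Flush-64 figures copied from the table (CFQ is the baseline).
cfq, bfq, bfq_tput = 287.05, 389.14, 349.90

print(f"BFQ      {latency_delta_pct(cfq, bfq):7.2f}%")
print(f"BFQ-TPUT {latency_delta_pct(cfq, bfq_tput):7.2f}%")
```

The Flush row is reproduced to two decimals this way. Rows with very small
values will not match exactly because the table prints latencies rounded to
two decimals while the percentages were presumably computed from unrounded
data (hence a -33.33% difference between two values both shown as 0.00).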

> Anyway, to investigate this regression more in depth, I took two
> further steps.  First, I repeated the same test with bfq-sq, my
> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
> from the changes needed for bfq to live in blk-mq).  I got:
> 
> <SNIP>
> 
> So, with both bfq and deadline there seems to be a serious regression,
> especially on MaxLat, when moving from legacy block to blk-mq.  The
> regression is much worse with deadline, as legacy-deadline has the
> lowest max latency among all the schedulers, whereas mq-deadline has
> the highest one.
> 

I wouldn't worry too much about max latency simply because a large
outlier can be due to multiple factors and it will be variable.
However, I accept that deadline is not necessarily great either.
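
One way to see how little a single maximum says about the run as a whole is
to drop it and recompute the average. The sketch below does that for the
MQ-BFQ-TPUT Flush row quoted earlier (Count, AvgLat, MaxLat). It assumes
AvgLat is a plain arithmetic mean over Count samples, which the dbench output
does not state explicitly, and the helper name is hypothetical:

```python
def mean_without_max(count: int, avg: float, max_val: float) -> float:
    """Mean of the remaining samples after dropping the single largest one."""
    return (count * avg - max_val) / (count - 1)


# MQ-BFQ-TPUT Flush row from the quoted report: Count, AvgLat, MaxLat.
count, avg, max_lat = 13504, 75.828, 11196.035

trimmed = mean_without_max(count, avg, max_lat)
print(f"average with the max sample:    {avg:.3f}")
print(f"average without the max sample: {trimmed:.3f}")
print(f"max / average ratio:            {max_lat / avg:.0f}x")
```

Under that assumption the worst Flush sample is over a hundred times the
average, yet removing it shifts the average by only about 1%, so the average
regression is not explained by that one outlier.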

> Regardless of the actual culprit of this regression, I would like to
> investigate further this issue.  In this respect, I would like to ask
> for a little help.  I would like to isolate the workloads generating
> the highest latencies.  To this purpose, I had a look at the loadfile
> client-tiny.txt, and I still have a doubt: is every item in the
> loadfile executed somehow several times (for each value of the number
> of clients), or is it executed only once?  More precisely, IIUC, for
> each operation reported in the above results, there are several items
> (lines) in the loadfile.  So, is each of these items executed only
> once?
> 

The load file is executed multiple times. The normal loadfile was
basically just the same commands, or very similar commands, run multiple
times within a single load file. This made the workload too sensitive to
the exact time the workload finished and too coarse.

> I'm asking because, if it is executed only once, then I guess I can
> find the critical tasks more easily.  Finally, if it is actually
> executed only once, is it expected that the latency for such a task is
> one order of magnitude higher than that of the average latency for
> that group of tasks?  I mean, is such a task intrinsically much
> heavier, and then expectedly much longer, or is the fact that latency
> is much higher for this task a sign that something in the kernel
> misbehaves for that task?
> 

I don't think it's quite as easily isolated. It's all the operations in
combination that replicate the behaviour. If it was just a single operation
like "fsync" then it would be fairly straight-forward but the full mix
is relevant as it matters when writeback kicks off, when merges happen,
how much dirty data was outstanding when writeback or sync started etc.

I see you've made other responses to the thread so rather than respond
individually:

o I've queued a subset of tests with Ming's v3 patchset as that was the
  latest branch at the time I looked. It'll take quite some time to execute
  as the grid I use to collect data is backlogged with other work

o I've included pgioperf this time because it is good at demonstrating
  oddities related to fsync. Granted it's mostly simulating a database
  workload that is typically recommended to use the deadline scheduler but I
  think it's still a useful demonstration

o If you want a patch set queued that may improve workload pattern
  detection for dbench then I can add that to the grid with the caveat that
  results take time. It'll be a blind test as I'm not actively debugging
  IO-related problems right now.

o I'll keep an eye out for other workloads that demonstrate empirically
  better performance given that a stopwatch and desktop performance is
  tough to quantify even though I'm typically working in other areas. While
  I don't spend a lot of time on IO-related problems, it would still
  be preferred if switching to MQ by default was a safe option so I'm
  interested enough to keep it in mind.

-- 
Mel Gorman
SUSE Labs


