> On 08 Aug 2017, at 12:30, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>> matters.
>>>>
>>>
>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>> soon as I can, thanks.
>>>
>>>
>>
>> I've run this test and tried to further investigate this regression.
>> For the moment, the gist seems to be that blk-mq plays an important
>> role, not only with bfq (unless I'm considering the wrong numbers).
>> Even if your main purpose in this thread was just to give a heads-up,
>> I guess it may be useful to share what I have found out. In addition,
>> I want to ask for some help, to try to get closer to the possible
>> causes of at least this regression. If you think it would be better
>> to open a new thread on this stuff, I'll do it.
>>
>
> I don't think it's necessary unless Christoph or Jens object and I doubt
> they will.
>
>> First, I got mixed results on my system.
>
> For what it's worth, this is standard. In my experience, IO benchmarks
> are always multi-modal, particularly on rotary storage. Cases of universal
> win or universal loss for a scheduler or set of tuning are rare.
>
>> I'll focus only on the
>> case where mq-bfq-tput achieves its worst relative performance w.r.t.
>> cfq, which happens with 64 clients. Even in this case,
>> mq-bfq is better than cfq in all average values except Flush. I don't
>> know which are the best/right values to look at, so here's the final
>> report for both schedulers:
>>
>
> For what it's worth, it has often been observed that dbench overall
> performance was dominated by flush costs.
> This is also true for the
> standard reported throughput figures rather than the modified load file
> elapsed time that mmtests reports. In dbench3 it was even worse where the
> "performance" was dominated by whether the temporary files were deleted
> before writeback started.
>
>> CFQ
>>
>> Operation                Count   AvgLat     MaxLat
>> --------------------------------------------------
>> Flush                    13120   20.069    348.594
>> Close                   133696    0.008     14.642
>> LockX                      512    0.009      0.059
>> Rename                    7552    1.857    415.418
>> ReadX                   270720    0.141    535.632
>> WriteX                   89591  421.961   6363.271
>> Unlink                   34048    1.281    662.467
>> UnlockX                    512    0.007      0.057
>> FIND_FIRST               62016    0.086     25.060
>> SET_FILE_INFORMATION     15616    0.995    176.621
>> QUERY_FILE_INFORMATION   28734    0.004      1.372
>> QUERY_PATH_INFORMATION  170240    0.163    820.292
>> QUERY_FS_INFORMATION     28736    0.017      4.110
>> NTCreateX               178688    0.437    905.567
>>
>> MQ-BFQ-TPUT
>>
>> Operation                Count   AvgLat     MaxLat
>> --------------------------------------------------
>> Flush                    13504   75.828  11196.035
>> Close                   136896    0.004      3.855
>> LockX                      640    0.005      0.031
>> Rename                    8064    1.020    288.989
>> ReadX                   297600    0.081    685.850
>> WriteX                   93515  391.637  12681.517
>> Unlink                   34880    0.500    146.928
>> UnlockX                    640    0.004      0.032
>> FIND_FIRST               63680    0.045    222.491
>> SET_FILE_INFORMATION     16000    0.436    686.115
>> QUERY_FILE_INFORMATION   30464    0.003      0.773
>> QUERY_PATH_INFORMATION  175552    0.044    148.449
>> QUERY_FS_INFORMATION     29888    0.009      1.984
>> NTCreateX               183152    0.289    300.867
>>
>> Are these results in line with yours for this test?
>>
>
> Very broadly speaking yes, but it varies. On a small machine, the differences
> in flush latency are visible but not as dramatic. It only has a few
> CPUs. On a machine that tops out with 32 CPUs, it is more noticeable.
> On the one machine I have that topped out with CFQ/BFQ at 64 threads, the
> latency of flush is vaguely similar:
>
>                                          CFQ                 BFQ            BFQ-TPUT
> latency avg-Flush-64       287.05 (   0.00%)   389.14 ( -35.57%)   349.90 ( -21.90%)
> latency avg-Close-64         0.00 (   0.00%)     0.00 ( -33.33%)     0.00 (   0.00%)
> latency avg-LockX-64         0.01 (   0.00%)     0.01 ( -16.67%)     0.01 (   0.00%)
> latency avg-Rename-64        0.18 (   0.00%)     0.21 ( -16.39%)     0.18 (   3.28%)
> latency avg-ReadX-64         0.10 (   0.00%)     0.15 ( -40.95%)     0.15 ( -40.95%)
> latency avg-WriteX-64        0.86 (   0.00%)     0.81 (   6.18%)     0.74 (  13.75%)
> latency avg-Unlink-64        1.49 (   0.00%)     1.52 (  -2.28%)     1.14 (  23.69%)
> latency avg-UnlockX-64       0.00 (   0.00%)     0.00 (   0.00%)     0.00 (   0.00%)
> latency avg-NTCreateX-64     0.26 (   0.00%)     0.30 ( -16.15%)     0.21 (  19.62%)
>
> So, different figures to yours but the general observation that flush
> latency is higher holds.
>
>> Anyway, to investigate this regression more in depth, I took two
>> further steps. First, I repeated the same test with bfq-sq, my
>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>> from the changes needed for bfq to live in blk-mq). I got:
>>
>> <SNIP>
>>
>> So, with both bfq and deadline there seems to be a serious regression,
>> especially on MaxLat, when moving from legacy block to blk-mq. The
>> regression is much worse with deadline, as legacy-deadline has the
>> lowest max latency among all the schedulers, whereas mq-deadline has
>> the highest one.
>>
>
> I wouldn't worry too much about max latency simply because a large
> outlier can be due to multiple factors and it will be variable.
> However, I accept that deadline is not necessarily great either.
>
>> Regardless of the actual culprit of this regression, I would like to
>> investigate this issue further. In this respect, I would like to ask
>> for a little help. I would like to isolate the workloads generating
>> the highest latencies.
>> To this purpose, I had a look at the loadfile
>> client-tiny.txt, and I still have a doubt: is every item in the
>> loadfile executed somehow several times (for each value of the number
>> of clients), or is it executed only once? More precisely, IIUC, for
>> each operation reported in the above results, there are several items
>> (lines) in the loadfile. So, is each of these items executed only
>> once?
>>
>
> The load file is executed multiple times. The normal loadfile was
> basically just the same commands, or very similar commands, run multiple
> times within a single load file. This made the workload too sensitive to
> the exact time the workload finished and too coarse.
>
>> I'm asking because, if it is executed only once, then I guess I can
>> find the critical tasks more easily. Finally, if it is actually
>> executed only once, is it expected that the latency for such a task is
>> one order of magnitude higher than the average latency for
>> that group of tasks? I mean, is such a task intrinsically much
>> heavier, and then expectedly much longer, or is the fact that latency
>> is much higher for this task a sign that something in the kernel
>> misbehaves for that task?
>>
>
> I don't think it's quite as easily isolated. It's all the operations in
> combination that replicate the behaviour. If it was just a single operation
> like "fsync" then it would be fairly straight-forward but the full mix
> is relevant as it matters when writeback kicks off, when merges happen,
> how much dirty data was outstanding when writeback or sync started etc.
>
> I see you've made other responses to the thread so rather than respond
> individually:
>
> o I've queued a subset of tests with Ming's v3 patchset as that was the
>   latest branch at the time I looked. It'll take quite some time to execute
>   as the grid I use to collect data is backlogged with other work
>
> o I've included pgioperf this time because it is good at demonstrating
>   oddities related to fsync.
>   Granted it's mostly simulating a database
>   workload that is typically recommended to use deadline scheduler but I
>   think it's still a useful demonstration
>
> o If you want a patch set queued that may improve workload pattern
>   detection for dbench then I can add that to the grid with the caveat that
>   results take time. It'll be a blind test as I'm not actively debugging
>   IO-related problems right now.
>
> o I'll keep an eye out for other workloads that demonstrate empirically
>   better performance given that a stopwatch and desktop performance is
>   tough to quantify even though I'm typically working in other areas. While
>   I don't spend a lot of time on IO-related problems, it would still
>   be preferred if switching to MQ by default was a safe option so I'm
>   interested enough to keep it in mind.
>

Hi Mel,
thanks for your thorough responses (I'm about to write something about
the read-write unfairness issue, with, again, some surprise).

I want to reply only to your last point above. With our
responsiveness benchmark of course you don't need a stopwatch, but,
yes, to get some minimally comprehensive results you need a machine
with at least a desktop application like a terminal installed.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs
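
For anyone who wants to recompute the percentage columns in the
CFQ/BFQ/BFQ-TPUT comparison quoted above, here is a minimal Python
sketch. It assumes that each percentage is the change relative to the
CFQ baseline, with the sign chosen so that lower latency reads as a
gain. That assumption is inferred from the quoted figures, not taken
from the mmtests source, and the name relative_gain is hypothetical.

```python
def relative_gain(baseline: float, value: float) -> float:
    """Percentage change vs. baseline for a lower-is-better metric.

    Positive means `value` is lower (better) than `baseline`.
    Assumed formula, inferred from the quoted table, not from mmtests.
    """
    if baseline == 0:
        # Avoid dividing by zero for rows whose baseline prints as 0.00.
        return 0.0
    return round((baseline - value) / baseline * 100.0, 2)


# Average Flush latency at 64 clients, copied from the quoted table.
flush_cfq = 287.05
flush_bfq = 389.14
flush_bfq_tput = 349.90

print("BFQ      vs CFQ:", relative_gain(flush_cfq, flush_bfq))
print("BFQ-TPUT vs CFQ:", relative_gain(flush_cfq, flush_bfq_tput))
```

The Flush row is reproduced by this formula. Rows with very small
latencies are not, presumably because the table prints values rounded
to two decimals while the percentages were computed from unrounded
data: WriteX, for instance, gives 5.81 from the printed 0.86 and 0.81,
against the 6.18% shown.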