Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

OK, here's some data:  http://jar.lyle.org/~mlyle/writeback/

The complete test script is there to automate running writeback
scenarios -- NOTE: DON'T RUN IT WITHOUT EDITING THE DEVICES FOR YOUR
HARDWARE.

Only one run each way, but they take 8-9 minutes to run, so we can
easily get more ;)  I compared patches 1-3 (which are uncontroversial)
to patches 1-5.
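
Roughly, the harness looks like the sketch below (a minimal sketch
only; the device names and fio parameters here are placeholders I've
picked for illustration -- see the actual script at the URL above for
the real configuration):

  #!/bin/bash
  # Sketch only -- EDIT THE DEVICE NAMES BEFORE RUNNING ANYTHING.
  BCACHE_DEV=/dev/bcache0                  # assumed bcache device
  BCACHE_SYSFS=/sys/block/bcache0/bcache
  BACKING=sdb                              # assumed backing spinning disk

  # Sample the backing disk and the dirty counter once per second in the
  # background, for the whole front-end phase plus the writeback-only tail.
  iostat -x "$BACKING" 1 > iostat.log &
  IOSTAT_PID=$!
  ( while true; do
        echo "$(date +%s) $(cat $BCACHE_SYSFS/dirty_data)"
        sleep 1
    done ) > dirty.log &
  DIRTY_PID=$!

  # Front-end phase: small random writes through the cache device, with
  # writeback competing for the backing disk at the same time.
  fio --name=frontend --filename="$BCACHE_DEV" --direct=1 \
      --ioengine=libaio --rw=randwrite --bs=8k --iodepth=32 --size=30g

  # Writeback-only tail: keep sampling for another 100 seconds.
  sleep 100
  kill "$IOSTAT_PID" "$DIRTY_PID"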

Concerns I've heard:

- The new patches will contend for I/O bandwidth with front-end writes:

No:
 3 PATCHES: write: io=29703MB, bw=83191KB/s, iops=10398, runt=365618msec
vs
 5 PATCHES: write: io=29746MB, bw=86177KB/s, iops=10771, runt=353461msec

It may actually be slightly better-- 3% or so.

- The new patches will not improve writeback rate.

No:

3 PATCHES: the active period of the test was 366+100=466 seconds, and
at the end there was 33.4G dirty.
5 PATCHES: the active period of the test was 353+100=453 seconds, and
at the end there was 32.7G dirty.

This is a moderate improvement.
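
(The dirty figures can be read from the backing device's dirty_data
counter in sysfs; assuming the bcache device shows up as bcache0:

  # Amount of dirty data remaining for the backing device (assumes the
  # bcache device is bcache0); re-run or wrap in watch(1) to see it drain.
  cat /sys/block/bcache0/bcache/dirty_data
)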

- The IO scheduler can combine the writes anyway, so this type of
patch will not increase write queue merging.

No:

Average wrqm/s is 1525.4 in the 3 PATCHES dataset; average wrqm/s is
1643.7 in the 5 PATCHES dataset.

During the last 100 seconds, when ONLY WRITEBACK is occurring, wrqm/s
is 1398.0 with 3 PATCHES, and 1811.6 with 5 PATCHES.
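
(wrqm/s is the write-merges column from iostat -x.  Here's a sketch of
how the averages can be pulled out of a log like the one the harness
above writes -- the column position moves around between sysstat
versions, so it's located by name; the device name sdb is again a
placeholder:

  # Average the wrqm/s column for the backing disk from an iostat -x log.
  # Note: the first report is the since-boot average and could be skipped
  # for a cleaner number.
  awk -v dev=sdb '
      /wrqm\/s/ { for (i = 1; i <= NF; i++) if ($i == "wrqm/s") col = i }
      $1 == dev && col { sum += $col; n++ }
      END { if (n) printf "average wrqm/s: %.1f\n", sum / n }
  ' iostat.log
)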

- Front-end latency will suffer:

No:

The datasets look the same to my eye.  By far the worst thing is the
occasional 1000ms+ stretch where bcache goes to sleep in both scenarios
while contending for the writeback lock (not affected by these patches,
but an item for future work if I ever get to move on to a new topic).

Conclusion:

These patches provide a small but significant improvement in writeback
rates that can be seen with careful testing that produces actual
sequential writeback.  They also lay the groundwork for further
improvements, such as plugging the block layer and allowing
accelerated writeback when the device is idle.

Mike

On Fri, Oct 6, 2017 at 4:09 AM, Michael Lyle <mlyle@xxxxxxxx> wrote:
> Hannes--
>
> Thanks for your input.
>
> Assuming there's contiguous data to write back, the dataset size is
> immaterial; writeback gathers 500 extents from the btree and writes
> back up to 64 of them at a time.  With 8k extents, the amount of data
> the writeback code is juggling at any one time is about 4 megabytes
> at maximum.
>
> Optimizing writeback only does something significant when the chunks
> to write back are relatively small, and when there are actually
> extents next to each other to write back.
>
> If there are big chunks, the spinning disk takes a long time to write
> each one, and that time gives both the drive itself (with native
> command queueing) and the IO scheduler plenty of opportunity to
> combine the writes.  Not to mention that even if there is a small
> delay due to a non-sequential / short seek, the difference in
> performance is minimal, because 512k extents tie up the disk for a
> long time.
>
> Also, I think the test scenario doesn't really have any adjacent
> extents to write back, which doesn't help.
>
> I will forward performance data and complete scripts to run a
> reasonable scenario.
>
> Mike
>
> On Fri, Oct 6, 2017 at 4:00 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
>> On 10/06/2017 12:42 PM, Michael Lyle wrote:
>>> Coly--
>>>
>>> Holy crap, I'm not surprised you don't see a difference if you're
>>> writing with a 512K size!  The potential benefit from merging is much
>>> less, and the odds of missing a merge are much smaller.  512KB is 5ms
>>> of sequential transfer by itself on a 100MB/sec disk -- lots more time
>>> to wait to get the next chunks in order, and even if you fail to
>>> merge, the potential benefit is much less: if the difference is mostly
>>> rotational latency from failing to merge, then we're talking 5ms vs
>>> 5+2ms.
>>>
>>> Do you even understand what you are trying to test?
>>>
>>> Mike
>>>
>>> On Fri, Oct 6, 2017 at 3:36 AM, Coly Li <i@xxxxxxx> wrote:
>>>> On 2017/10/6 at 5:20 PM, Michael Lyle wrote:
>>>>> Coly--
>>>>>
>>>>> I did not say the result from the changes will be random.
>>>>>
>>>>> I said the result from your test will be random, because where the
>>>>> writeback position creates non-contiguous holes in the data is
>>>>> nondeterministic -- it depends on where it is on the disk at the
>>>>> instant that writeback begins.  There is a high degree of dispersion
>>>>> in the test scenario you are running that is likely to exceed the
>>>>> differences from my patch.
>>>>
>>>> Hi Mike,
>>>>
>>>> I did the test quite carefully. Here is how I ran the test:
>>>> - disable writeback by echoing 0 to writeback_running.
>>>> - write random data into the cache until it is full (or half full),
>>>> then stop the I/O immediately.
>>>> - echo 1 to writeback_running to start writeback.
>>>> - record performance data at once.
>>>>
>>>> The position where writeback starts may be random, but there should
>>>> not be too much difference in the statistical number of contiguous
>>>> blocks (on the cached device).  Because fio just sends random 512KB
>>>> blocks onto the cache device, the statistical number of contiguous
>>>> blocks depends on the cache device vs. cached device size, and on
>>>> how full the cache device is.
>>>>
>>>> Indeed, I repeated some tests more than once (except the md raid5 and
>>>> md raid0 configurations), and the results are quite stable when I
>>>> look at the data charts; no big difference.
>>>>
>>>> If you feel the performance results I provided are problematic, it
>>>> would be better to let the data talk.  You need to show your
>>>> performance test numbers to prove that the bio reorder patches are
>>>> helpful for general workloads, or at least helpful to many typical
>>>> workloads.
>>>>
>>>> Let the data talk.
>>>>
>>
>> I think it would be easier for everyone concerned if Coly could attach
>> the fio script / cmdline and the bcache setup here.
>> There still is a chance that both are correct, as different hardware
>> setups are being used.
>> We've seen this many times trying to establish workable performance
>> regression metrics for I/O; depending on the hardware, one set of
>> optimisations fails to deliver the expected benefit on other platforms.
>> Just look at the discussion we're having currently with Ming Lei on the
>> SCSI mailing list trying to improve sequential I/O performance.
>>
>> But please, everyone, try to calm down.  It's not that Coly is
>> deliberately blocking your patches; it's just that he doesn't see the
>> performance benefit on his side.
>> It might be that he's using the wrong parameters, but then that should
>> be clarified once the fio script is posted.
>>
>> At the same time I don't think that the size of the dataset is
>> immaterial. Larger datasets take up more space, and inevitably add more
>> overhead just for looking up the data in memory. Plus Coly has some
>> quite high-powered NVMe for the caching device, which will affect
>> writeback patterns, too.
>>
>> Cheers,
>>
>> Hannes
>> --
>> Dr. Hannes Reinecke                Teamlead Storage & Networking
>> hare@xxxxxxx                                   +49 911 74053 688
>> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
>> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
>> HRB 21284 (AG Nürnberg)



