Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

Hi Coly--

We talked a bunch on IRC, so I'd just like to answer a couple of the
things again here for the benefit of everyone else / further
discussion.

On Mon, Oct 9, 2017 at 11:58 AM, Coly Li <i@xxxxxxx> wrote:
> I do observe the strong side of your patches: when the I/O blocksize is
> <=256kB, I see a great advantage in writeback performance when bios are
> reordered. Especially when the blocksize is 8K, writeback performance is
> at least 3x (because without your patch the writeback is too slow and I
> gave up after hours).
>
> The performance regression happens when the fio blocksize is increased to
> 512K and the dirty data to 900GB. And when the fio blocksize is increased
> to 1MB and the dirty data on the cache to 900GB, the writeback performance
> regression becomes easy to recognize.
>
> An interesting behavior I observed is that, for a large blocksize and a
> large amount of dirty data, writeback performance without the bio reorder
> patches is much higher than with them. An example:
> http://blog.coly.li/wp-content/uploads/2017/10/writeback_throughput_on_linear_900_1800G_cache_half.png
>
> For the first 15 minutes, bcache without bio reorder performs much better
> than with it. After 15 minutes, the writeback rates all decrease to a
> similar level. That is to say, most of the performance regression happens
> at the beginning, when writeback starts.

I have looked at this and I believe I understand why.  Patch 4 changes
how sequential I/O is issued.  It limits how much writeback will be
issued at a time, whereas the previous code was willing to issue any
amount:

+                       if (size >= MAX_WRITESIZE_IN_PASS)
+                               break;

MAX_WRITESIZE_IN_PASS corresponds to 2500 KBytes, so it hits the
limit after two 1MB blocks are issued and then goes back to delaying.

The reason I put these lines here is to prevent issuing a set of
writebacks so large that they tie up the backing disk for a very long
time and hurt interactive performance.  We could make it tunable,
though I'm hesitant to add too many tunables, as that makes testing
and verifying all of this difficult.
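
To make the batching behaviour concrete, here is a simplified,
userspace-style sketch of the kind of per-pass limit Patch 4 adds.  The
struct and function names are invented for illustration, and the
5000-sector cap is just the 2500 KB figure above expressed in 512-byte
sectors:

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_WRITESIZE_IN_PASS 5000 /* sectors, i.e. ~2500 KB */

    struct dirty_extent {
            uint64_t offset;   /* start sector on the backing device */
            uint64_t sectors;  /* length in sectors */
    };

    /*
     * How many pending extents get issued in one pass: stop as soon
     * as the next extent is not LBA-contiguous with the previous one,
     * or the accumulated size reaches the per-pass cap.
     */
    static size_t extents_in_pass(const struct dirty_extent *e, size_t n)
    {
            uint64_t size = 0;
            size_t i;

            for (i = 0; i < n; i++) {
                    if (i && e[i].offset != e[i - 1].offset + e[i - 1].sectors)
                            break;  /* no longer contiguous */

                    size += e[i].sectors;
                    if (size >= MAX_WRITESIZE_IN_PASS)
                            return i + 1;  /* cap reached: issue what we have */
            }

            return i;
    }

The point of the early return is that two 1MB extents are already
enough to reach the cap, which is exactly the behaviour described
above.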

> All the tests are under ideal conditions: no writeback happens while fio
> writes dirty data onto the cache device. In more typical situations, with
> fewer LBA-contiguous dirty blocks on the cache, I guess the writeback
> regression might be more obvious.

I think it is because of this conditional, which only applies to
LBA-contiguous extents (the contiguity check in the sketch above).  We
could remove the conditional, but it's there to try and preserve
interactive/front-end performance during writeback.

> When dirty blocks are not LBA-contiguous on the cache device, I don't
> worry about small dirty block sizes, because, as you explained clearly in
> previous emails, the worst case is performance falling back to the numbers
> without the bio reorder patch. But large blocksizes and large amounts of
> dirty data are the common case when bcache is used for distributed
> computing/storage systems like Ceph or Hadoop (multiple hard drives
> attached to a large SSD cache, with object file sizes normally from 4MB
> to 128MB).

I don't think an SSD cache will help the big-block portion of workloads
like this much, as they're not access-time constrained but bandwidth
constrained.  That's why the default configuration of bcache is for
>=4MB sequential I/O to bypass the cache and go directly to the
spinning disks.

Even if the SSD has twice the bandwidth of the spinning disk array,
everything written back needs to be written to the SSD and then later
read for writeback, so it's no quicker at steady state.
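
To put rough, made-up numbers on that: say the SSD moves 1000 MB/s in
total (reads plus writes) and the spinning array 500 MB/s.  At steady
state every megabyte of writeback costs one SSD write, one SSD read,
and one backing write, so the SSD can sustain at most 1000 / 2 = 500
MB/s of writeback traffic -- no faster than writing to the spinning
disks directly.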

[snip]
> I raise a green flag for your bio reorder patches :-) I do wish I had seen
> such performance data the first time around :P
>
> *One last question*: Could you please consider an option to
> enable/disable the bio reorder code in sysfs? It could be enabled by
> default. When people care about writeback of large amounts of dirty data,
> they could choose to disable the bio reorder policy.

I don't think it's the bio reorder code itself that is responsible, but
rather the change in how we delay for writeback.  I wrote the new code
to try to be a better compromise between writeback rate and interactive
performance.
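
For readers who have not looked at the series: the general shape of
that compromise is to issue a bounded batch and then sleep long enough
that the long-run average matches the target writeback rate, leaving
idle gaps for front-end I/O in between.  A trivial illustration of that
idea (not the actual bcache code; names and units are made up):

    #include <stdint.h>

    /*
     * After writing `sectors` in one pass, wait this long (in ms)
     * before starting the next pass, so the average throughput stays
     * at or below `rate` sectors per second while the disk is left
     * idle in between for foreground requests.
     */
    static uint64_t next_delay_ms(uint64_t sectors, uint64_t rate)
    {
            if (!rate)
                    return 0;  /* unthrottled */

            return sectors * 1000 / rate;
    }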

My biggest concern is that both the old code and the new code seem to
be writing back at only about 25-33% of the speed your disks should be
capable of in this case.  I think that's a bigger issue than the small
difference in performance in this specific scenario.

Mike


