>> We could have better hysteresis though, so we're not doing that slow
>> steady trickle of writes.

There is nothing between /dev/md0 and /dev/bcache0: the entire array is
cached, no partitions. LVM is set up on top of bcache, and iostat shows
the "first" traffic at /dev/md0. While the trickle is going on, there's
no traffic on the bcache device or the LVM partitions.

I have now modified sequential_cutoff to ensure that everything is
cached (though an 800K write was cached even before). I have documented
some logs in the original post here:
http://unix.stackexchange.com/questions/329477/what-is-grinding-my-hdds-and-how-do-i-stop-it

If this is not the source of the tiny writes to the array, can you
suggest where else I could start looking?

Thanks,
Jure

On Wed, Dec 21, 2016 at 11:22 PM, Kent Overstreet
<kent.overstreet@xxxxxxxxx> wrote:
> On Wed, Dec 21, 2016 at 02:36:02PM +0100, Jure Erznožnik wrote:
>> Hello,
>>
>> I apologise if this is something known, but my searching across the
>> internet has revealed no answer for my issue, so I am attempting to
>> find one here.
>>
>> uname -a: Linux htpc 4.8.0-32-generic #34-Ubuntu SMP Tue Dec 13
>> 14:30:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>> bcache-tools version: 1.0.8-2 (as provided in the Ubuntu yakkety apt
>> repository)
>>
>> I have placed bcache in writeback mode over an mdadm array, followed
>> by LVM and the actual volumes that are then used by various services.
>> The problem I'm experiencing is that for every write I make into the
>> array, bcache makes small periodic writes to the backing device: a
>> few KB every second (less than 20 KB/s).
>>
>> All bcache parameters are at their defaults; here I list the
>> writeback-relevant ones:
>> writeback_delay=30
>> writeback_percent=10
>> writeback_rate=512 (soon reverts to 512 even if changed)
>> writeback_rate_d_term=30
>> writeback_rate_p_term_inverse=6000
>> writeback_rate_update_seconds=5
>> writeback_running=5
>>
>> I don't see how writeback would be running every second, unless
>> that's implied by writeback_rate. Increasing that to a large value
>> temporarily causes the cache to flush much faster, thus reducing the
>> number of disk "clicks". It reverts to 512 again as soon as
>> dirty_data goes below the large value.
>>
>> Looking at writeback_rate_debug when the one-second flushes start, I
>> can see that a few kilobytes are being flushed each second. Values of
>> the "writeback_rate_debug->dirty" field during one such session:
>> 880k, 784k, 624k, 524k, 460k, 408k, 300k, 160k, 128k (128k remains
>> and doesn't get flushed).
>>
>> I'm not sure what size one block is, but I configured the cache
>> device with a 4KB block size, so here's what I expected to happen:
>> 30 seconds after the 880k write to disk, writeback should trigger and
>> write up to 512*4KB = 2MB of data to the disk. Since the write was
>> only 880k, it would be written in one go. Instead I got at least 8
>> writes, each with only a few kilobytes.
>>
>> I have three questions about this:
>> 1. What am I missing? Why does the data get flushed so slowly? These
>> flushes can take hours for larger writes, causing the disks to
>> constantly work at only kilobytes per second.
>
> It's because when writeback_percent is nonzero, we try to keep some
> amount of dirty data in the cache: the assumption is that recent
> writes are more likely to either be overwritten or to have new data
> written that's contiguous or nearly contiguous, so we'll do less work
> if we delay writeback.
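For reference, a minimal sketch of how the writeback state discussed
here can be inspected from userspace. It assumes the device names used
in this thread (/dev/bcache0 layered on /dev/md0); the dm-0 name for
the LVM volume is only an example.

    # Writeback tunables and the current amount of dirty data.
    cat /sys/block/bcache0/bcache/writeback_percent      # target share of the cache kept dirty
    cat /sys/block/bcache0/bcache/dirty_data             # dirty data currently held in the cache
    cat /sys/block/bcache0/bcache/writeback_rate_debug   # rate, target and controller terms
    cat /sys/block/bcache0/bcache/sequential_cutoff      # sequential bypass threshold

    # Confirm which layer the one-second trickle is actually hitting by
    # watching the backing device, the bcache device and an LVM volume
    # side by side (extended stats, one-second interval).
    iostat -x md0 bcache0 dm-0 1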
>
> We could have better hysteresis though, so we're not doing that slow
> steady trickle of writes.
>
>> 2. I'd like bcache to flush the dirty data (entirely) ASAP after the
>> writeback_delay. How can I tell it to do that?
>
> Set writeback_percent to 0.
>
> The downside, though, is that scanning for dirty data when there's
> very little dirty data is expensive, and we have to block foreground
> writes while we're scanning - so doing that will adversely affect
> performance.
>
>> 3. Is it possible to configure it such that the flushing would only
>> take place if the backing device wasn't under heavy read use at the
>> time? I don't mind dirty data residing on the SSD if that allows for
>> faster overall operation.
>
> Unfortunately, we don't have anything like that implemented.
>
> That would be a really nice feature, but it'd be difficult to get
> right, since it requires knowing the future (if we issue this write,
> will it end up blocking a read? To answer that, we have to know
> whether a read is going to come in before the write completes). We
> can guess - we can estimate how much read traffic is going to come in
> in the next few seconds based on how much read traffic we've seen
> recently, on the assumption that read traffic is bursty - on
> timescales long enough to be useful - and not completely random.
> However, this would mean we'd be adding yet another feedback control
> loop to writeback - such things are tricky to get right, and adding
> another would make the overall behaviour of writeback even more
> complicated and difficult to understand and debug.
>
> Ideally, we'd be able to just issue writeback writes with an
> appropriate IO priority and the IO scheduler would do the right
> thing - it simply wouldn't issue writeback writes while there was a
> higher-priority read to be issued (that is, any foreground read).
>
> Unfortunately, this doesn't work in practice because of the writeback
> caching that disk drives do: the (kernel-side) IO scheduler has no
> ability to schedule writes, because writes just go into the disk's
> write cache and the disk itself schedules them later (and the disk
> has no knowledge of IO priorities).
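As a practical follow-up to the writeback_percent suggestion above,
here is a minimal sketch of flushing the dirty data on demand and then
restoring the default target afterwards. The /dev/bcache0 path and the
10% default are assumptions based on the values quoted in this thread;
adjust them to match your setup.

    # Drop the dirty-data target to zero so bcache writes everything back.
    # Assumes the bcache device is /dev/bcache0.
    echo 0 > /sys/block/bcache0/bcache/writeback_percent

    # Watch the dirty data drain (values are human-readable, e.g. "880.0k").
    watch -n1 cat /sys/block/bcache0/bcache/dirty_data

    # Once it has drained, restore the 10% default so foreground writes
    # are not slowed down by constant scanning for dirty data.
    echo 10 > /sys/block/bcache0/bcache/writeback_percent

Leaving writeback_percent at 0 permanently would stop the trickle
entirely, but, as noted above, it makes bcache scan for dirty data
while blocking foreground writes, so restoring the target after the
flush is the safer default.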