Re: bcache making tiny writes to backing device every second

May I ask this another way:

echo 65536 > /sys/block/bcache0/bcache/writeback_rate

This causes bcache to flush the dirty data almost instantly, thereby
eliminating the issue. Afterwards, writeback_rate reverts to 512.

Other than creating a cron job, is there any way I could set a minimum
writeback_rate of 65536?
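For example, would a udev rule that re-applies the value whenever the
bcache device appears be the right approach? A sketch, assuming the
device is bcache0 and a hypothetical rules file (though the controller
presumably recomputes the rate afterwards, so this alone wouldn't pin a
minimum):

  # /etc/udev/rules.d/99-bcache-writeback.rules (hypothetical path)
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="bcache0", ATTR{bcache/writeback_rate}="65536"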

Thanks,
Jure

On Thu, Dec 22, 2016 at 8:25 AM, Jure Erznožnik
<jure.erznoznik@xxxxxxxxx> wrote:
>>> We could have better hysteresis though, so we're not doing that slow steady trickle of writes.
>
> There is nothing between /dev/md0 and /dev/bcache0; the entire array
> is cached, with no partitions. LVM is set up on top of bcache, and
> iostat shows the "first" traffic at /dev/md0. While the trickle is
> going on, there is no traffic on the bcache device or the LVM
> partitions. I have now modified sequential_cutoff to ensure that
> everything is cached (though an 800K write was cached even before).
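>
> (For reference, the iostat view above comes from watching extended
> per-device statistics once a second - a minimal sketch, assuming the
> sysstat iostat:
>
>   iostat -x 1
>
> The trickle shows up on md0 and its member disks, but not on bcache0
> or the LVM volumes.)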
>
> I have documented some logs in the original post here:
> http://unix.stackexchange.com/questions/329477/what-is-grinding-my-hdds-and-how-do-i-stop-it
>
> If this is not the source of tiny writes to the array, can you suggest
> where else I could start looking?
>
> Thanks,
> Jure
>
> On Wed, Dec 21, 2016 at 11:22 PM, Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
>> On Wed, Dec 21, 2016 at 02:36:02PM +0100, Jure Erznožnik wrote:
>>> Hello,
>>>
>>> I apologise if this is something known, but my searches across the
>>> internet have turned up no answer to this issue, so I am attempting
>>> to find one here.
>>>
>>> uname -a: Linux htpc 4.8.0-32-generic #34-Ubuntu SMP Tue Dec 13
>>> 14:30:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> bcache-tools version: 1.0.8-2 (as provided in ubuntu yakkety apt repository)
>>>
>>> I have placed bcache in writeback mode over an mdadm array, followed
>>> by LVM and the actual volumes that are then used by various services.
>>> The problem I'm experiencing is that after every write I make to the
>>> array, bcache makes small periodic writes, a few KB every second
>>> (less than 20 KB/s), to the backing device.
>>>
>>> All bcache parameters are at their defaults; here are the writeback-relevant ones:
>>> writeback_delay=30
>>> writeback_percent=10
>>> writeback_rate=512 (soon reverts to 512 even if changed)
>>> writeback_rate_d_term=30
>>> writeback_rate_p_term_inverse=6000
>>> writeback_rate_update_seconds=5
>>> writeback_running=5
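>>>
>>> (A quick way to dump all of these at once, assuming the device is
>>> bcache0:
>>>
>>>   grep . /sys/block/bcache0/bcache/writeback_*
>>>
>>> This prints each writeback attribute's filename and current value.)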
>>>
>>> I don't see why writeback would be running every second, unless
>>> that's implied by writeback_rate. Increasing writeback_rate to a
>>> large value temporarily causes the cache to flush much faster, thus
>>> reducing the number of disk "clicks". It reverts to 512 again as soon
>>> as dirty_data drops below the large value.
>>>
>>> Looking at writeback_rate_debug when the one-second flushes start, I
>>> can see that a few kilobytes are being flushed each second. Values of
>>> the "dirty" field in writeback_rate_debug during one such session:
>>> 880k, 784k, 624k, 524k, 460k, 408k, 300k, 160k, 128k (the last 128k
>>> remains and doesn't get flushed).
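>>>
>>> (A simple way to watch those values tick down, assuming the same
>>> device name:
>>>
>>>   watch -n 1 cat /sys/block/bcache0/bcache/writeback_rate_debug
>>>
>>> This re-reads the debug file every second.)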
>>>
>>> I'm not sure what size one block is, but I configured the cache
>>> device with a 4KB block size, so here's what I expected to happen:
>>> 30 seconds after the 880k write to disk, writeback should trigger and
>>> write up to 512*4KB = 2MB of data to the disk. Since the write was
>>> only 880k, it would be written in one go. Instead, I got at least 8
>>> writes, each of only a few kilobytes.
>>>
>>> I have three questions about this:
>>> 1. What am I missing? Why does the data get flushed so slowly? These
>>> flushes can take hours for larger writes, leaving the disks
>>> constantly working at only a few kilobytes per second.
>>
>> It's because when writeback_percent is nonzero, we try to keep some amount of
>> dirty data in the cache: the assumption is that recent writes are more likely to
>> either be overwritten, or to have new data written that's contiguous or nearly
>> contiguous, so we'll do less work if we delay writeback.
>>
>> We could have better hysteresis though, so we're not doing that slow steady
>> trickle of writes.
>>
>>> 2. I'd like bcache to flush the dirty data (entirely) ASAP after the
>>> writeback_delay. How can I tell it to do that?
>>
>> Set writeback_percent to 0.
>>
>> The downside, though, is that scanning for dirty data when there's
>> very little of it is expensive, and we have to block foreground writes
>> while we're scanning - so doing that will adversely affect performance.
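>>
>> (Concretely, using the same sysfs layout as above and assuming the
>> device is bcache0:
>>
>>   echo 0 > /sys/block/bcache0/bcache/writeback_percent
>>
>> Writing 10 back restores the default from the list above.)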
>>
>>> 3. Is it possible to configure it such that the flushing would only
>>> take place if the backing device wasn't under heavy read use at the
>>> time? I don't mind dirty data residing on the SSD if that allows for
>>> faster overall operation.
>>
>> Unfortunately, we don't have anything like that implemented.
>>
>> That would be a really nice feature, but it'd be difficult to get right, since
>> it requires knowing the future (if we issue this write, will it end up blocking
>> a read? To answer that, we have to know if a read is going to come in before the
>> write completes). We can guess - we can estimate how much read traffic is going
>> to come in in the next few seconds based on how much read traffic we've seen
>> recently, on the assumption that read traffic is bursty - on timescales long
>> enough to be useful - and not completely random. However, this would mean we'd
>> be adding yet another feedback control loop to writeback - such things are
>> tricky to get right, and adding another would make the overall behaviour of
>> writeback even more complicated and difficult to understand and debug.
>>
>> Ideally, we'd be able to issue writeback writes with an appropriate IO
>> priority and the IO scheduler would just do the right thing - it
>> wouldn't issue writeback writes if there were a higher-priority read
>> to be issued (that is, any foreground read).
>>
>> Unfortunately, this doesn't work in practice because of the writeback
>> caching that disk drives do: the (kernel-side) IO scheduler has no
>> ability to schedule writes, because writes just go into the disk's
>> write cache, and the disk itself schedules them later (and the disk
>> has no knowledge of IO priorities).
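>>
>> (For reference, the drive's volatile write cache can usually be
>> queried, and at a performance cost disabled, with hdparm - a sketch,
>> assuming an ATA disk at /dev/sda:
>>
>>   hdparm -W /dev/sda      # query the current write-cache setting
>>   hdparm -W 0 /dev/sda    # turn the drive's write cache off
>>
>> Support varies by drive, and disabling the cache typically hurts write
>> throughput considerably.)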