On Tue, 27 Sep 2016, Kai Krakow wrote:

> On Tue, 27 Sep 2016 08:43:09 +0700, Pavel Goran <via-bcache@xxxxxxxxxxxx> wrote:
>
> > Hello Eric,
> >
> > Tuesday, September 27, 2016, 6:17:22 AM, you wrote:
> >
> > > Add support to bcache hinting functions and sysfs to hint by the
> > > ioprio of 'current' which can be configured with `ionice`.
> > >
> > > Cache hinting is configurable by writing 'class,level' pairs to
> > > sysfs. These are the defaults:
> > >   echo 2,7 > /sys/block/bcache0/bcache/ioprio_bypass
> > >   echo 2,0 > /sys/block/bcache0/bcache/ioprio_writeback
> > >
> > > (-p) IO Class      (-n) Class level    Action
> > > -----------------------------------------------------
> > > (1) Realtime       0-7                 Writeback
> > > (2) Best-effort    0                   Writeback
> > > (2) Best-effort    1-6                 Original bcache logic
> > > (2) Best-effort    7                   Bypass cache
> > > (3) Idle           n/a                 Bypass cache
> >
> > Not sure it's a good idea, at all. If I set cache policy to, say,
> > write-through, then I expect write-back to never happen, regardless
> > of what userspace does with IO priority.

It respects the policy first, of course. You can see the top of
should_writeback() in the patch as context:

        if (cache_mode != CACHE_MODE_WRITEBACK || [...]
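The ioprio hints are only consulted below that test. To give a feel for
the comparison itself, here is a rough sketch in plain C (not the patch
code; the tunable names are invented for illustration) of how a
'class,level' pair maps onto the packed ioprio encoding, where a
numerically larger value means a lower priority:

        /*
         * Sketch only, not the patch code.  The tunable names
         * (hint_bypass_prio, hint_writeback_prio) are invented here.
         * The packing mirrors the kernel's ioprio encoding: class in
         * the high bits, level in the low bits, so a numerically
         * larger value is a *lower* priority.
         */
        #define IOPRIO_CLASS_SHIFT      13
        #define IOPRIO_PRIO_VALUE(c, d) (((c) << IOPRIO_CLASS_SHIFT) | (d))

        enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT,
               IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };

        /* "echo 2,7 > .../ioprio_bypass" would store BE,7 */
        static unsigned int hint_bypass_prio =
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7);
        /* "echo 2,0 > .../ioprio_writeback" would store BE,0 */
        static unsigned int hint_writeback_prio =
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0);

        /* At or below the bypass threshold: BE,7 and all of idle. */
        static int ioprio_hints_bypass(unsigned int ioprio)
        {
                return ioprio >= hint_bypass_prio;
        }

        /* At or above the writeback threshold: BE,0 and all of realtime.
         * (A real implementation also has to handle IOPRIO_CLASS_NONE,
         * i.e. tasks that never called ionice; glossed over here.) */
        static int ioprio_hints_writeback(unsigned int ioprio)
        {
                return ioprio <= hint_writeback_prio;
        }

With the default thresholds, realtime and best-effort-0 IOs hint
writeback, best-effort-7 and idle IOs hint bypass, and best-effort 1-6
fall through to the normal bcache heuristics, which is exactly the table
quoted above.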
> > Similarly, using low IO priority (idle or best effort-7) should not
> > make IO *slow regardless of any other IO load* (which would happen if
> > cache is completely bypassed).

Perhaps I should explain my motivation for writing such a patch:

We run tens of terabytes through bcache in writeback mode each week for
backup verifies. Since most verifies read snapshots, the access pattern
looks random, so bcache pollutes the cache with data that it will never
read again. Thus, we really needed an ioprio_bypass implementation to
save on SSD wearout costs.

We have VMs that do a lot of writing but for which IO latency is not an
issue, so ioprio_bypass reduces the erase-cycle load on the SSDs for
those processes as well. There are a few VMs that benefit from 100%
caching, so ioprio_writeback can help with their performance.

Unnecessary cache reads and writes cause two problems:

  1. Performance degrades because the cache is polluted with data we do
     not need cached.

  2. It costs real money when SSDs wear out. I have no idea how many
     extra SSDs we've purchased because of pollution, on top of the cost
     of the caching that actually provides the performance we need.

To address your question about making IO slow regardless of other IO: I
tested this, and even if a read is in an IO class flagged for bypass
(lower priority than ioprio_bypass), it will still be served from the
cache if the data is already hot. In addition, such a bypass-flagged
read of a block that is in the cache does not get demoted; it stays in
the cache for the processes that need it. If the block is not already
cached, the read is served from the backing device and will not promote
into (and pollute) your cache.

Having idle IOs bypass the cache can increase performance everywhere
else, since you probably don't care about the performance of idle IOs.

> In the end, "idle" means that you don't care about the performance of
> the process anyways.

Exactly. If an IO is idle, then chances are you've made it idle because
you care more about other tasks' IO performance. If idle IO did not
bypass the cache, it would pollute the cache and leave less room for the
hot data you actually care about.

> I think the kernel already supports cache hinting in the sense of
> "don't pollute my cache with these file operations". It should maybe
> better hook in with that but I think this was already discussed here.
> The outcome was that mixing different cache levels is also not a good
> idea.

I believe you are referring to the page cache, and yes, it supports
hinting through fadvise and madvise. I have actually investigated this,
and others have already tried. The best route for plumbing it through is
fiddling page table bits to store ioprio bits, but everyone considers
that a bit of a hack. See this thread where others have tried and been
rejected:
  https://lkml.org/lkml/2014/10/29/698

As it turns out, fadvise and madvise are too granular for most cases.
Usually system administrators are the ones making choices about cache
hierarchy and process priority, whereas software developers are
generally the ones implementing fadvise/madvise calls inside their own
code. The only way to push fadvise/madvise down onto a process from the
outside is to hook the system calls with something ugly like LD_PRELOAD;
similar hacks already exist, such as libeatmydata.so, which disables
synchronous writes for existing software without patching and
recompiling it.
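Just to illustrate how awkward that route is, an external fadvise shim
in the spirit of libeatmydata would look roughly like the sketch below
(illustration only, not something we would actually deploy): it wraps
close() and drops the file's pages from the page cache on behalf of a
program that never calls fadvise itself.

        /* fadvise-shim.c, sketch only.  Build with:
         *   gcc -shared -fPIC -o fadvise-shim.so fadvise-shim.c -ldl
         * and run a victim process under:
         *   LD_PRELOAD=./fadvise-shim.so some-backup-tool ...
         */
        #define _GNU_SOURCE
        #include <dlfcn.h>
        #include <fcntl.h>
        #include <unistd.h>

        int close(int fd)
        {
                static int (*real_close)(int);

                if (!real_close)
                        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

                /* Ask the page cache to drop this file's pages before
                 * the descriptor goes away; the hint the program itself
                 * never gives. */
                posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

                return real_close(fd);
        }

That works after a fashion, but it is fragile and has to be wrapped
around every tool you run.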
For those of us using direct IO (O_DIRECT), page cache hints are useless
anyway because the page cache is bypassed entirely. That is our use
case: we want to avoid the page cache altogether but still influence the
block-layer cache with IO priorities.

> Currently, sticking to using "ioprio_bypass" would probably be the best
> we can easily get. I strongly object to enable "ioprio_writeback" to
> bypass whatever policy was assigned to the device.
>
> So, +1 for ioprio_bypass, but NACK for ioprio_writeback...

As above, the policy is respected. This ioprio implementation will not
override the cache policy, and it only affects writes when the cache
mode is already writeback.

> Maybe it would make sense to have another IO class like "bulk" which
> could work like "idle" but also bypasses writing to caches and thus
> stops flushing important data out of cache (and as a side-effect also
> reduces flash wearing for bulk operations).

If you really want idle IO that can write back, then I can make the
defaults for ioprio_writeback and/or ioprio_bypass disabled until a user
explicitly sets them. Presently, they are configured the way that I
would use them in our infrastructure.

For us, most IO priorities are in the best-effort class (-c 2), and
levels 1-6 give us plenty of room for organizing priorities. Very few
processes are low latency (-c 1), but those that are need to complete as
soon as possible and are therefore hinted to always write back. It is
also nice to have a non-realtime ioprio of 2,0 for things that need to
write back but do not need the throughput-limiting time slicing that the
realtime class uses to provide its low-latency guarantees.

(One of our systems has only a 13% hit ratio because of pollution. I
look forward to seeing what that metric becomes after tuning the ioprio
parameters.)

The block layer is being refactored right now, and core changes are not
being accepted until the legacy scheduling code has been removed or
deprecated in favor of the blk-mq code. Adding a fourth IO class of
"bulk" would require modifications to the existing IO schedulers that
would certainly be rejected upstream. For now, adding configurable
tunables to bcache is our best option for a feature that can improve
both performance and the longevity of flash devices.

--
Eric Wheeler

> --
> Regards,
> Kai
>
> Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html