On Sat, 15 Sep 2018, Kai Krakow wrote: > Hey! > > Am Sa., 15. Sep. 2018 um 21:47 Uhr schrieb Coly Li <colyli@xxxxxxx>: > > On 9/15/18 10:35 PM, Kai Krakow wrote: > > > In the past (I think back in 4.9), Eric sent some interesting writeback > > > hinting patches which I still carry around rebasing them with every > > > kernel version. I'm not sure if everything is still okay with these > > > patches or if I missed something during rebase, but my system is stable > > > with these. > > > > > > Background: These patches allow processes with low IO priority to bypass > > > the bcache and thus reduce write load on the SSD. This is useful for > > > maintenance and backup jobs. It also ensures that such jobs won't > > > dominate the bcache and increases the chance of keeping important data > > > in the bcache. OTOH, important and IO sensitive processes can be made to > > > always use the bcache even when sequential cutoff would normally read > > > directly from the backing device bypassing bcache. > > > > > > I wonder if there's still activity around this? Eric seems to have been > > > very silent AFAIT. > > > > > > Are there chances to get this upstream? I remember there was some > > > discussion around this. If someone is interested, I can keep these > > > patches rebased in a public repo - or maybe someone with more insight in > > > bcache offers to do that. > > > > > > Any thoughts on this? I think, Eric made some valuable patches which > > > allow lowering wear on SSDs. > > > > Hi Kai, > > > > Last time when this series was discussed, I remember there was opinion > > that the interface was complicated. > > > > IMHO we can simplify the interface by only input class level number to > > the sysfs interface, e.g. > > > > echo 7 > /sys/block/bcache0/bcache/ioprio_bypass > > echo 0 > /sys/block/bcache0/bcache/ioprio_writeback > > > > > > From Eric's document, behavior of I/O class 1 and 3 are always same, > > the only thing can be changed is for I/O class 2. I may handle it in > > other following changes. > > I don't think this is the case. But the documentation may be > misleading in this regard. You may specify > > echo 1,3 > /sys/block/bcache0/bcache/ioprio_bypass > > to bypass the bcache if ioprio is class 1 prio >= 3, which would also > bypass class 2, and 3. Kai is correct: Since realtime supports prio levels, you wouldn't be able to differentiate between realtime-7 and best-effort-7 if you tried to make this a single value. The class is still important so idle or realtime IO can be hinted at independently from best-effort. Just because the ioprio 'data' is unused for idle doesn't mean we should drop support for both idle and realtime classes---that doesn't make sense to me. Indeed, one may wish to bypass on idle (3,0) but not on best effort (2,7). > At least this is what admin-guide/bcache.rst tells me. > > But I wonder what happens if you specify something which would hint > bypass and writeback at the same time. Probably, in this case sysfs > should just reply with "invalid argument" or whatever is a reasonable > choice. Writeback is a stronger hint, setting both will always writeback and emit the warning below. + if (ioprio_valid(dc->ioprio_writeback) + && ioprio_valid(dc->ioprio_bypass) + && dc->ioprio_writeback >= dc->ioprio_bypass) + pr_warning( + "warning: ioprio_writeback hint is neither disabled nor higher priority than the bypass hint; " + "will always writeback!"); > > In general this series is OK to me, we just need to design a simpler > > interface. I may add them into my for-test directory, to test and > > understand how it works, and then improve it if necessary. Yes, thank you for putting them in your for-test directory. This is a simple implementation that saves us from burining out our SSD with unnecessary erase cycles during backup! > Given the documentation, I don't see how this could be made simpler. > It's just not as simple to understand. > > But if we think of class+prio as a 2-digit number (an octal number if > you want to), as CP with C = class and P = prio, then it becomes > pretty simple: > > process 0CP < writeback 0CP --> always writeback > process 0CP >= bypass 0CP --> always bypass I agree, I'm not sure how to make the interface simpler. ioprio already implements class and value as a masked integer with class as an integer with masks and shifts, so conceptually this is just an integer compare: include/linux/ioprio.h:18 #define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | data) This code just uses a comma (,) to separate class from prio. > writeback 0CP >= bypass 0CP is obviously an invalid setting. Indeed, but it is still safe: it will just always writeback, and there is already a pr_warning for this, see above. > (leading zero indication octal notation) > > That is, the example gives > > 027 = bypass > 000 = writeback > > Since a process whatever priority would never have 0CP < 000, that's > why it's disabled in this case. But nothing prevents me from putting > 1,3 into the bypass config. It would result in the realtime class prio > >= 3 being bypassed. > > Hinting of writeback probably defaults to disabled because (a) no one > wants it without explicit knowledge, and distribution do not use class > 1 for any process AFAIT, so it makes no sense to set writeback=2,0 or > even below. > > B4 is the default setting of linux processes (best-effort, class 2, > prio 4). I've set some processes to B0. I'd probably configure > writeback=2,1 for those, or maybe even 2,4. > > But, conclusion: It can be changed. :-) Below you can see Christoph's response to the original post, with my comments added today: On 07/05/2017 11:47 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote: > > From: Eric Wheeler <git@xxxxxxxxxxxxxxxxxx> > > > > Add sysfs entries to support to hint for bypass/writeback by the ioprio > > assigned to the bio. If the bio is unassigned, use current's io-context > > ioprio for cache writeback or bypass (configured per-process with > > `ionice`). > > > > Having idle IOs bypass the cache can increase performance elsewhere > > since you probably don't care about their performance. In addition, > > this prevents idle IOs from promoting into (polluting) your cache and > > evicting blocks that are more important elsewhere. > > > > If you really nead the performance at the expense of SSD wearout, > > then configure ioprio_writeback and set your `ionice` appropriately. > > > > For example: > > echo 2,7 > /sys/block/bcache0/bcache/ioprio_bypass > > echo 2,0 > /sys/block/bcache0/bcache/ioprio_writeback > > > > See the documentation commit for details. > > I'm really worried about this interface, as it basically uses the > ioprio field for side channel communication - your app must know > which value it wants, and you need to configure bcache to fit > exacltly that scheme. In most cases the app doesn't know what it wants, users use `ionice` to configure the app. Since ionice is about handling priorities (and therefore performance), it is reasonable to use ioprio for bypass hinting since it affects the app performance. Either way, the administrator configures both ioprio with `ionice` and bcache via sysfs. I've been toying with the idea of a cgroup implementation, but this doesn't change the fact that an admin still defines the cgroup of the app, just as they would define the ioprio with ionice. These days, however, I don't have the bandwidth to write a cgroup implementation---and the ioprio implementation is simple and well tested by the community since 2016. > > + /* If the ioprio already exists on the bio, use that. We assume that > > + * the upper layer properly assigned the calling process's ioprio to > > + * the bio being passed to bcache. Otherwise, use current's ioc. */ > > Please make this fit the normal kernel comment style. I said I would do this in my original response to Christoph, but I've not been able to. Perhaps someone can cleanup the comment style. > > + ioprio = bio_prio(bio); > > + if (!ioprio_valid(ioprio)) { > > + ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE); > > + if (ioc) { > > + if (ioprio_valid(ioc->ioprio)) > > + ioprio = ioc->ioprio; > > + put_io_context(ioc); > > + ioc = NULL; > > + } > > + } > > While get_task_io_context currently is exported it really should not > be - we should only allocate on when setting the io priority or when > forking. > > What this code really wants is the ioprio related lines of code from > blk_init_request_from_bio, which should be factored into a new helper. I said I would do this in my original response to Christoph, but I've not been able to. Perhaps someone can look into refactoring bits from blk_init_request_from_bio into a new helper and submit that with the patchset. > > + if (ioprio_valid(ioprio) && ioprio_valid(dc->ioprio_writeback) > > + && ioprio >= dc->ioprio_bypass) { > > + return true; > > + } > > Incorrect indentation, this shold be: > > if (ioprio_valid(ioprio) && ioprio_valid(dc->ioprio_writeback) && > ioprio >= dc->ioprio_bypass) > return true; > > And there is some more of this in this and the following patches. > Please run them through something like checkpatch.pl Indeed, checkpatch.pl is a good idea. Perhaps someone can go through these and do any cleanup related to the patches to submit a v2. > > + > > SHOW(__bch_cached_dev) > > { > > struct cached_dev *dc = container_of(kobj, struct cached_dev, > > @@ -183,6 +186,17 @@ SHOW(__bch_cached_dev) > > return strlen(buf); > > } > > > > + if (attr == &sysfs_ioprio_bypass) > > + return snprintf(buf, PAGE_SIZE-1, "%d,%ld\n", > > + IOPRIO_PRIO_CLASS(dc->ioprio_bypass), > > + IOPRIO_PRIO_DATA(dc->ioprio_bypass)); > > + > > + if (attr == &sysfs_ioprio_writeback) > > + return snprintf(buf, PAGE_SIZE-1, "%d,%ld\n", > > + IOPRIO_PRIO_CLASS(dc->ioprio_writeback), > > + IOPRIO_PRIO_DATA(dc->ioprio_writeback)); > > + > > + > > Please implement separate sysfs show and store function for your new > attributes instead of overloading all of them into a giant mess. These follow the existing formatting of SHOW(__bch_cached_dev), so I followed suit---but I think he is referring to a different part of the patch in STORE(__cached_dev) where I use a ioprio_hint pointer to test and adjust the natural language formatting for pr_info notices. This was intended to reduce duplicate code. In fact, it was originally written separately but I reduced it using the (expr ? true : false) operator to make the code more readable and less prone to copy-paste errors, or so I had hoped. + /* ioprio hinting: we use ioprio_hint to reduce duplicate printk verbiage */ + if (attr == &sysfs_ioprio_writeback) { + ioprio_hint = &dc->ioprio_writeback; + ioprio_type = "writeback"; + } + + if (attr == &sysfs_ioprio_bypass) { + ioprio_hint = &dc->ioprio_bypass; + ioprio_type = "bypass"; + } [...] + if (!ioprio_valid(*ioprio_hint)) + pr_info("disabled ioprio_%s hints.", ioprio_type); + else + pr_info("set hint for cache %s with priority %s: (class,level) = (%u,%u)", + ioprio_type, + ( ioprio_hint == &dc->ioprio_writeback ? "at-or-above" : "at-or-below" ), + ioprio_class, ioprio_level); In any event, perhaps someone else can pickup any changes to the code to get this mainlined since I do not have the bandwidth to tweak and test. I'm glad others are getting good use out of it, too! -- Eric Wheeler https://www.linuxglobal.com