Re: [RFC] What about the writeback hinting patches?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, 15 Sep 2018, Kai Krakow wrote:
> Hey!
> 
> Am Sa., 15. Sep. 2018 um 21:47 Uhr schrieb Coly Li <colyli@xxxxxxx>:
> > On 9/15/18 10:35 PM, Kai Krakow wrote:
> > > In the past (I think back in 4.9), Eric sent some interesting writeback
> > > hinting patches which I still carry around rebasing them with every
> > > kernel version. I'm not sure if everything is still okay with these
> > > patches or if I missed something during rebase, but my system is stable
> > > with these.
> > >
> > > Background: These patches allow processes with low IO priority to bypass
> > > the bcache and thus reduce write load on the SSD. This is useful for
> > > maintenance and backup jobs. It also ensures that such jobs won't
> > > dominate the bcache and increases the chance of keeping important data
> > > in the bcache. OTOH, important and IO sensitive processes can be made to
> > > always use the bcache even when sequential cutoff would normally read
> > > directly from the backing device bypassing bcache.
> > >
> > > I wonder if there's still activity around this? Eric seems to have been
> > > very silent AFAIT.
> > >
> > > Are there chances to get this upstream? I remember there was some
> > > discussion around this. If someone is interested, I can keep these
> > > patches rebased in a public repo - or maybe someone with more insight in
> > > bcache offers to do that.
> > >
> > > Any thoughts on this? I think, Eric made some valuable patches which
> > > allow lowering wear on SSDs.
> >
> > Hi Kai,
> >
> > Last time when this series was discussed, I remember there was opinion
> > that the interface was complicated.
> >
> > IMHO we can simplify the interface by only input class level number to
> > the sysfs interface, e.g.
> >
> >       echo 7 > /sys/block/bcache0/bcache/ioprio_bypass
> >       echo 0 > /sys/block/bcache0/bcache/ioprio_writeback
> >
> >
> >  From Eric's document, behavior of I/O class 1 and 3 are always same,
> > the only thing can be changed is for I/O class 2. I may handle it in
> > other following changes.
> 
> I don't think this is the case. But the documentation may be
> misleading in this regard. You may specify
> 
> echo 1,3 > /sys/block/bcache0/bcache/ioprio_bypass
> 
> to bypass the bcache if ioprio is class 1 prio >= 3, which would also
> bypass class 2, and 3.

Kai is correct: Since realtime supports prio levels, you wouldn't be able 
to differentiate between realtime-7 and best-effort-7 if you tried to make 
this a single value. The class is still important so idle or realtime IO 
can be hinted at independently from best-effort.  

Just because the ioprio 'data' is unused for idle doesn't mean we should 
drop support for both idle and realtime classes---that doesn't make sense 
to me.  Indeed, one may wish to bypass on idle (3,0) but not on best 
effort (2,7).

> At least this is what admin-guide/bcache.rst tells me.
> 
> But I wonder what happens if you specify something which would hint
> bypass and writeback at the same time. Probably, in this case sysfs
> should just reply with "invalid argument" or whatever is a reasonable
> choice.

Writeback is a stronger hint, setting both will always writeback and emit 
the warning below.

+		if (ioprio_valid(dc->ioprio_writeback)
+			&& ioprio_valid(dc->ioprio_bypass)
+			&& dc->ioprio_writeback >= dc->ioprio_bypass)
+			pr_warning(
+				"warning: ioprio_writeback hint is neither disabled nor higher priority than the bypass hint; "
+				"will always writeback!");
 
> > In general this series is OK to me, we just need to design a simpler
> > interface. I may add them into my for-test directory, to test and
> > understand how it works, and then improve it if necessary.

Yes, thank you for putting them in your for-test directory.  This is a 
simple implementation that saves us from burining out our SSD with 
unnecessary erase cycles during backup!

> Given the documentation, I don't see how this could be made simpler.
> It's just not as simple to understand.
> 
> But if we think of class+prio as a 2-digit number (an octal number if
> you want to), as CP with C = class and P = prio, then it becomes
> pretty simple:
> 
> process 0CP < writeback 0CP --> always writeback
> process 0CP >= bypass 0CP --> always bypass

I agree, I'm not sure how to make the interface simpler.  ioprio already 
implements class and value as a masked integer with class as an integer 
with masks and shifts, so conceptually this is just an integer compare:

include/linux/ioprio.h:18
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | data)

This code just uses a comma (,) to separate class from prio.
 
> writeback 0CP >= bypass 0CP is obviously an invalid setting.

Indeed, but it is still safe: it will just always writeback, and there is 
already a pr_warning for this, see above.
 
> (leading zero indication octal notation)
> 
> That is, the example gives
> 
> 027 = bypass
> 000 = writeback
> 
> Since a process whatever priority would never have 0CP < 000, that's
> why it's disabled in this case. But nothing prevents me from putting
> 1,3 into the bypass config. It would result in the realtime class prio
> >= 3 being bypassed.
> 
> Hinting of writeback probably defaults to disabled because (a) no one
> wants it without explicit knowledge, and distribution do not use class
> 1 for any process AFAIT, so it makes no sense to set writeback=2,0 or
> even below.
> 
> B4 is the default setting of linux processes (best-effort, class 2,
> prio 4). I've set some processes to B0. I'd probably configure
> writeback=2,1 for those, or maybe even 2,4.
> 
> But, conclusion: It can be changed. :-)


Below you can see Christoph's response to the original post, with my 
comments added today:

On 07/05/2017 11:47 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > From: Eric Wheeler <git@xxxxxxxxxxxxxxxxxx>
> >
> > Add sysfs entries to support to hint for bypass/writeback by the ioprio
> > assigned to the bio.  If the bio is unassigned, use current's io-context
> > ioprio for cache writeback or bypass (configured per-process with
> > `ionice`).
> >
> > Having idle IOs bypass the cache can increase performance elsewhere
> > since you probably don't care about their performance.  In addition,
> > this prevents idle IOs from promoting into (polluting) your cache and
> > evicting blocks that are more important elsewhere.
> >
> > If you really nead the performance at the expense of SSD wearout,
> > then configure ioprio_writeback and set your `ionice` appropriately.
> >
> > For example:
> > 	echo 2,7 > /sys/block/bcache0/bcache/ioprio_bypass
> > 	echo 2,0 > /sys/block/bcache0/bcache/ioprio_writeback
> >
> > See the documentation commit for details.
> 
> I'm really worried about this interface, as it basically uses the
> ioprio field for side channel communication - your app must know
> which value it wants, and you need to configure bcache to fit
> exacltly that scheme.

In most cases the app doesn't know what it wants, users use `ionice` to 
configure the app.  Since ionice is about handling priorities (and 
therefore performance), it is reasonable to use ioprio for bypass hinting 
since it affects the app performance.  Either way, the administrator 
configures both ioprio with `ionice` and bcache via sysfs.

I've been toying with the idea of a cgroup implementation, but this 
doesn't change the fact that an admin still defines the cgroup of the 
app, just as they would define the ioprio with ionice.  These days, 
however, I don't have the bandwidth to write a cgroup implementation---and 
the ioprio implementation is simple and well tested by the community since 
2016.

> > +	/* If the ioprio already exists on the bio, use that.  We assume that
> > +	 * the upper layer properly assigned the calling process's ioprio to
> > +	 * the bio being passed to bcache. Otherwise, use current's ioc. */
> 
> Please make this fit the normal kernel comment style.

I said I would do this in my original response to Christoph, but I've not 
been able to.  Perhaps someone can cleanup the comment style.
 
> > +	ioprio = bio_prio(bio);
> > +	if (!ioprio_valid(ioprio)) {
> > +		ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE);
> > +		if (ioc) {
> > +			if (ioprio_valid(ioc->ioprio))
> > +				ioprio = ioc->ioprio;
> > +			put_io_context(ioc);
> > +			ioc = NULL;
> > +		}
> > +	}
> 
> While get_task_io_context currently is exported it really should not
> be - we should only allocate on when setting the io priority or when
> forking.
> 
> What this code really wants is the ioprio related lines of code from
> blk_init_request_from_bio, which should be factored into a new helper.

I said I would do this in my original response to Christoph, but I've not 
been able to.  Perhaps someone can look into refactoring bits from 
blk_init_request_from_bio into a new helper and submit that with the 
patchset.

> > +	if (ioprio_valid(ioprio) && ioprio_valid(dc->ioprio_writeback)
> > +		&& ioprio >= dc->ioprio_bypass) {
> > +		return true;
> > +	}
> 
> Incorrect indentation, this shold be:
> 
> 	if (ioprio_valid(ioprio) && ioprio_valid(dc->ioprio_writeback) &&
> 	    ioprio >= dc->ioprio_bypass)
> 		return true;
> 
> And there is some more of this in this and the following patches.
> Please run them through something like checkpatch.pl

Indeed, checkpatch.pl is a good idea.  Perhaps someone can go through 
these and do any cleanup related to the patches to submit a v2.

> > +
> >  SHOW(__bch_cached_dev)
> >  {
> >  	struct cached_dev *dc = container_of(kobj, struct cached_dev,
> > @@ -183,6 +186,17 @@ SHOW(__bch_cached_dev)
> >  		return strlen(buf);
> >  	}
> >  
> > +	if (attr == &sysfs_ioprio_bypass)
> > +		return snprintf(buf, PAGE_SIZE-1, "%d,%ld\n",
> > +			IOPRIO_PRIO_CLASS(dc->ioprio_bypass),
> > +			IOPRIO_PRIO_DATA(dc->ioprio_bypass));
> > +
> > +	if (attr == &sysfs_ioprio_writeback)
> > +		return snprintf(buf, PAGE_SIZE-1, "%d,%ld\n",
> > +			IOPRIO_PRIO_CLASS(dc->ioprio_writeback),
> > +			IOPRIO_PRIO_DATA(dc->ioprio_writeback));
> > +
> > +
> 
> Please implement separate sysfs show and store function for your new
> attributes instead of overloading all of them into a giant mess.

These follow the existing formatting of SHOW(__bch_cached_dev), so I 
followed suit---but I think he is referring to a different part of the 
patch in STORE(__cached_dev) where I use a ioprio_hint pointer to test and 
adjust the natural language formatting for pr_info notices.  This was 
intended to reduce duplicate code.  In fact, it was originally written 
separately but I reduced it using the (expr ? true : false) operator to 
make the code more readable and less prone to copy-paste errors, or so I 
had hoped.

+	/* ioprio hinting: we use ioprio_hint to reduce duplicate printk verbiage */
+	if (attr == &sysfs_ioprio_writeback) {
+		ioprio_hint = &dc->ioprio_writeback;
+		ioprio_type = "writeback";
+	}
+
+	if (attr == &sysfs_ioprio_bypass) {
+		ioprio_hint = &dc->ioprio_bypass;
+		ioprio_type = "bypass";
+	}
[...]
+		if (!ioprio_valid(*ioprio_hint))
+			pr_info("disabled ioprio_%s hints.", ioprio_type);
+		else
+			pr_info("set hint for cache %s with priority %s: (class,level) = (%u,%u)",
+				ioprio_type,
+				( ioprio_hint == &dc->ioprio_writeback ? "at-or-above" : "at-or-below" ),
+				ioprio_class, ioprio_level);


In any event, perhaps someone else can pickup any changes to the code to 
get this mainlined since I do not have the bandwidth to tweak and test.  

I'm glad others are getting good use out of it, too!


--
Eric Wheeler
https://www.linuxglobal.com



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux ARM Kernel]     [Linux Filesystem Development]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux