Re: [PATCH] block: BFQ default for single queue devices

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Wed, 3 Oct 2018 08:53:03 +0000

Linus,

On 2018/10/03 17:28, Linus Walleij wrote:
> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
> 
>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>> cases in order to guarantee sequential write command delivery to the device
>> driver. Having the default changed to bfq, which as far as I know is not SMR
>> friendly (can sequential writes within a single zone be reordered ?) is asking
>> for troubles (unaligned write errors showing up).
> 
> Ah, that is interesting.
> 
> Which device driver files are we talking about here, specifically?
> I'd like to take a look.

Currently, sd.c (SCSI disk) as well as null_blk can expose host-managed zoned
block devices.

> I guess what you say is not that you are looking for the deadline
> scheduling per se (as in deadline scheduling is nice), what you want is
> the zone locking semantics in that scheduler, is that right?

Yes, correct. The scheduling policy in itself does not really matter, but it must
not deviate from the mandatory HM write policy: "within a sequential write
required zone, writes must be issued sequentially". That could somewhat impact
the scheduler code itself if said scheduler thinks that not dispatching
sequential writes in sequence is a good idea :) No sane scheduler would do that
though (at least on HDDs), so the impact on the scheduler code itself is limited.
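
To make the constraint concrete, here is a minimal userspace sketch of how a
sequential write required zone behaves. It is only a model, not driver or
device code: struct zone_model and zone_model_write() are hypothetical names
made up for illustration, and the zone geometry is whatever the caller fills in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one sequential write required zone. */
struct zone_model {
	uint64_t start;	/* first sector of the zone */
	uint64_t len;	/* zone size in sectors */
	uint64_t wp;	/* write pointer: the only sector a write may start at */
};

/*
 * Returns true if the write is accepted. Returning false models the
 * device failing the command (e.g. with an unaligned write error).
 */
static bool zone_model_write(struct zone_model *z, uint64_t sector,
			     uint64_t nr_sectors)
{
	assert(z != NULL);

	/* A write must start exactly at the write pointer. */
	if (sector != z->wp)
		return false;

	/* A write must not cross the end of the zone. */
	if (nr_sectors > z->start + z->len - z->wp)
		return false;

	z->wp += nr_sectors;
	return true;
}
```

With this model, two writes to sectors 0 and 8 succeed only when issued in
that order. If anything in the stack swaps them, the write to sector 8 arrives
while the write pointer is still at 0 and is rejected.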

> I.e. this business:
> blk_queue_is_zoned(q)
> blk_req_zone_write_lock(rq);
> blk_req_zone_write_unlock(rq);
> and mq-deadline solves this with a spinlock.

Yes. These are the helper functions handling the zone write locking. They
simplify the scheduler's task of limiting the number of in-flight write requests
to at most one per zone at any time. This is the trick to avoid write reordering
stack-wide, since most of the time reordering is caused not by the scheduler
itself, but by the blk-mq (or legacy path) code around it (e.g. requeue due to
resource shortage, multiple contexts running the queues, etc).
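
As a rough illustration of the idea (not the kernel helpers themselves), the
"one write in flight per zone" rule can be modeled with one lock bit per zone.
Everything named here (struct zone_locks, zone_write_trylock(),
zone_write_unlock(), NR_ZONES, ZONE_SECTORS) is hypothetical, and the 256 MiB
zone size is only an assumption. The sketch is single-threaded; real code
would need atomic bit operations or a lock around the bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_ZONES	64u
#define ZONE_SECTORS	524288u	/* assumed: 256 MiB zones, 512-byte sectors */

/* Bit i set means zone i already has a write in flight. */
struct zone_locks {
	uint64_t locked;
};

static unsigned int zone_of(uint64_t sector)
{
	return (unsigned int)(sector / ZONE_SECTORS);
}

/* Try to take the write lock of the zone containing 'sector'. */
static bool zone_write_trylock(struct zone_locks *zl, uint64_t sector)
{
	unsigned int zno = zone_of(sector);
	uint64_t bit;

	assert(zno < NR_ZONES);
	bit = UINT64_C(1) << zno;

	if (zl->locked & bit)
		return false;	/* a write to this zone is still in flight */

	zl->locked |= bit;
	return true;
}

/* Release the zone write lock once the write has completed. */
static void zone_write_unlock(struct zone_locks *zl, uint64_t sector)
{
	unsigned int zno = zone_of(sector);

	assert(zno < NR_ZONES);
	zl->locked &= ~(UINT64_C(1) << zno);
}
```

Because a second write to the same zone cannot be issued until the first one
completes, a requeue or a concurrent queue run has nothing to reorder it with.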

> I will augment the patch to enforce mq-deadline
> if blk_queue_is_zoned(q) is true, as it is clear that
> any device with that characteristic must use mq-deadline.
> 
> Paoly might be interested in looking into whether BFQ could
> also handle zoned devices in the future, I have no idea of how
> hard that would be.

It was rather easy with deadline, but the scheduler code was simple to start
with. Basically, the only thing needed is to skip, on dispatch, any write request
to a zone that is already locked (i.e. a write is already ongoing). For reads,
there are no constraints so nothing needs to be changed. Zone unlocking must be
done on completion of the write request, so the scheduler completion method needs
to change a little too.
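
The two changes described above could look roughly like the following
userspace sketch. This is not the mq-deadline code: the FIFO array, struct
sketch_req, struct sketch_sched and the sketch_* functions are all hypothetical,
and the zone size is an arbitrary assumption. It only shows the shape of the
change: skip writes to locked zones on dispatch, unlock on completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_ZONES		8u
#define SKETCH_ZONE_SECTORS	1024u	/* assumed zone size, in sectors */

struct sketch_req {
	uint64_t sector;
	bool is_write;
	bool dispatched;
};

struct sketch_sched {
	struct sketch_req *fifo;	/* requests in submission order */
	size_t nr;
	bool zone_locked[SKETCH_NR_ZONES];
};

static size_t sketch_zone_of(const struct sketch_req *rq)
{
	size_t zno = (size_t)(rq->sector / SKETCH_ZONE_SECTORS);

	assert(zno < SKETCH_NR_ZONES);
	return zno;
}

/* Pick the next request, skipping writes that target a locked zone. */
static struct sketch_req *sketch_dispatch(struct sketch_sched *s)
{
	for (size_t i = 0; i < s->nr; i++) {
		struct sketch_req *rq = &s->fifo[i];

		if (rq->dispatched)
			continue;
		if (rq->is_write) {
			size_t zno = sketch_zone_of(rq);

			if (s->zone_locked[zno])
				continue;	/* a write is already ongoing */
			s->zone_locked[zno] = true;
		}
		/* Reads are never held back. */
		rq->dispatched = true;
		return rq;
	}
	return NULL;
}

/* Completion path: release the zone so its next write can go out. */
static void sketch_complete(struct sketch_sched *s, struct sketch_req *rq)
{
	if (rq->is_write)
		s->zone_locked[sketch_zone_of(rq)] = false;
}
```

Since every write to a locked zone is skipped, the writes of a given zone
still leave the scheduler in submission order, one at a time, while reads and
writes to other zones keep flowing.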

> The zoned business seems a bit fragile. Should it even be
> allowed to select any other scheduler than deadline on these
> devices? Presenting all compiled in schedulers in
> /sysblock/device/queue/scheduler sounds like just giving
> sysadmins too much rope.

Yes, that is debatable. But the above "one write per zone" trick can also be
handled by an application, which makes using any other scheduler OK. Look at the
recent SMR support code in fio from Bart. Some of the tests (in t/zbd) for some
I/O patterns are just fine with any scheduler. In fact any pattern is OK because
fio is SMR aware and never issues more than one write per zone. That is however
true only for I/O sizes that are small enough not to cause the kernel to
generate multiple BIOs for a single I/O call. Otherwise, deadline and
mq-deadline become necessary.

But I agree, it is a little fragile. Application developers and sysadmins really
need to know what will be running on the disk to make the right choice. And
knowing that is not necessarily straightforward.

Best regards.

-- 
Damien Le Moal
Western Digital Research

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/