Re: Questions on block drivers, REQ_FLUSH and REQ_FUA

Alex Bligh <alex@xxxxxxxxxxx> · Wed, 25 May 2011 09:06:20 +0100

--On 24 May 2011 18:32:20 -0400 Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:

On Tue, May 24, 2011 at 10:29:09PM +0100, Alex Bligh wrote:

[..]
Q3: Apparently there are no longer concepts of barriers, just REQ_FLUSH
and REQ_FUA. REQ_FLUSH guarantees all "completed" I/O requests are
written to disk prior to that BIO starting. However, what about
non-completed I/O requests? For instance, is the following legitimate:

[see diagram duplicated below, snipped to save space]

Here WRITE1 was not 'completed', and thus by the text of
Documentation/writeback_cache_control.txt, need not be written to disk
before starting WRITE3 (which had REQ_FLUSH attached).
...
I presume this is illegal and is a documentation issue.

I know very little about flush semantics but will still try to answer
two of your questions.

I think the documentation is fine. It specifically talks about completed
requests, that is, requests which have been sent to the drive (and may
be in the controller's cache).

So in the above example, if the driver holds back WRITE1 and never
signals completion of that request, then I think it is fine to complete
the WRITE3+FLUSH ahead of WRITE1.

I think an issue will arise only if you signaled that WRITE1 has
completed and cached it in the driver (as you seem to be indicating)
without ever sending it to the drive, and then received the WRITE3 +
FLUSH request. In that case you have to make sure that by the time
WRITE3 + FLUSH completion is signaled, WRITE1 is on the disk.

That conforms to the documentation, but the reason why I think it
is unlikely is that from the kernel's point of view, there is
no difference in effect between what I suggested:

      Receive        Send to disk         Reply
      =======        ============         =====
      WRITE1
      WRITE2
                                          WRITE2 (cached)
      FLUSH+WRITE3
                     WRITE2
                     WRITE3
                                          WRITE3
      WRITE4
                     WRITE4
                                          WRITE4
                     WRITE1
                                          WRITE1

and what the kernel is trying to avoid:

      Receive        Send to disk         Reply
      =======        ============         =====
      WRITE1 (processed write1, send to writeback cache, do not reply)
      WRITE2
                                          WRITE2 (cached)
      FLUSH+WRITE3
                     WRITE2
                     WRITE3
                                          WRITE3
      WRITE4
                     WRITE4
                                          WRITE4
                     WRITE1
                                          WRITE1

I.e. I can't see how a strict reading of the specification gains the
kernel anything.

IIUC, you are right. You can finish WRITE4 before completing FLUSH+WRITE3
here.

We just need to make sure that any request completed by the driver
is on disk by the time FLUSH+WRITE3 completes.

OK, that's less surprising as the kernel still gains something.
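
To make the rule above concrete, here is a small user-space model of the
sequence in the diagrams. It is only a sketch: CachingDriver and its
method names are hypothetical, invented for this illustration, and are
not kernel or nbd interfaces. It assumes the only guarantee is that any
write already acknowledged from cache is on disk by the time the
FLUSH+WRITE completion is signalled, while a write that was held back
and never acknowledged (WRITE1) may land later.

```python
class CachingDriver:
    """Toy model of a driver with a volatile write-back cache (hypothetical)."""

    def __init__(self):
        self.held = []     # received, not acknowledged, not on disk
        self.cached = []   # acknowledged to the kernel, still only in cache
        self.disk = []     # order in which writes reached stable storage
        self.replies = []  # order in which completions were signalled

    def receive(self, name, ack_from_cache=False):
        # Either acknowledge straight from cache, or hold the write back.
        if ack_from_cache:
            self.cached.append(name)
            self.replies.append(name)
        else:
            self.held.append(name)

    def flush_write(self, name):
        # FLUSH part: every write already acknowledged must reach the disk
        # before this request's completion is signalled.
        self.disk.extend(self.cached)
        self.cached.clear()
        # WRITE part: the payload follows the flush.
        self.disk.append(name)
        self.replies.append(name)

    def write_through(self, name):
        # Plain write that goes to disk before it is acknowledged.
        self.disk.append(name)
        self.replies.append(name)

    def finish_held(self, name):
        # A held-back write finally reaches the disk and is acknowledged.
        self.held.remove(name)
        self.disk.append(name)
        self.replies.append(name)


def replay_example():
    """Replay the Receive / Send to disk / Reply sequence from the diagram."""
    drv = CachingDriver()
    drv.receive("WRITE1")                       # held, no reply yet
    drv.receive("WRITE2", ack_from_cache=True)  # replied while only cached
    drv.flush_write("WRITE3")                   # forces WRITE2 out first
    drv.write_through("WRITE4")
    drv.finish_held("WRITE1")                   # lands last, replied last
    return drv
```

In this model the disk order is WRITE2, WRITE3, WRITE4, WRITE1, and the
flush never has to wait for WRITE1 because no completion was signalled
for it before the flush arrived.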

Are you writing a bio-based driver? For a request-based driver, the
request queue should break the FLUSH + WRITE3 request down into two
parts: issue the FLUSH first and, when that completes, issue WRITE3.
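
As a rough illustration of that two-part breakdown, the sketch below
lists the steps a request-based driver would see for a given combination
of flags. The function name and the step labels are hypothetical, chosen
for this example only. The FUA handling (a trailing flush when the
device has no native FUA support) is my reading of
Documentation/writeback_cache_control.txt, not something stated in this
thread.

```python
def split_flush_request(has_flush, has_fua, has_data, device_supports_fua=False):
    """Return the ordered steps for one request (illustrative labels only)."""
    steps = []
    if has_flush:
        # Empty flush first: it must complete before the payload is issued.
        steps.append("PREFLUSH")
    if has_data:
        if has_fua and device_supports_fua:
            # Device can write this payload through its cache by itself.
            steps.append("DATA+FUA")
        else:
            steps.append("DATA")
            if has_fua:
                # Assumed emulation: flush again so the payload is stable.
                steps.append("POSTFLUSH")
    return steps


# FLUSH+WRITE3 from the example: flush first, then the write.
print(split_flush_request(has_flush=True, has_fua=False, has_data=True))
```

Under these assumptions the driver never receives a flush and a payload
in the same request, so it only has to handle an empty flush and an
ordinary write.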

Currently it's request-based (in fact the kernel bit of it is based on nbd
at the moment), though I could rewrite to make it bio based.

The characteristics I have are: large variance in time to complete a given
operation, desirability of ordering of requests by block number (i.e. the
elevator is useful to me), large operations being disproportionately
cheaper than small ones, and parallelisation of requests giving huge
benefits (i.e. I can write many, many blocks in parallel).

If a request-based driver is a bad structure, I could relatively easily
rewrite (it mostly lives in userland at the moment, and the kernel bit is
quite small). We'd get a bio-based nbd out of it too for free (I have no
idea whether that would be an advantage though I note loop has gone
make_request_function based).

--
Alex Bligh