Re: ack vs commit

Hi,

I'm not really comfortable with the characterization of consistency in either position articulated here, in particular as applied to non-Ceph protocols such as NFS, AFS, and DCE (to take just the three DFS designs I have the most experience with).  For example, the NFS concept of write stability really does mean stability; it is not satisfied by an ack.  Meanwhile, the NFS "close-to-open" consistency model isn't materially helped by ack (it has always seemed to me a perverse amalgam of POSIX and 'randomly relaxed' behavior intended to make servers appear faster), which may be similar to your opinion of NFS.
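
To make the close-to-open point concrete, here is a minimal sketch of the model using nothing but plain POSIX calls (the path /mnt/nfs/shared.dat is made up; assume it lives on an NFS mount and that the writer and reader halves run on different clients):

/* Illustration only: close-to-open consistency as I understand it. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/nfs/shared.dat";   /* assumed shared NFS mount */
    char buf[32];

    /* Writer (client 1): the data is only guaranteed to reach the server
     * once close() returns successfully. */
    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (wfd < 0) { perror("open(writer)"); return 1; }
    write(wfd, "hello", 5);
    close(wfd);   /* the "close" half of close-to-open */

    /* Reader (normally client 2): open() revalidates the cache, so data
     * written before the writer's close() is guaranteed visible here.
     * A read issued between the writer's write() and close() has no such
     * guarantee, acked or not. */
    int rfd = open(path, O_RDONLY);
    if (rfd < 0) { perror("open(reader)"); return 1; }
    ssize_t n = read(rfd, buf, sizeof(buf) - 1);
    if (n >= 0) { buf[n] = '\0'; printf("reader saw: %s\n", buf); }
    close(rfd);
    return 0;
}

The only visibility guarantee sits at that close/open boundary, which is why an ack on an individual write doesn't buy the model much.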

In the first instance, I have a longstanding interest in exploring alternative consistency models, and in particular I am interested in fine-grained consistency selection under application control.  I'm fairly certain that the current, fixed ack vs. commit distinction is not a sufficient set of protocol primitives on which to build all of the interesting consistency models I can think of.  (I'm not even certain that it is really an orthogonal primitive wrt consistency, as above, and per Sage's remarks wrt CephFS.)  The current ack+commit system, in other words, does not seem capable of satisfying the requirements for assembling an attractive range of consistency models.  The ack state is apparently useful for committing the participants to Ceph's current strong ordering guarantees, while not assuring the client that its operations are durable.
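
For concreteness, here is roughly how that primitive surfaces through the librados C API as I remember it: a completion distinguishes "complete" (the ack) from "safe" (the commit), and a client that never waits on the safe side is effectively trading durability for latency.  This is only a sketch; the pool and object names are made up and error handling is mostly omitted.

#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t comp;
    const char *obj = "ack-vs-commit-test";   /* made-up object name */
    const char *data = "payload";

    rados_create(&cluster, NULL);             /* connect as client.admin */
    rados_conf_read_file(cluster, NULL);      /* default ceph.conf search path */
    if (rados_connect(cluster) < 0) {
        fprintf(stderr, "connect failed\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &io) < 0) {  /* pool assumed to exist */
        fprintf(stderr, "ioctx create failed\n");
        rados_shutdown(cluster);
        return 1;
    }

    /* The completion carries two notifications: "complete" fires on ack
     * (serialized and readable), "safe" fires on commit (durable on all
     * replicas).  Here we pass NULL callbacks and wait explicitly instead. */
    rados_aio_create_completion(NULL, NULL, NULL, &comp);
    rados_aio_write(io, obj, comp, data, strlen(data), 0);

    rados_aio_wait_for_complete(comp);   /* ack: ordered, readable */
    rados_aio_wait_for_safe(comp);       /* commit: durable */
    rados_aio_release(comp);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}

The question I keep coming back to is whether that two-rung ladder is the right set of rungs, or whether applications want to name the consistency point themselves.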

I apologize in advance for errors in fact or logic :).  (Also for top-posting; I wasn't sure where to snip.)

Matt

----- Original Message -----
> From: "Gregory Farnum" <gfarnum@xxxxxxxxxx>
> To: "Sage Weil" <sweil@xxxxxxxxxx>
> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Thursday, December 3, 2015 8:39:04 PM
> Subject: Re: ack vs commit
> 
> On Thu, Dec 3, 2015 at 4:54 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > From the beginning Ceph has had two kinds of acks for RADOS write/update
> > operations: ack (indicating the operation is accepted, serialized, and
> > staged in the OSD's buffer cache) and commit (indicating the write is
> > durable).  The client, if it saw a failure on the OSD before getting the
> > commit, would have to resend the write to the new or recovered OSD,
> > attaching the version it got with the original ack so that the
> > new/recovered OSD could preserve the order.  (This is what the
> > replay_window pool property is for.)
> >
> > The only user for this functionality is CephFS, and only when a file is
> > opened for write or read/write by multiple clients.  When that happens,
> > the clients do all read/write operations synchronously with the OSDs, and
> > effectively things are ordered there, at the object.  The idea is/was to
> > avoid penalizing this sort of IO by waiting for commits when the clients
> > aren't actually asking for that--at least until they call f[data]sync, at
> > which point the client will block and wait for the second (commit) replies
> > to arrive.
> >
> > In practice, this doesn't actually help: everybody runs the OSDs on XFS,
> > and FileStore has to do write-ahead journaling, which means that these
> > operations are committed and durable before they are readable, and the
> > clients only ever see 'commit', never 'ack' followed by 'commit'
> > (commit implies ack).  If you run the OSD on btrfs, you might see
> > some acks (journaling and writes to the file system/page cache are done in
> > parallel).
> >
> > With newstore/bluestore, the current implementation doesn't make any
> > attempt to make a write readable before it is durable.  This could be
> > changed, but... I'm not sure it's worth it.  For most workloads (RBD, RGW)
> > we only care about making writes durable as quickly as possible, and all
> > of the OSD optimization efforts are focusing on making this as fast as
> > possible.  Is there much marginal benefit to the initial ack?  It's only
> > the CephFS clients with the same file open from multiple hosts that might
> > see a small benefit.
> >
> > On the other hand, there is a bunch of code in the OSD and on the
> > client side to deal with the dual-ack behavior that we could potentially drop.
> > Also, we are generating lots of extra messages on the backend network for
> > both ack and commit, even though the librados users don't usually care
> > (the clients don't get the dual acks unless they request them).
> >
> > Also, it is arguably Wrong and a Bad Thing that you could have client A
> > write to file F, client B read from file F and see that data, the OSD
> > and client A both fail, and then, after recovery, client B re-read the
> > same portion of the file and see the file content from before client A's
> > change.  The MDS is extremely careful about this on the metadata side: no
> > side-effects of one client are visible to any other client until they are
> > durable, so that a combined MDS and client failure will never make
> > things appear to go back in time.
> >
> > Any opinions here?  My inclination is to remove the functionality (less
> > code, less complexity, more sane semantics), but we'd be closing the door
> > on what might have been a half-decent idea (separating serialization from
> > durability when multiple clients have the same file open for
> > read/write)...
> 
> I've considered this briefly in the past, but I'd really rather we keep it:
> 
> 1) While we don't make much use of it right now, I think it's a useful
> feature for raw RADOS users.
> 
> 2) It's an incredibly useful protocol semantic for future performance
> work of the sort that makes Sam start to cry, but which I find very
> interesting. Consider a future when RBD treats the OSDs more like a
> disk with a cache, and is able to send out operations to get them out
> of local memory, without forcing them instantly to permanent storage.
> Similarly, I think as soon as we have a backend that lets us use a bit
> of 3D XPoint in a system, we'll wish we had this functionality
> again.
> (Likewise with CephFS, where many, many users will have different
> consistency expectations thanks to NFS and other parallel FSes which
> really aren't consistent.)
> 
> Maybe we just think it's not worth it and we'd rather throw this out.
> But the only complexity this really lets us drop is the OSD replay
> stuff (which realistically I can't assess); the dual-ack handling is not
> hard to work with, and doesn't get changed much anyway. If we drop this
> now and decide to bring back any similar semantic in the future, I think
> that'll be a lot harder than simply carrying it, both in terms of
> banging it back into the code, and especially in terms of getting it
> deployed to all the clients in the world again.
> -Greg

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


