Re: ack vs commit

On Thu, Dec 3, 2015 at 5:39 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Thu, Dec 3, 2015 at 4:54 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> From the beginning Ceph has had two kinds of acks for rados write/update
>> operations: ack (indicating the operation is accepted, serialized, and
>> staged in the osd's buffer cache) and commit (indicating the write is
>> durable).  The client, if it saw a failure on the OSD before getting the
>> commit, would have to resend the write to the new or recovered OSD,
>> attaching the version it got with the original ack so that the
>> new/recovered OSD could preserve the order.  (This is what the
>> replay_window pool property is for.)
>>
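
For concreteness, here's a minimal sketch of what the dual-ack
interface looks like from the librados C++ API. The pool and object
names are placeholders and the callbacks are invented, but the
aio_create_completion() variant that takes separate complete (ack)
and safe (commit) callbacks is the relevant piece:

  #include <rados/librados.hpp>
  #include <cstdio>

  // 'complete' fires on ack (accepted and serialized by the OSD);
  // 'safe' fires on commit (durable).
  static void on_ack(librados::completion_t, void *) {
    std::printf("ack: write is serialized and ordered\n");
  }
  static void on_commit(librados::completion_t, void *) {
    std::printf("commit: write is durable\n");
  }

  int main() {
    librados::Rados cluster;
    cluster.init(NULL);            // default client.admin identity
    cluster.conf_read_file(NULL);  // default ceph.conf search path
    if (cluster.connect() < 0)
      return 1;

    librados::IoCtx io;
    cluster.ioctx_create("mypool", io);   // "mypool" is a stand-in

    librados::bufferlist bl;
    bl.append("hello");
    librados::AioCompletion *c =
        cluster.aio_create_completion(NULL, on_ack, on_commit);
    io.aio_write("myobject", c, bl, bl.length(), 0);

    c->wait_for_safe();   // block until the commit reply arrives
    c->release();
    cluster.shutdown();
    return 0;
  }
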
>> The only user for this functionality is CephFS, and only when a file is
>> opened for write or read/write by multiple clients.  When that happens,
>> the clients do all read/write operations synchronously with the OSDs, and
>> effectively things are ordered there, at the object.  The idea is/was to
>> avoid penalizing this sort of IO by waiting for commits when the clients
>> aren't actually asking for that--at least until they call f[data]sync, at
>> which point the client will block and wait for the second commit replies
>> to arrive.
>>
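
And a rough sketch of the pattern described above -- not the actual
CephFS client code, just an illustration with the same placeholder
names: ride on the acks for ordering, and only settle for commits
when the application calls f[data]sync:

  #include <rados/librados.hpp>
  #include <vector>

  // Sketch only: proceed as soon as each write is acked (ordered),
  // deferring durability until an explicit "fsync" step.
  void write_then_fsync(librados::Rados &cluster, librados::IoCtx &io) {
    std::vector<librados::AioCompletion*> inflight;

    for (int i = 0; i < 4; i++) {
      librados::bufferlist bl;
      bl.append("chunk");
      librados::AioCompletion *c = cluster.aio_create_completion();
      io.aio_write("myobject", c, bl, bl.length(), i * bl.length());
      c->wait_for_complete();  // ack: serialized, ordered at the object
      inflight.push_back(c);   // the commit may still be outstanding
    }

    // The f[data]sync path: now block for the commit replies.
    for (auto *c : inflight) {
      c->wait_for_safe();
      c->release();
    }
  }
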
>> In practice, this doesn't actually help: everybody runs the OSDs on XFS,
>> and FileStore has to do write-ahead journaling, which means that these
>> operations are committed and durable before they are readable, and the
>> clients only ever see 'commit', not 'ack' followed by 'commit'
>> (commit implies ack too).  If you run the OSD on btrfs, you might see
>> some acks (journaling and writes to the file system/page cache are done in
>> parallel).
>>
>> With newstore/bluestore, the current implementation doesn't make any
>> attempt to make a write readable before it is durable.  This could be
>> changed, but.. I'm not sure it's worth it.  For most workloads (RBD, RGW)
>> we only care about making writes durable as quickly as possible, and all
>> of the OSD optimization efforts are focusing on making this as fast as
>> possible.  Is there much marginal benefit to the initial ack?  It's only
>> the CephFS clients with the same file open from multiple hosts that might
>> see a small benefit.
>>
>> On the other hand, there is a bunch of code in the OSD and on the
>> client side to deal with the dual-ack behavior we could potentially drop.
>> Also, we are generating lots of extra messages on the backend network for
>> both ack and commit, even though the librados users don't usually care
>> (the clients don't get the dual acks unless they request them).
>>
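
For comparison, the common librados pattern looks more like this
sketch (again with placeholder names): no ack callback is registered,
so the application only ever sees one completion.

  #include <rados/librados.hpp>

  // Sketch of the usual case: no ack callback, one wait, and the
  // application sees a single completion once the write is durable.
  void typical_write(librados::Rados &cluster, librados::IoCtx &io,
                     librados::bufferlist &bl) {
    librados::AioCompletion *c = cluster.aio_create_completion();
    io.aio_write("myobject", c, bl, bl.length(), 0);
    c->wait_for_safe();
    c->release();
  }
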
>> Also, it is arguably Wrong and a Bad Thing that you could have client A
>> write to file F, client B reads from file F and sees that data, the OSD
>> and client A both fail, and when things recover client B re-reads the same
>> portion of the file and sees the file content from before client A's
>> change.  The MDS is extremely careful about this on the metadata side: no
>> side-effects of one client are visible to any other client until they are
>> durable, so that a combination MDS and client failure will never make
>> things appear to go back in time.
>>
>> Any opinions here?  My inclination is to remove the functionality (less
>> code, less complexity, more sane semantics), but we'd be closing the door
>> on what might have been a half-decent idea (separating serialization from
>> durability when multiple clients have the same file open for
>> read/write)...
>
> I've considered this briefly in the past, but I'd really rather we keep it:
>
> 1) While we don't make much use of it right now, I think it's a useful
> feature for raw RADOS users.
>

FWIW, I completely agree. The fact that we hardly use it for
cephfs/rgw/rbd does not mean that there aren't other users for it.
When considering a change that breaks the framework, we should weigh
whether it adds to it or detracts from it; I think that in this case
it would (at least seemingly) be the latter.

Yehuda

> 2) It's an incredibly useful protocol semantic for future performance
> work of the sort that makes Sam start to cry, but which I find very
> interesting. Consider a future when RBD treats the OSDs more like a
> disk with a cache, and is able to send out operations to get them out
> of local memory, without forcing them instantly to permanent storage.
> Similarly, I think as soon as we have a backend that lets us use a bit
> of 3D Crosspoint in a system, we'll wish we had this functionality
> again.
> (Likewise with CephFS, where many many users will have different
> consistency expectations thanks to NFS and other parallel FSes which
> really aren't consistent.)
>
> Maybe we just think it's not worth it and we'd rather throw this out.
> But the only complexity this really lets us drop is the OSD replay
> stuff (which realistically I can't assess) — the dual acks are not hard
> to work with, and don't get changed much anyway. If we drop this now
> and decide to bring back any similar semantic in the future I think
> that'll be a lot harder than simply carrying it, both in terms of
> banging it back into the code, and especially in terms of getting it
> deployed to all the clients in the world again.
> -Greg