On Thu, Dec 3, 2015 at 5:39 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Thu, Dec 3, 2015 at 4:54 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> From the beginning Ceph has had two kinds of acks for rados
>> write/update operations: ack (indicating the operation is accepted,
>> serialized, and staged in the osd's buffer cache) and commit
>> (indicating the write is durable). The client, if it saw a failure on
>> the OSD before getting the commit, would have to resend the write to
>> the new or recovered OSD, attaching the version it got with the
>> original ack so that the new/recovered OSD could preserve the order.
>> (This is what the replay_window pool property is for.)
>>
>> The only user of this functionality is CephFS, and only when a file is
>> opened for write or read/write by multiple clients. When that happens,
>> the clients do all read/write operations synchronously with the OSDs,
>> and things are effectively ordered there, at the object. The idea
>> is/was to avoid penalizing this sort of IO by waiting for commits when
>> the clients aren't actually asking for that--at least until they call
>> f[data]sync, at which point the client will block and wait for the
>> second (commit) replies to arrive.
>>
>> In practice, this doesn't actually help: everybody runs the OSDs on
>> XFS, and FileStore has to do write-ahead journaling, which means that
>> these operations are committed and durable before they are readable,
>> and the clients only ever see 'commit' and not 'ack' followed by
>> 'commit' (commit implies ack too). If you run the OSD on btrfs, you
>> might see some acks (journaling and writes to the file system/page
>> cache are done in parallel).
>>
>> With newstore/bluestore, the current implementation doesn't make any
>> attempt to make a write readable before it is durable. This could be
>> changed, but... I'm not sure it's worth it. For most workloads (RBD,
>> RGW) we only care about making writes durable as quickly as possible,
>> and all of the OSD optimization efforts are focused on making that as
>> fast as possible. Is there much marginal benefit to the initial ack?
>> It's only the CephFS clients with the same file open from multiple
>> hosts that might see a small benefit.
>>
>> On the other hand, there is a bunch of code in the OSD and on the
>> client side to deal with the dual-ack behavior that we could
>> potentially drop. Also, we are generating lots of extra messages on
>> the backend network for both ack and commit, even though librados
>> users don't usually care (the clients don't get the dual acks unless
>> they request them).
>>
>> Also, it is arguably Wrong and a Bad Thing that the following can
>> happen: client A writes to file F, client B reads from file F and sees
>> that data, the OSD and client A both fail, and when things recover
>> client B re-reads the same portion of the file and sees the content
>> from before client A's change. The MDS is extremely careful about this
>> on the metadata side: no side effects of one client are visible to any
>> other client until they are durable, so that a combined MDS and client
>> failure will never make things appear to go back in time.
>>
>> Any opinions here? My inclination is to remove the functionality (less
>> code, less complexity, more sane semantics), but we'd be closing the
>> door on what might have been a half-decent idea (separating
>> serialization from durability when multiple clients have the same file
>> open for read/write)...
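(For anyone following along who doesn't touch librados often: below is
roughly how the dual ack surfaces in the C AIO API -- a minimal sketch,
with the pool name, object name, and callback names just placeholders,
and hedged to the extent that I'm remembering the signatures correctly.
The 'complete' callback corresponds to ack, i.e. serialized and
readable, and the 'safe' callback to commit, i.e. durable; as Sage
says, on an XFS/FileStore OSD both effectively arrive together. The
replay_window he mentions is the per-pool knob, set with something like
"ceph osd pool set <pool> replay_window <seconds>", that bounds how
long the OSD will accept replays of acked-but-uncommitted ops after a
failure.)

#include <stdio.h>
#include <rados/librados.h>

/* 'complete' == ack: accepted/serialized by the primary OSD
 * (readable, but not necessarily durable yet). */
static void on_ack(rados_completion_t c, void *arg)
{
    printf("write acked\n");
}

/* 'safe' == commit: the write is durable on the acting set. */
static void on_commit(rados_completion_t c, void *arg)
{
    printf("write committed\n");
}

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t comp;
    const char buf[] = "hello";

    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0)
        return 1;
    if (rados_ioctx_create(cluster, "rbd", &io) < 0)  /* placeholder pool */
        return 1;

    /* One completion carries both callbacks: ack first, then commit
     * (on FileStore/XFS the two are delivered together). */
    rados_aio_create_completion(NULL, on_ack, on_commit, &comp);
    rados_aio_write(io, "someobject", comp, buf, sizeof(buf), 0);

    rados_aio_wait_for_complete(comp);  /* block until ack */
    rados_aio_wait_for_safe(comp);      /* block until commit (durable) */

    rados_aio_release(comp);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}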
>
> I've considered this briefly in the past, but I'd really rather we keep it:
>
> 1) While we don't make much use of it right now, I think it's a useful
> feature for raw RADOS users

FWIW, I completely agree. The fact that we hardly use it for
cephfs/rgw/rbd does not mean that there aren't any other users for it.
When we consider breaking the framework, we should weigh whether we are
adding to it or detracting from it. I think that in this case it would
(at least seemingly) be the latter.

Yehuda

> 2) It's an incredibly useful protocol semantic for future performance
> work of the sort that makes Sam start to cry, but which I find very
> interesting. Consider a future where RBD treats the OSDs more like a
> disk with a cache, and is able to send out operations to get them out
> of local memory without forcing them instantly to permanent storage.
> Similarly, I think as soon as we have a backend that lets us use a bit
> of 3D Crosspoint in a system, we'll wish we had this functionality
> again.
> (Likewise with CephFS, where many, many users will have different
> consistency expectations thanks to NFS and other parallel FSes which
> really aren't consistent.)
>
> Maybe we just think it's not worth it and we'd rather throw this out.
> But the only complexity this really lets us drop is the OSD replay
> stuff (which realistically I can't assess) -- the dual acks are not
> hard to work with, and don't get changed much anyway. If we drop this
> now and decide to bring back any similar semantic in the future, I
> think that'll be a lot harder than simply carrying it, both in terms
> of banging it back into the code and especially in terms of getting it
> deployed to all the clients in the world again.
> -Greg