From the beginning Ceph has had two kinds of acks for rados write/update operations: ack (indicating the operation is accepted, serialized, and staged in the OSD's buffer cache) and commit (indicating the write is durable). A client that saw an OSD failure before getting the commit would have to resend the write to the new or recovered OSD, attaching the version it got with the original ack so that the new/recovered OSD could preserve the ordering. (This is what the replay_window pool property is for.)

The only user of this functionality is CephFS, and only when a file is opened for write or read/write by multiple clients. When that happens, the clients do all read/write operations synchronously with the OSDs, and things are effectively ordered there, at the object. The idea is/was to avoid penalizing this sort of IO by waiting for commits when the clients aren't actually asking for that--at least until they call f[data]sync, at which point the client blocks and waits for the commit replies to arrive.

In practice, this doesn't actually help: everybody runs the OSDs on XFS, and FileStore has to do write-ahead journaling, which means these operations are committed and durable before they are readable, so the clients only ever see 'commit', never 'ack' followed by 'commit' (a commit implies the ack). If you run the OSD on btrfs, you might see some acks, since journaling and writes to the file system/page cache happen in parallel. With newstore/bluestore, the current implementation doesn't make any attempt to make a write readable before it is durable.

This could be changed, but... I'm not sure it's worth it. For most workloads (RBD, RGW) we only care about making writes durable as quickly as possible, and all of the OSD optimization efforts are focused on exactly that. Is there much marginal benefit to the initial ack? Only CephFS clients with the same file open from multiple hosts might see a small one.

On the other hand, there is a bunch of code in the OSD and on the client side to deal with the dual-ack behavior that we could drop. We are also generating lots of extra messages on the backend network for both ack and commit, even though librados users don't usually care (clients don't get the dual acks unless they request them).

Also, it is arguably Wrong and a Bad Thing that client A could write to file F, client B could read from F and see that data, the OSD and client A could both fail, and after recovery client B could re-read the same portion of the file and see the content from before client A's change. The MDS is extremely careful about this on the metadata side: no side effects of one client are visible to any other client until they are durable, so a combined MDS and client failure will never make things appear to go back in time.

Any opinions here? My inclination is to remove the functionality (less code, less complexity, saner semantics), but we'd be closing the door on what might have been a half-decent idea (separating serialization from durability when multiple clients have the same file open for read/write)...

sage
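
P.S. For concreteness, here is roughly where the dual acks surface in the librados C API today. A minimal sketch only: the pool and object names are made up, and error handling is mostly elided.

#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        rados_t cluster;
        rados_ioctx_t io;
        rados_completion_t comp;
        const char *buf = "hello";

        /* connect as client.admin via the default ceph.conf search path */
        rados_create(&cluster, NULL);
        rados_conf_read_file(cluster, NULL);
        rados_connect(cluster);
        rados_ioctx_create(cluster, "rbd", &io);  /* any existing pool */

        /* NULL complete/safe callbacks; we block on the completion instead */
        rados_aio_create_completion(NULL, NULL, NULL, &comp);
        rados_aio_write(io, "foo", comp, buf, strlen(buf), 0);

        /* 'ack': the write is accepted, serialized, and readable */
        rados_aio_wait_for_complete(comp);

        /* 'commit': the write is durable on all replicas */
        rados_aio_wait_for_safe(comp);

        rados_aio_release(comp);
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
}

On FileStore+XFS the two waits return together; only btrfs ever made wait_for_complete return meaningfully earlier than wait_for_safe. Dropping the functionality would collapse this to a single completion event, which is what nearly every caller wants anyway.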