I was present for a discussion about allowing EC overwrites and thought it would be good to summarize it for the list:

Commit protocol:
1) client sends write to primary
2) primary reads in the partial stripes needed for partial-stripe overwrites from the replicas
3) primary sends prepares to the participating replicas and queues its own prepare locally
4) once all prepares are complete, primary sends a commit to the client
5) primary sends applies to all participating replicas

When we get the prepare, we write out a temp object with the data to be written. On apply, we use an objectstore primitive to atomically move those extents into the actual object. The log entry contains the name/id of the temp object so it can be applied on apply or removed on rollback.

Each log entry contains a list of the shard ids modified. During peering, we use the same protocol for choosing the authoritative log as for the existing EC pool, except that we first take the longest candidate log and use it to extend shorter logs until they hit an entry they should have witnessed, but didn't.

Implicit in the above scheme is the fact that if an object is written but a particular shard isn't changed, the OSD with that shard will have a copy of the object with the correct data, but an out-of-date object_info (notably, the version will be old). To avoid this messing up the missing set during the log scan on OSD start, we'll skip log entries we wouldn't have participated in (we may not even choose to store them, see below).

This does pose a challenge for maintaining prior_version, but it shouldn't be much of a problem: rollbacks can only happen on prepared log entries which haven't been applied, so log merging can never result in a divergent entry causing a missing object. I think we can get by without it then?

We can go further with the above and actually never persist a log entry on a shard which did not participate in it.
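The log-extension step during peering can be sketched as follows. This is a minimal illustration, not the actual Ceph implementation: entry layout (`version`, `shards`) and the function name are assumptions made for the example. A shorter log is extended with newer entries from the longest candidate log, skipping nothing until we reach an entry the shard participated in but did not record, which is where extension must stop.

```python
# Hypothetical sketch of extending a shard's log during peering.
# Assumes each entry is (version, shards) and logs are ordered oldest -> newest.
from collections import namedtuple

LogEntry = namedtuple("LogEntry", ["version", "shards"])

def extend_log(short_log, longest_log, shard):
    """Append entries from longest_log that this shard could not have
    witnessed (shard not in entry.shards), stopping at the first entry
    it should have witnessed but did not record."""
    head = short_log[-1].version if short_log else 0
    extended = list(short_log)
    for entry in longest_log:
        if entry.version <= head:
            continue  # already covered by the short log
        if shard in entry.shards:
            break     # shard should have witnessed this entry but didn't
        extended.append(entry)
    return extended
```

For example, if shard 1's log ends at version 1 while the longest log has versions 1..3, the shard's log picks up version 2 (which it did not participate in) but stops at version 3 (which it should have witnessed).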
As long as peering works correctly, the union of the logs we got must have all entries. The primary will need to keep a complete copy in memory to manage dup op detection and recovery, however.

Step 2) above can also be done much more efficiently. Some codes will allow the parity chunks to be reconstructed more efficiently, so some thought will have to go into restructuring the plugin interface to allow a more efficient implementation.

-Sam
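For a simple XOR-based parity code, the kind of shortcut the plugin interface would need to expose can be sketched as a delta update: instead of reading the whole stripe to recompute parity, the overwritten chunk's old and new contents suffice. This is an illustrative sketch for plain XOR parity only; the function names are assumptions, and real erasure codes need the equivalent Galois-field arithmetic.

```python
# Sketch of a parity delta update for a simple XOR parity code.
# P' = P xor D_old xor D_new, so a partial overwrite does not require
# reading the rest of the stripe.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def delta_update_parity(old_parity, old_chunk, new_chunk):
    # remove the old chunk's contribution, add the new chunk's
    return xor_bytes(xor_bytes(old_parity, old_chunk), new_chunk)
```

The delta result matches a full recompute of the parity over the updated stripe, but touches only the parity chunk and the chunk being overwritten.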