> I don't understand the Jeff snippet above - if they are
> non-overlapping writes to different offsets, this would never happen.

The question is not whether it *would* happen, but whether it would be *allowed* to happen, and my point is that POSIX is often a poor guide. Sometimes it's unreasonably strict, sometimes it's very lax.

That said, my example was kind of bad, because it doesn't actually work unless issues of durability are brought in. Let's say that there's a crash between the writes and the reads. (It's not even clear when POSIX would consider a distributed system to have crashed; let's just say *everything* dies.) While the strict write requirements apply to the non-durable state before it's flushed, and thus affect what gets flushed when writes overlap, it's entirely permissible for non-overlapping writes to be flushed out of order. That's even quite likely if the writes are on different file descriptors.

http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html

> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily
> on the conformance document to tell the user what can be expected from
> the system. It is explicitly intended that a null implementation is
> permitted.

That's my absolute favorite part of POSIX, by the way. It amounts to "do whatever you want" in standards language.

What this really means is that, when the system comes back up, the results of the second write could be available even though the first was lost. I'm not saying it happens. I'm not saying it's good or useful behavior. I'm just saying the standard permits it.

> If they are the same offset and at the same time, then you can have
> undefined results where you might get fragments of A and fragments of
> B (where you might be able to see some odd things if the write spans
> pages/blocks).

This is where POSIX goes the other way and *over*specifies behavior.
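For what it's worth, the sequential version of that rule is easy to see. Here's a minimal sketch in Python (using os.pwrite/os.pread as stand-ins for the raw syscalls; nothing here is Gluster-specific, and a truly concurrent race can't be demonstrated deterministically):

```python
import os
import tempfile

# Sketch of the POSIX overlapping-write visibility rule in the simple
# sequential case: the "after a write" wording requires that the write
# completing last win for the overlapped bytes. Serializing the two
# writes makes that outcome deterministic and visible.
fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"AAAA", 0)   # first write covers offsets 0-3
    os.pwrite(fd, b"BB", 2)     # second write overlaps offsets 2-3
    data = os.pread(fd, 4, 0)   # b"AABB": the later write wins on overlap
finally:
    os.close(fd)
    os.unlink(path)
```

With two concurrent writers the completion order is whatever it is, but per that reading of the standard the overlapped bytes still have to come wholly from whichever write completed last, not interleaved fragments.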
Normal linearizability requires that an action appear to be atomic at *some* point between issuance and completion. However, the POSIX "after a write" wording forces this to be at the exact moment of completion. It's not undefined: if two writes overlap in both space and time, the one that completes last *must* win. Those "odd things" you mention might be considered non-conformance with the standard.

Fortunately, Linux is not POSIX. Linus and others have been quite clear on that. As much as I've talked about formal standards here, "what you can get away with" is the real standard. The page-cache behavior that local filesystems rely on is IMO a poor guide, because extending that behavior across physical systems is difficult to do completely and impossible to do without impacting performance. What matters is whether users will accept this kind of reordering. Here's what I think:

(1) An expectation of ordering is only valid if the order is completely
    unambiguous.

(2) This can only be the case if there was some coordination between when
    the first write completes and when the second is issued.

(3) The coordinating entities could be on different machines, in which
    case the potential for reordering is unavoidable (short of us adding
    write-behind serialization across all clients).

(4) If it's unavoidable in the distributed case, there's not much value
    in trying to make it airtight in the local case.

In other words, standards aside, I'm kind of with Raghavendra on this. We shouldn't add this much complexity and possibly degrade performance unless we can provide a *meaningful guarantee* to users, and this area is already such a swamp that any user relying on particular behavior is likely to get themselves in trouble no matter what we do.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel