On Thu, Nov 07, 2013 at 11:14:02AM -0500, Jeff Moyer wrote:
> Chris Mason <chris.mason@xxxxxxxxxxxx> writes:
>
> >> Well, we have control over dm and md, so I'm not worried about that.
> >> For the storage vendors, we'll have to see about influencing the
> >> standards bodies.
> >>
> >> The way I see it, there are 3 pieces of information that are required:
> >> 1) minimum size that is atomic (likely the physical block size, but
> >>    maybe the logical block size?)
> >> 2) maximum size that is atomic (multiple of minimum size)
> >> 3) whether or not discontiguous ranges are supported
> >>
> >> Did I miss anything?
> >
> > It'll vary from vendor to vendor.  A discontig range of two 512KB areas
> > is different from 256 discontig 4KB areas.
>
> Sure.
>
> > And it's completely dependent on filesystem fragmentation.  So, a given
> > IO might pass for one file and fail for the next.
>
> Worse, it could pass for one region of a file and fail for a different
> region of the same file.
>
> I guess you could export the most conservative estimate, based on
> completely non-contiguous smallest sized segments.  Things larger may
> work, but they may not.  Perhaps this would be too limiting, I don't
> know.
>
> > In a DM/MD configuration, an atomic IO inside a single stripe on raid0
> > could succeed while it will fail if it spans two stripes to two
> > different devices.
>
> I'd say that if you are spanning multiple devices, you don't support
> O_ATOMIC.  You could write a specific dm target that allows it, but I
> don't think it's a priority to support it in the way your example does.

I would have thought this would be pretty simple to do - just journal
the atomic write so it can be recovered in full if there is a power
failure.

Indeed, what I'd really like to be able to do from a filesystem
perspective is to issue a group of related metadata IOs as a single
atomic write rather than marshalling them through a journal and then
issuing them as unrelated IO.
If we have a special dm-target underneath that can either issue it as an
atomic write (if the hardware supports it) or emulate it via a journal
to maintain multi-device atomicity requirements, then we end up with a
general atomic write solution that filesystems can then depend on.

Once we have guaranteed support for atomic writes, we can completely
remove journalling from filesystem transaction engines, as the
atomicity requirements can be met with atomic writes.  And then we can
optimise things like fsync() for atomic writes.  IOWs, generic support
for atomic writes will make a major difference to filesystem
algorithms.

Hence, from my perspective, at this early point in the development
lifecycle, having guaranteed atomic write support via emulation is far
more important than actually having hardware that supports it... :)

> Given that there are applications using your implementation, what did
> they determine was a sane way to do things?  Only access the block
> device?  Preallocate files?  Fallback to non-atomic writes + fsync?
> Something else?

Avoiding these problems is, IMO, another good reason for having
generic, transparent support for atomic writes built into the IO
stack....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx