On Thu, Nov 07, 2013 at 11:14:02AM -0500, Jeff Moyer wrote:
> Chris Mason <chris.mason@xxxxxxxxxxxx> writes:
>
> >> Well, we have control over dm and md, so I'm not worried about that.
> >> For the storage vendors, we'll have to see about influencing the
> >> standards bodies.
> >>
> >> The way I see it, there are 3 pieces of information that are required:
> >> 1) minimum size that is atomic (likely the physical block size, but
> >>    maybe the logical block size?)
> >> 2) maximum size that is atomic (multiple of minimum size)
> >> 3) whether or not discontiguous ranges are supported
> >>
> >> Did I miss anything?
> >
> > It'll vary from vendor to vendor.  A discontig range of two 512KB areas
> > is different from 256 discontig 4KB areas.
>
> Sure.
>
> > And it's completely dependent on filesystem fragmentation.  So, a given
> > IO might pass for one file and fail for the next.
>
> Worse, it could pass for one region of a file and fail for a different
> region of the same file.
>
> I guess you could export the most conservative estimate, based on
> completely non-contiguous smallest sized segments.  Things larger may
> work, but they may not.  Perhaps this would be too limiting, I don't
> know.
>
> > In a DM/MD configuration, an atomic IO inside a single stripe on raid0
> > could succeed while it will fail if it spans two stripes to two
> > different devices.
>
> I'd say that if you are spanning multiple devices, you don't support
> O_ATOMIC.  You could write a specific dm target that allows it, but I
> don't think it's a priority to support it in the way your example does.

I would have thought this would be pretty simple to do - just journal
the atomic write so it can be recovered in full if there is a power
failure.

Indeed, what I'd really like to be able to do from a filesystem
perspective is to issue a group of related metadata IOs as a single
atomic write rather than marshalling them through a journal and then
issuing them as unrelated IO.
If we have a special dm-target underneath that can either issue it as an
atomic write (if the hardware supports it) or emulate it via a journal
to maintain multi-device atomicity requirements, then we end up with a
general atomic write solution that filesystems can then depend on.

Once we have guaranteed support for atomic writes, we can completely
remove journalling from filesystem transaction engines, as the
atomicity requirements can be met with atomic writes.  And then we can
optimise things like fsync() for atomic writes.  IOWs, generic support
for atomic writes will make a major difference to filesystem
algorithms.

Hence, from my perspective, at this early point in the development
lifecycle, having guaranteed atomic write support via emulation is far
more important than actually having hardware that supports it... :)

> Given that there are applications using your implementation, what did
> they determine was a sane way to do things?  Only access the block
> device?  Preallocate files?  Fallback to non-atomic writes + fsync?
> Something else?

Avoiding these problems is, IMO, another good reason for having
generic, transparent support for atomic writes built into the IO
stack....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx