Re: SEEK_DATA/SEEK_HOLE support

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 5 Oct 2011 20:36:15 +1100

On Wed, Oct 05, 2011 at 09:34:26AM +0200, Michael Monnerie wrote:
> On Mittwoch, 5. Oktober 2011 Dave Chinner wrote:
> > That will only work if you can prevent concurrent unwritten extent
> > conversion from happening while we do the separate tag lookups on
> > the range because it requires two radix tree tag lookups rather than
> > just one index lookup. i.e. miss the dirty page because it went
> > dirty->writeback during the dirty tag search, and miss the same page
> > when doing the writeback lookup because it went writeback->clean
> > very quickly due to IO completion.
> > 
> > So to stop that from happening, it requires that filesystems can
> > exclude unwritten extent conversion from happening while a
> > SEEK_HOLE/SEEK_DATA operation is in progress, and that the
> > filesystem can safely do mapping tree lookups while providing that
> > extent tree exclusion.  I know that XFS has no problems here, but
> > filesystems that use i_mutex for everything might be in trouble.
> > 
> > Besides, if we just look for pages in the cache over unwritten
> > extents (i.e. someone has treated it as data already), then it can
> > be done locklessly without having to worry about page state changes
> > occurring concurrently...
> 
> I'd like to understand why it's important to care about locking here. As 
> I understand it, SEEK_* is used for example to copy a file efficiently. 
> If that is performed on a file that is currently being written to, the 
> resulting copy will probably be bogus anyway, even without SEEK_* usage.

The problem is that copying a file with dirty data hot in the cache
concurrently when writeback to that file is occurring. The
application is not modifying the file, and the state changes that
could lead SEEK_DATA missing data are driven entirely by background
kernel threads. The application leaves a file in this state:

                     dirty page
              +------D--------+
               unwritten extent

SEEK_DATA must stop at that dirty page over the unwritten extent.
If we do a dirty page lookup, but the flusher thread has started IO
on it, it will have transitioned to this state:

                     writeback page
              +------W--------+
               unwritten extent

And the dirty page lookup will not find it. If we allow unwritten
extent conversion to also run while we are doing the writeback page
cache lookup, then IO completion and conversion can occur before we
find the writeback page in the cache. i.e we see this state:

                     clean page
              +------C--------+
               unwritten extent

And we don't find the data page over the unwritten extent at all
because we haven't looked up clean pages in the cache.

At that point, our SEEK_DATA call has skipped over a valid data
page, the copy utility fails to copy the data over the unwritten
extent, thereby silently creating a corrupt copy....

> There might be a case where it is important, but I can't see that atm. 

It's a data corruption problem, pure and simple....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs