Re: hole-punch vs fault

On Fri, Nov 29, 2013 at 06:11:55AM -0700, Matthew Wilcox wrote:
> On Fri, Nov 29, 2013 at 09:12:46AM +1100, Dave Chinner wrote:
> > Indeed, we've already discussed this at length - if we are going to
> > introduce a new method of synchronisation between mmap and
> > holepunch, then we cannot ignore all the other mmap vs IO
> > synchronisation problems we have that are caused by the same issue.
> > Hacking workarounds into the holepunch/mmap paths is not a robust or
> > maintainable solution.
> 
> This isn't hacking in a workaround.  This is a solution to the problem
> that assumes holepunches are rare, and it's OK to make the fault path retry
> if a holepunch happened. 

Workloads that use hole punches tend to make extensive use of them,
so I suspect that what is OK for you is going to be a fairly
significant problem for others...
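
For anyone following along, the scheme being described boils down to
a sequence-count retry loop. Here is a minimal, self-contained
userspace sketch of that pattern - all names are illustrative, this
is not the actual i_damage patch:

#include <stdatomic.h>
#include <stdio.h>

/*
 * Writers (the hole punch) bump the counter to odd on entry and back
 * to even on exit; readers (the fault path) snapshot it and redo
 * their work if it moved underneath them.
 */
static atomic_uint damage_seq;

static void punch_begin(void) { atomic_fetch_add(&damage_seq, 1); }
static void punch_end(void)   { atomic_fetch_add(&damage_seq, 1); }

static void fault_path(void)
{
	unsigned int seq;

	do {
		do {
			seq = atomic_load(&damage_seq);
		} while (seq & 1);	/* punch in flight: wait */

		/* ... look up the block and map the page here ... */

	} while (atomic_load(&damage_seq) != seq);	/* raced: redo */
}

int main(void)
{
	punch_begin();
	punch_end();
	fault_path();
	printf("fault done, seq = %u\n", atomic_load(&damage_seq));
	return 0;
}

Note that every punch forces every concurrent fault on the file to
redo its work - which is exactly why punch-heavy workloads will hurt.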

> I agree it needs to be documented and the exact
> details of what is protected by i_damaged_mutex need to be clear.
> 
> > 	a) it is an entire file lock and truncate/holepunch do not
> > 	   affect the entire scope of the file. i.e. there is no
> > 	   reason why holepunch cannot run concurrently with IO if
> > 	   they are to different ranges of a file.
> 
> Yes, if we care about the scalability of truncate and holepunch.  I really
> don't think that's an important workload.

See my comment above: workloads that use hole punching tend to use
it heavily, and workloads that combine preallocation and hole
punching typically care a great deal about IO scalability.
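
To make the range-locking point concrete, here is a toy userspace
interval lock - nothing like production quality, and not code from
any tree, just the shape of the thing:

#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>

struct held_range {
	off_t start, end;
	struct held_range *next;
};

struct range_lock {
	pthread_mutex_t lock;
	pthread_cond_t  wait;
	struct held_range *held;	/* a real one would use a tree */
};

struct range_lock file_rl = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.wait = PTHREAD_COND_INITIALIZER,
};

static bool overlaps(off_t s1, off_t e1, off_t s2, off_t e2)
{
	return s1 < e2 && s2 < e1;
}

/* Block until [start, start + len) overlaps no held range, then hold it. */
void range_lock_acquire(struct range_lock *rl, struct held_range *r,
			off_t start, off_t len)
{
	r->start = start;
	r->end = start + len;

	pthread_mutex_lock(&rl->lock);
again:
	for (struct held_range *h = rl->held; h; h = h->next) {
		if (overlaps(r->start, r->end, h->start, h->end)) {
			pthread_cond_wait(&rl->wait, &rl->lock);
			goto again;
		}
	}
	r->next = rl->held;
	rl->held = r;
	pthread_mutex_unlock(&rl->lock);
}

void range_lock_release(struct range_lock *rl, struct held_range *r)
{
	pthread_mutex_lock(&rl->lock);
	for (struct held_range **p = &rl->held; *p; p = &(*p)->next) {
		if (*p == r) {
			*p = r->next;
			break;
		}
	}
	pthread_cond_broadcast(&rl->wait);
	pthread_mutex_unlock(&rl->lock);
}

With something like that in the IO path, a punch on one range and
direct IO on another never serialise against each other; only true
overlaps wait.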

For example, consider sparse image files on loopback devices, where
the mounted filesystem issues discards that the loop device turns
into hole punches on the image file to keep space usage minimal. In
these workloads we most certainly care about the scalability of hole
punching w.r.t. other IO going to the same file (probably direct
IO). Right now a hole punch is effectively an unqueueable operation
- it causes a pipeline stall - just like TRIM on SATA....
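
The operation itself is trivial to generate from userspace; this is
what the loop driver does on the backing file when the filesystem on
top discards a region (filename and offsets here are arbitrary):

#define _GNU_SOURCE		/* FALLOC_FL_* */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("disk.img", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Punch a 64k hole at offset 1MB; KEEP_SIZE is mandatory with
	 * PUNCH_HOLE so the file size is unchanged.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      1024 * 1024, 64 * 1024) < 0)
		perror("fallocate");

	close(fd);
	return 0;
}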

> > 	b) it increases the struct inode by at least 5%. If we are
> > 	   going to increase the inode by that much, we better make
> > 	   sure we solve more than one isolated, relatively rare
> > 	   corner case behaviour.
> 
> About 7% (568 -> 608 bytes) with my .config (SMP, SPIN_ON_OWNER,
> !DEBUG_MUTEXES, !DEBUG_LOCK_ALLOC).  I rearranged the order (thanks,
> pahole!) to put i_damage in the hole created by private_lock.

Which is a huge price to pay for solving a problem that the majority
of users will never hit or care about. As I say to everyone who
thinks they can just add stuff to the struct inode without caring
about the size increase: filesystem developers will kill to save
4 bytes in the struct inode, so you'd better have a damn good reason
for adding 40+ bytes to the structure....
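
To put rough numbers on that (my arithmetic, an illustrative
machine):

	40 bytes/inode * 1,000,000 cached inodes = ~40MB of extra
	memory, paid on every inode for a race most of them will
	never see.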

> I'm solving two cases here, one is holepunch vs fault for regular
> files.  The other is truncate vs fault for XIP files, which is
> probably a little less rare.  I think the same mechanism can be
> used when a filesystem is doing a defragment or otherwise moving
> blocks around on an XIP block device.

Well, you kind of raised that as a secondary issue in passing, not
as the problem you actually need to solve.

> Look, I'm not attached to the i_damage solution,
> but I do need something to protect against truncate vs fault on XIP.
> Fixing holepunch vs fault too just felt like the right thing to do.

So, your real problem is that ext2-xip has a truncate vs fault race
because it has no page lock to serialise page faults against
truncate. In fact, it appears to bypass the page cache completely
for both reads and writes (writes go directly to block device
memory), so it would seem that all the assumptions we've made about
serialisation on page locks throughout every filesystem are not
valid for XIP?

That seems like an XIP design fault, yes?  Why should we be
hacking crap into the struct inode to fix an XIP design flaw?

Indeed, XIP could make use of the structures we already have in the
struct inode/address_space to behave more like a normal filesystem.
We already use the mapping tree to store per-page information that
is used to serialise truncate vs page faults, so why not make XIP
do exactly the same thing?
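
For reference, the serialisation pattern normal filesystems get for
free looks roughly like this - paraphrased from mm/filemap.c and
mm/truncate.c, not a literal quote:

	/*
	 * Fault side (filemap_fault): lock the page, then recheck
	 * that truncate didn't remove it from the mapping while we
	 * waited for the lock.
	 */
	lock_page(page);
	if (page->mapping != mapping) {
		unlock_page(page);
		goto retry_find;	/* look it up in the tree again */
	}

	/*
	 * Truncate side: truncate_inode_pages() locks each page
	 * before clearing page->mapping, so it cannot pull a page
	 * out from under a fault that holds the page lock.
	 */

That is the mechanism XIP needs an equivalent of.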

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx