On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:
[...]
> In particular, the page locking is often used for just verifying
> simple things, with the most common example being "lock page, check
> that the mapping is still valid, insert page into page tables, unlock
> page".
>
> The reason the apache benchmark regresses is that it basically does a
> web server test with a single file ("test.html") that gets served by
> just mmap'ing it, and sending it out that way. Using lots of threads,
> and using lots of different mappings. So they *all* fault on the read
> of that page, and they *all* do that "lock page, check that the
> mapping is valid, insert page" dance.

Hmmmm. So this is typically a truncate race check, but it isn't
sufficient to protect the fault against all page invalidation races,
as the page can be re-inserted into the same mapping at a different
page->index that is now within EOF. Hence filesystems that support
hole punching have to serialise the ->fault path against the page
invalidations done in ->fallocate operations, because otherwise we get
data corruption from the mm truncate checks failing to detect
invalidated pages within EOF correctly.

i.e. truncate/hole punch is a multi-object modification operation,
with the typical atomicity boundary of the operation defined by the
inode_lock() and/or the filesystem transaction that makes the
modification. IOWs, page_lock() based truncation/invalidation checks
aren't atomic w.r.t. the other objects being modified in the same
operation. Truncate avoids this by ordering the file size update vs
the page cache invalidation, but no such ordering protection can be
provided for ->fallocate() operations that directly manipulate the
metadata of user data in the file.

> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
>
> (a) just revert
> (b) add some busy-spinning
> (c) reader-writer page lock
> (d) try to de-emphasize the page lock

....

> Option (d) is "we already have a locking in many filesystems that give
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
>
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.

*nod*

Given that:

1) we have been doing (d) for 5 years (see commit 653c60b633a ("xfs:
   introduce mmap/truncate lock")),

2) ext4 also has this same functionality,

3) DAX requires the ability for filesystems to exclude page faults,

4) it is a widely deployed and tested solution, and

5) filesystems will still need to be able to exclude page faults over
   a file range while they directly manipulate file metadata to change
   the user data in the file,

(d) looks like the direction we should keep heading in.

> So I think option (c) is potentially technically better because it has
> smaller locking granularity, but in practice (d) might be easier and
> we already effectively do it for several filesystems.

Right. Even if we go for (c), AFAICT we still need (d), largely
because of reason (5) above.
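For anyone who hasn't looked at what (d) means in practice, the shape
of it is roughly the sketch below. This is a simplified illustration
only - loosely modelled on the pattern in fs/xfs/xfs_file.c, with DAX,
write faults, error handling and the actual extent manipulation all
elided, and with made-up function names - so don't read it as the real
upstream code:

        /*
         * Fault path: take the per-inode mmap lock shared around the
         * generic fault code. Read faults on the same inode still run
         * concurrently because the lock is only ever taken shared here.
         */
        static vm_fault_t
        example_filemap_fault(struct vm_fault *vmf)
        {
                struct xfs_inode *ip = XFS_I(file_inode(vmf->vma->vm_file));
                vm_fault_t       ret;

                xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
                ret = filemap_fault(vmf);   /* page lock still taken in here */
                xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

                return ret;
        }

        /*
         * Hole punch side: take the same lock exclusive, so no new
         * faults can instantiate pages over the range while the page
         * cache is invalidated and the file metadata is modified.
         */
        static int
        example_punch_hole(struct inode *inode, loff_t start, loff_t end)
        {
                struct xfs_inode *ip = XFS_I(inode);
                int              error;

                xfs_ilock(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);

                error = filemap_write_and_wait_range(inode->i_mapping,
                                                     start, end);
                if (!error) {
                        truncate_pagecache_range(inode, start, end);
                        /* ... run the transaction that punches the extents ... */
                }

                xfs_iunlock(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
                return error;
        }

The important property is that the exclusion is per-inode (or
per-range, once range locks exist) rather than per-page, and the fault
side only ever takes the lock shared, so the common read fault case
never contends on it.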
There is a whole class of "page fault vs direct storage manipulation
offload" serialisation issues that filesystems have to consider
(especially if they want to support DAX), so if we can use that same
mechanism to knock a whole bunch of page locking out of the fault
paths then that seems like a win to me....

> Any other suggestions than those (a)-(d) ones above?

Not really - I've been advocating for (d) as the general mechanism for
truncate/holepunch exclusion for quite a few years now because it
largely seems to work with no obvious/apparent issues.

Just as a FWIW: I agree that the per-inode rwsem could be an issue
here, just as it is for the IO path. As a side project I'm working on
shared/exclusive range locks for the XFS inode to replace the rwsems
for the XFS_IOLOCK_{SHARED,EXCL} and the XFS_MMAPLOCK_{SHARED,EXCL}.
That will largely alleviate any problems that "per-inode rwsem"
serialisation might cause us here - I've got the DIO fastpath down to
2 atomic operations per lock/unlock, and it's within 10% of rwsems at
up to approx. half a million concurrent DIO reads/writes to the same
inode. Concurrent buffered reads/writes are not far behind direct IO
until I run out of CPU to copy data.

None of this requires changes to anything outside fs/xfs, because
everything is already correctly serialised into "atomic" filesystem
operations, and range locking preserves the atomicity of those
operations, including all the page cache operations done within them.

Hence I'd much prefer to be moving the page cache in a direction that
results in the page cache not having to care at all about serialising
against racing truncates, hole punches or anything else that runs page
invalidation. That will make the page cache code simpler, require less
locking, and likely have fewer invalidation-related issues over
time...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx