Re: [PATCH 3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 29 Nov 2021 10:40:38 -0800

On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote:
>
> That's what this series does when it probes the whole range in
> fault_in_writeable(). The main reason was that it's more efficient to do
> a read than a write on a large range (the latter dirtying the cache
> lines).

The more this thread goes on, the more I'm starting to think that we
should just make "fault_in_writable()" (and readable, of course) only
really work on the beginning of the area.

Not just for the finer-granularity pointer color probing, but for the
page probing too.

I'm looking at our current fault_in_writeable(), and I'm going

 (a) it uses __put_user() without range checks, which is really not great

 (b) it looks like a disaster from another standpoint: essentially
user-controlled loop size with no limit checking, no preemption, and
no check for fatal signals.

Now, (a) should be fixed with a access_ok() or similar.

And (b) can easily be fixed multiple ways, with one option simply just
being adding a can_resched() call and checking for fatal signals.

But faulting in the whole region is actually fundamentally wrong in
low-memory situations - the beginning of the region might be swapped
out by the time we get to the end. That's unlikely to be a problem in
real life, but it's an example of how it's simply not conceptually
sensible.

So I do wonder why we don't just say "fault_in_writable will fault in
_at_most_ X bytes", and simply limit the actual fault-in size to
something reasonable.

That solves _all_ the problems. It solves the lack of preemption and
fatal signals (by virtue of just limiting the amount of work we do).
It solves the low memory situation. And it solves the "excessive dirty
cachelines" case too.

Of course, we want to have some minimum bytes we fault in too, but
that minimum range might well be "we guarantee at least a full page
worth of data" (and in practice make it a couple of pages).

It's not like fault_in_writeable() avoids page faults or anything like
that - it just moves them around. So there's really very little reason
to fault in a large range, and there are multiple reasons _not_ to do
it.

Hmm?

               Linus