Re: mm: shm: hang in shmem_fallocate

Vlastimil Babka <vbabka@xxxxxxx> · Thu, 26 Jun 2014 17:19:25 +0200

On 06/26/2014 11:14 AM, Vlastimil Babka wrote:
On 06/26/2014 12:36 AM, Hugh Dickins wrote:
On Tue, 24 Jun 2014, Vlastimil Babka wrote:
On 06/16/2014 04:29 AM, Hugh Dickins wrote:
On Thu, 12 Jun 2014, Sasha Levin wrote:
On 02/09/2014 08:41 PM, Sasha Levin wrote:
On 02/08/2014 10:25 PM, Hugh Dickins wrote:
Would trinity be likely to have a thread or process repeatedly faulting
in pages from the hole while it is being punched?

I can see how trinity would do that, but just to be certain - Cc davej.

On 02/08/2014 10:25 PM, Hugh Dickins wrote:
Does this happen with other holepunch filesystems?  If it does not,
I'd suppose it's because the tmpfs fault-in-newly-created-page path
is lighter than a consistent disk-based filesystem's has to be.
But we don't want to make the tmpfs path heavier to match them.

No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
punching in other filesystems and I make sure to get a bunch of those
mounted before starting testing.

Just pinging this one again. I still see hangs in -next where the hang
location looks same as before:


Please give this patch a try.  It fixes what I can reproduce, but given
your unexplained page_mapped() BUG in this area, we know there's more
yet to be understood, so perhaps this patch won't do enough for you.


Hi,

Sorry for the slow response: I have got confused, learnt more, and
changed my mind, several times in the course of replying to you.
I think this reply will be stable... though not final.

Thanks a lot for looking into it!


since this got a CVE,

Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.

Sorry, I should have mentioned it explicitly.

Looks overrated to me

I'd bet it would pass unnoticed if you didn't use the sentence "but
whether it's a serious matter in the scale of denials of service, I'm
not so sure" in your first reply to Sasha's report :) I wouldn't be
surprised if people grep for this.

(and amusing to see my pompous words about a
"range notification mechanism" taken too seriously), but of course
we do need to address it.

I've been looking at backport to an older kernel where

Thanks a lot for looking into it.  I didn't think it was worth a
Cc: stable@xxxxxxxxxxxxxxx myself, but admit to being both naive
and inconsistent about that.

fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
range notification mechanism yet. There's just madvise(MADV_REMOVE) and since

Yes, that mechanism could be ported back pre-v3.5,
but I agree with your preference not to.

it doesn't guarantee anything, it seems simpler just to give up retrying to

Right, I don't think we have formally documented the instant of "full hole"
that I strove for there, and it's probably not externally verifiable, nor
guaranteed by other filesystems.  I just thought it a good QoS aim, but
it has given us this problem.

truncate really everything. Then I realized that maybe it would work for
current kernel as well, without having to add any checks in the page fault
path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
the old data from the range, it's fine from any information leak point of view.
If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
page before the parallel truncate has ended, or right after it has ended.

Yes.  I disagree with your actual patch, for more than one reason,
but it's in the right area; and I found myself growing to agree with
you, that's it's better to have one kind of fix for all these releases,
than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
v3.0 too, I'm sceptical about that, but haven't tried it as yet.)

I was looking at our 3.0 based kernel, but it could be due to backported
patches on top.

OK, seems I cannot reproduce this on 3.0.101 vanilla.

If I'd realized that we were going to have to backport, I'd have spent
longer looking for a patch like yours originally.  So my inclination
now is to go your route, make a new patch for v3.16 and backports,
and revert the f00cdc6df7d7 that has already gone in.

So I'm posting it here as a RFC. I haven't thought about the
i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy

My understanding is that i915_gem_object_truncate() is not a problem,
that i915's dev->struct_mutex serializes all its relevant transitions,
plus the object woudn't even be interestingly accessible to the user.

the new "lstart < inode->i_size" condition, but I don't know if it's "vulnerable"
to the problem.

I don't think i915 is vulnerable, but if it is, that condition would
be fine for it, as would be the patch I'm now thinking of.


-----8<-----
From: Vlastimil Babka <vbabka@xxxxxxx>
Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching

---
   mm/shmem.c | 19 +++++++++++++++++++
   1 file changed, 19 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index f484c27..6d6005c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
   		if (!pvec.nr) {
   			if (index == start || unfalloc)
   				break;
+                        /*
+                         * When this condition is true, it means we were
+                         * called from fallocate(FALLOC_FL_PUNCH_HOLE).
+                         * To prevent a livelock when someone else is faulting
+                         * pages back, we are content with single pass and do
+                         * not retry with index = start. It's important that
+                         * previous page content has been discarded, and
+                         * faulter(s) got new zeroed pages.
+                         *
+                         * The other callsites are shmem_setattr (for
+                         * truncation) and shmem_evict_inode, which set i_size
+                         * to truncated size or 0, respectively, and then call
+                         * us with lstart == inode->i_size. There we do want to
+                         * retry, and livelock cannot happen for other reasons.
+                         *
+                         * XXX what about i915_gem_object_truncate?
+                         */

I doubt you have ever faced such a criticism before, but I'm going
to speak my mind and say that comment is too long!  A comment of that
length is okay above or just inside or at a natural break in a function,
but here it distracts too much from what the code is actually doing.

Fair enough. The reasoning should have gone into commit log, not comment.

In particular, the words "this condition" are so much closer to the
condition above than the condition below, that it's rather confusing.

/* Single pass when hole-punching to not livelock on racing faults */
would have been enough (yes, I've cheated, that would be 2 or 4 lines).

+                        if (lstart < inode->i_size)

For a long time I was going to suggest that you leave i_size out of it,
and use "lend > 0" instead.  Then suddenly I realized that this is the
wrong place for the test.

Well my first idea was to just add a flag about how persistent it should
be. And set it false for the punch hole case. Then I wondered if there's
already some bit that distinguishes it. But it makes it more subtle.

And then that it's not your fault, it's mine,
in v3.1's d0823576bf4b "mm: pincer in truncate_inode_pages_range".
Wow, that really pessimized the hole-punch case!

When is pvec.nr 0?  When we've reached the end of the file.  Why should
we go to the end of the file, when punching a hole at the start?  Ughh!

Ah, I see (I think). But I managed to reproduce this problem when there
was only an extra page between lend and the end of file, so I doubt this
is the only problem. AFAIU it's enough to try punching a large enough
hole, then the loop can only do a single pagevec worth of pages per
iteration, which gives enough time for somebody faulting pages back?

+                                break;
   			index = start;
   			continue;
   		}
--
1.8.4.5

But there is another problem.  We cannot break out after one pass on
shmem, because there's a possiblilty that a swap entry in the radix_tree
got swizzled into a page just as it was about to be removed - your patch
might then leave that data behind in the hole.

Thanks, I didn't notice that. Do I understand correctly that this could
mean info leak for the punch hole call, but wouldn't be a problem for
madvise? (In any case, that means the solution is not general enough for
all kernels, so I'm asking just to be sure).

As it happens, Konstantin Khlebnikov suggested a patch for that a few
weeks ago, before noticing that it's already handled by the endless loop.
If we make that loop no longer endless, we need to add in Konstantin's
"if (shmem_free_swap) goto retry" patch.

Right now I'm thinking that my idiocy in d0823576bf4b may actually
be the whole of Trinity's problem: patch below.  If we waste time
traversing the radix_tree to end-of-file, no wonder that concurrent
faults have time to put something in the hole every time.

Sasha, may I trespass on your time, and ask you to revert the previous
patch from your tree, and give this patch below a try?  I am very
interested to learn if in fact it fixes it for you (as it did for me).

I will try this, but as I explained above, I doubt that alone will help.

Yep, it didn't help here.

However, I am wasting your time, in that I think we shall decide that
it's too unsafe to rely solely upon the patch below (what happens if
1024 cpus are all faulting on it while we try to punch a 4MB hole at

My reproducer is 4MB file, where the puncher tries punching everything
except first and last page. And there are 8 other threads (as I have 8
logical CPU's) that just repeatedly sweep the same range, reading only
the first byte of each page.

end of file? if we care).  I think we shall end up with the optimization
below (or some such: it can be written in various ways), plus reverting
d0823576bf4b's "index == start && " pincer, plus Konstantin's
shmem_free_swap handling, rolled into a single patch; and a similar

So that means no retry in any case (except the swap thing)? All callers
can handle that? I guess shmem_evict_inode would be ok, as nobody else
can be accessing that inode. But what about shmem_setattr? (i.e.
straight truncation) As you said earlier, faulters will get a SIGBUS
(which AFAIU is due to i_size being updated before we enter
shmem_undo_range). But could possibly a faulter already pass the i_size
test, and proceed with the fault only when we are already in
shmem_undo_range and have passed the page in question?

patch (without the swap part) for several functions in truncate.c.

Hugh

--- 3.16-rc2/mm/shmem.c	2014-06-16 00:28:55.124076531 -0700
+++ linux/mm/shmem.c	2014-06-25 10:28:47.063967052 -0700
@@ -470,6 +470,7 @@ static void shmem_undo_range(struct inod
   	for ( ; ; ) {
   		cond_resched();

+		index = min(index, end);
   		pvec.nr = find_get_entries(mapping, index,
   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
   				pvec.pages, indices);



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html