Re: [PATCH v5 4/6] mm: make vma cache SLAB_TYPESAFE_BY_RCU

On 12/11/24 16:30, Suren Baghdasaryan wrote:
> On Tue, Dec 10, 2024 at 3:01 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>>
>> On Tue, Dec 10, 2024 at 9:25 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>> >
>> > On 12/10/24 18:16, Suren Baghdasaryan wrote:
>> > > On Tue, Dec 10, 2024 at 8:32 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>> > >>
>> > >> On 12/10/24 17:20, Suren Baghdasaryan wrote:
>> > >> > On Tue, Dec 10, 2024 at 6:21 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>> > >> >>
>> > >> >> On 12/6/24 23:52, Suren Baghdasaryan wrote:
>> > >> >> > To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure
>> > >> >> > that object reuse before the RCU grace period is over will be
>> > >> >> > detected inside lock_vma_under_rcu().
>> > >> >> > lock_vma_under_rcu() enters RCU read section, finds the vma at the
>> > >> >> > given address, locks the vma and checks if it got detached or remapped
>> > >> >> > to cover a different address range. These last checks are there
>> > >> >> > to ensure that the vma was not modified after we found it but before
>> > >> >> > locking it.
>> > >> >> > vma reuse introduces several new possibilities:
>> > >> >> > 1. vma can be reused after it was found but before it is locked;
>> > >> >> > 2. vma can be reused and reinitialized (including changing its vm_mm)
>> > >> >> > while being locked in vma_start_read();
>> > >> >> > 3. vma can be reused and reinitialized after it was found but before
>> > >> >> > it is locked, then attached at a new address or to a new mm while
>> > >> >> > read-locked;
>> > >> >> > For case #1, the current checks will help detect cases where:
>> > >> >> > - vma was reused but not yet added into the tree (detached check)
>> > >> >> > - vma was reused at a different address range (address check);
>> > >> >> > We are missing the check for vm_mm to ensure the reused vma was not
>> > >> >> > attached to a different mm. This patch adds the missing check.
>> > >> >> > For case #2, we pass mm to vma_start_read() to prevent access to
>> > >> >> > unstable vma->vm_mm. This might lead to vma_start_read() returning
>> > >> >> > a false locked result but that's not critical if it's rare because
>> > >> >> > it will only lead to a retry under mmap_lock.
>> > >> >> > For case #3, we ensure the order in which vma->detached flag and
>> > >> >> > vm_start/vm_end/vm_mm are set and checked. vma gets attached after
>> > >> >> > vm_start/vm_end/vm_mm were set and lock_vma_under_rcu() should check
>> > >> >> > vma->detached before checking vm_start/vm_end/vm_mm. This is required
>> > >> >> > because attaching vma happens without vma write-lock, as opposed to
>> > >> >> > vma detaching, which requires vma write-lock. This patch adds memory
>> > >> >> > barriers inside is_vma_detached() and vma_mark_attached() needed to
>> > >> >> > order reads and writes to vma->detached vs vm_start/vm_end/vm_mm.
>> > >> >> > After these provisions, SLAB_TYPESAFE_BY_RCU is added to vm_area_cachep.
>> > >> >> > This will facilitate vm_area_struct reuse and will minimize the number
>> > >> >> > of call_rcu() calls.
>> > >> >> >
>> > >> >> > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
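
To make the ordering for case #3 above concrete, here is a minimal sketch of
the attach/detach pairing the commit message describes (the helper bodies are
my reconstruction from the description, not the patch text):

static inline void vma_mark_attached(struct vm_area_struct *vma)
{
        /* vm_start/vm_end/vm_mm must be visible before detached is cleared */
        smp_wmb();
        WRITE_ONCE(vma->detached, false);
}

static inline bool is_vma_detached(struct vm_area_struct *vma)
{
        bool detached = READ_ONCE(vma->detached);

        /* pairs with the smp_wmb() in vma_mark_attached() */
        smp_rmb();
        return detached;
}

A reader that observes detached == false is then guaranteed to also observe
the vm_start/vm_end/vm_mm values written before the attach, which is why
lock_vma_under_rcu() must check vma->detached before the other fields.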
>> > >> >>
>> > >> >> I'm wondering about the vma freeing path. Consider vma_complete():
>> > >> >>
>> > >> >> vma_mark_detached(vp->remove);
>> > >> >>   vma->detached = true; - plain write
>> > >> >> vm_area_free(vp->remove);
>> > >> >>   vma->vm_lock_seq = UINT_MAX; - plain write
>> > >> >>   kmem_cache_free(vm_area_cachep)
>> > >> >> ...
>> > >> >> potential reallocation
>> > >> >>
>> > >> >> against:
>> > >> >>
>> > >> >> lock_vma_under_rcu()
>> > >> >> - mas_walk finds a stale vma due to race
>> > >> >> vma_start_read()
>> > >> >>   if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
>> > >> >>   - can be false, the vma was not being locked on the freeing side?
>> > >> >> >>   down_read_trylock(&vma->vm_lock.lock) - succeeds, wasn't locked
>> > >> >>     this is acquire, but was there any release?
>> > >> >
>> > >> > Yes, there was a release. I think what you missed is that
>> > >> > vma_mark_detached(), which is called from vma_complete(), requires the
>> > >> > VMA to be write-locked (see vma_assert_write_locked() in
>> > >> > vma_mark_detached()). The rule is that a VMA can be attached without
>> > >> > write-locking but only a write-locked VMA can be detached. So, after
>> > >>
>> > >> OK, but write unlocking means the mm's seqcount is bumped and becomes
>> > >> unequal to the vma's vm_lock_seq, right?
>> > >>
>> > >> Yet in the example above we happily set it to UINT_MAX and thus
>> > >> effectively falsely unlock it for vma_start_read()?
>> > >>
>> > >> And this is all done before the vma_complete() side would actually reach
>> > >> mmap_write_unlock(), AFAICS.
>> > >
>> > > Ah, you are right. With the possibility of reuse, even a freed VMA
>> > > should be kept write-locked until it is unlocked by
>> > > mmap_write_unlock(). I think the fix for this is simply to not reset
>> > > vma->vm_lock_seq inside vm_area_free(). I'll also need to add a
>> >
>> > But even if we don't reset vm_lock_seq to UINT_MAX, whoever reallocated
>> > it can proceed and end up doing a vma_start_write(), rewriting it there
>> > anyway, no?
>>
>> Actually, I think with a small change we can simplify these locking rules:
>>
>> static inline void vma_start_write(struct vm_area_struct *vma)
>> {
>>         int mm_lock_seq;
>>
>> -        if (__is_vma_write_locked(vma, &mm_lock_seq))
>> -                return;
>> +        mmap_assert_write_locked(vma->vm_mm);
>> +        mm_lock_seq = vma->vm_mm->mm_lock_seq;
>>
>>         down_write(&vma->vm_lock->lock);
>>         /*
>>         * We should use WRITE_ONCE() here because we can have concurrent reads
>>         * from the early lockless pessimistic check in vma_start_read().
>>         * We don't really care about the correctness of that early check, but
>>         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>>         */
>>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>>         up_write(&vma->vm_lock->lock);
>> }
>>
>> This will force vma_start_write() to always write-lock vma->vm_lock
>> before changing vma->vm_lock_seq. Since vma->vm_lock survives reuse,
>> the other readers/writers will synchronize on it even if vma got
>> reused.
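
For reference, the reader side this has to synchronize with looks roughly
like this (my simplified sketch of vma_start_read() as modified by this
series to take mm explicitly, not the exact patch text):

static inline bool vma_start_read(struct mm_struct *mm,
                                  struct vm_area_struct *vma)
{
        /* lockless early check; a wrong result here only costs a retry */
        if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(mm->mm_lock_seq.sequence))
                return false;

        if (unlikely(!down_read_trylock(&vma->vm_lock->lock)))
                return false;

        /* recheck under the lock: a writer may have locked the vma meanwhile */
        if (unlikely(vma->vm_lock_seq == READ_ONCE(mm->mm_lock_seq.sequence))) {
                up_read(&vma->vm_lock->lock);
                return false;
        }
        return true;
}

Since vma->vm_lock survives reuse, making vma_start_write() always take
down_write() means this trylock/recheck pair is what readers and writers
serialize on, regardless of which mm the vma currently belongs to.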
> 
> After thinking of all the alternatives, I think the cleanest way to
> handle vma detaching would be to follow the same pattern as for vma
> attaching. To attach a vma we do:
> 
> vma->vm_mm = xxx;
> ...
> vma_mark_attached()
>     smp_wmb();
>     WRITE_ONCE(vma->detached, false);
> 
> 
> lock_vma_under_rcu() ensures that a vma is attached and still
> unchanged like this:
> 
> lock_vma_under_rcu()
>     vma_start_read();
>     is_vma_detached()
>         detached = READ_ONCE(vma->detached);
>         smp_rmb();
>     if (vma->vm_mm != mm)
> 
> So, vm_area_free() can follow the same pattern to ensure vma reuse
> gets detected even if lock_vma_under_rcu() succeeds in locking the
> vma:
> 
> vm_area_free()
>     vma->vm_mm = NULL;
>     smp_wmb();
>     WRITE_ONCE(vma->detached, true);
> 
> Vlastimil, I think that should address the race you described. WDYT?

I'm not sure. AFAIU the barriers would ensure that if lock_vma_under_rcu()
sees the vma as detached, it also sees that vm_mm is NULL. But nothing
ensures the reader sees the vma as detached in the first place, so it is
also not guaranteed to see vm_mm as NULL.

I think the main problem is that we unlock the vma by writing to a mm, not
the vma, which makes it hard to apply the necessary SLAB_TYPESAFE_BY_RCU
validation patterns to it. I thought the direction you were discussing with
PeterZ in the other thread would solve this (in addition to getting rid of
the rwsem, which we were considering anyway; enabling SLAB_TYPESAFE_BY_RCU
this way would make for a stronger argument).

Perhaps a solution to this that would work with the current rwsem would be
to set detached and vm_mm to NULL under a down_write() of the rwsem. That
would make sure that if lock_vma_under_rcu() succeeds with the
down_read_trylock(), it would be guaranteed to see those assignments?
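
Roughly, as an untested sketch of the idea (whether this belongs in
vma_mark_detached() or in the vm_area_free() path is a detail):

static inline void vma_mark_detached(struct vm_area_struct *vma)
{
        vma_assert_write_locked(vma);

        /*
         * Publish both stores under the vma rwsem: a lock_vma_under_rcu()
         * reader can only succeed its down_read_trylock() once this
         * up_write() completes, and the rwsem's release/acquire ordering
         * then guarantees it observes detached == true and vm_mm == NULL.
         */
        down_write(&vma->vm_lock->lock);
        WRITE_ONCE(vma->detached, true);
        vma->vm_mm = NULL;
        up_write(&vma->vm_lock->lock);
}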

>>
>> >
>> > > comment for vm_lock_seq explaining these requirements.
>> > > Do you agree that such a change would resolve the issue?
>> > >
>> > >>
>> > >> > vma_mark_detached() and before down_read_trylock(&vma->vm_lock.lock)
>> > >> > in vma_start_read() the VMA write-lock should have been released by
>> > >> > mmap_write_unlock() and therefore vma->detached=true should be
>> > >> > visible to the reader when it executed lock_vma_under_rcu().
>> > >> >
>> > >> >>   is_vma_detached() - false negative as the write above didn't propagate
>> > >> >>     here yet; a read barrier but where is the write barrier?
>> > >> >>   checks for vma->vm_mm, vm_start, vm_end - nobody reset them yet so false
>> > >> >>     positive, or they got reset on reallocation but writes didn't propagate
>> > >> >>
>> > >> >> Am I missing something that would prevent lock_vma_under_rcu() falsely
>> > >> >> succeeding here?
>> > >> >>
>> > >>
>> >




