Re: [PATCH -V5 RESEND 03/21] swap: Support PMD swap mapping in swap_duplicate()

Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> · Thu, 27 Sep 2018 14:12:39 -0700

On Thu, Sep 27, 2018 at 09:34:36AM +0800, Huang, Ying wrote:
> Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> writes:
> > On Wed, Sep 26, 2018 at 08:55:59PM +0800, Huang, Ying wrote:
> >> Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> writes:
> >> > On Tue, Sep 25, 2018 at 03:13:30PM +0800, Huang Ying wrote:
> >> >>  /*
> >> >>   * Increase reference count of swap entry by 1.
> >> >> - * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
> >> >> - * but could not be atomically allocated.  Returns 0, just as if it succeeded,
> >> >> - * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
> >> >> - * might occur if a page table entry has got corrupted.
> >> >> + *
> >> >> + * Return error code in following case.
> >> >> + * - success -> 0
> >> >> + * - swap_count_continuation is required but could not be atomically allocated.
> >> >> + *   *entry is used to return swap entry to call add_swap_count_continuation().
> >> >> + *								      -> ENOMEM
> >> >> + * - otherwise same as __swap_duplicate()
> >> >>   */
> >> >> -int swap_duplicate(swp_entry_t entry)
> >> >> +int swap_duplicate(swp_entry_t *entry, int entry_size)
> >> >>  {
> >> >>  	int err = 0;
> >> >>  
> >> >> -	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
> >> >> -		err = add_swap_count_continuation(entry, GFP_ATOMIC);
> >> >> +	while (!err &&
> >> >> +	       (err = __swap_duplicate(entry, entry_size, 1)) == -ENOMEM)
> >> >> +		err = add_swap_count_continuation(*entry, GFP_ATOMIC);
> >> >>  	return err;
> >> >
> >> > Now we're returning any error we get from __swap_duplicate, apparently to
> >> > accommodate ENOTDIR later in the series, which is a change from the behavior
> >> > introduced in 570a335b8e22 ("swap_info: swap count continuations").  This might
> >> > belong in a separate patch given its potential for side effects.
> >> 
> >> I have checked all the calls of the function and found there will be no
> >> bad effect.  Do you have any side effect?
> >
> > Before I was just being vaguely concerned about any unintended side effects,
> > but looking again, yes I do.
> >
> > Now when swap_duplicate returns an error in copy_one_pte, copy_one_pte returns
> > a (potentially nonzero) entry.val, which copy_pte_range interprets
> > unconditionally as 'try adding a swap count continuation.'  Not what we want
> > for returns other than -ENOMEM.
> 
> Thanks for pointing this out!  Before the change in the patchset, the
> behavior is,
> 
> Something wrong is detected in swap_duplicate(), but the error is
> ignored.  Then copy_one_pte() will think everything is OK, so that it
> can proceed to call set_pte_at().  The system will be in inconsistent
> state and some data may be polluted!

Yes, the part about page table corruption in the comment above swap_duplicate.

> But this doesn't cause any problem in practical.  Per my understanding,
> because if other part of the kernel works correctly, it's impossible for
> swap_duplicate() return any error except -ENOMEM before the change in
> this patchset.

I agree with that, but it's not what I'm trying to explain.  I didn't go into
enough detail, let me try again.  Hopefully I'm understanding this right.

While running with these patches, say we're at

  copy_pte_range
   copy_one_pte
    swap_duplicate
     __swap_duplicate
      __swap_duplicate_locked

And say __swap_duplicate_locked returns an error that isn't -ENOMEM, such as
-EEXIST.  That means __swap_duplicate and swap_duplicate also return -EEXIST.
copy_one_pte returns entry.val, which can be and usually is nonzero, so we
break out of the loop in copy_pte_range and then--erroneously--call
add_swap_count_continuation.

The add_swap_count_continuation call was added in 570a335b8e22 and relies on
the assumption that callers can only get -ENOMEM from swap_duplicate.  This
patch changes that assumption.

Not a big deal: the continuation call just returns early, no harm done, but it
allocs and frees a page needlessly, so we should fix it.  One way is to change
copy_one_pte's return to int so we can just pass the error code back to
copy_pte_range so it knows whether to try adding the continuation.

The other swap_duplicate caller, try_to_unmap_one, seems ok.

> But the error may be possible during development, and it
> may serve as some kind of document too.  So I suggest to add
> 
> VM_BUG_ON(error != -ENOMEM);
> 
> in swap_duplicate().  What do you think about that?

That doesn't seem necessary.

> > So it might make sense to have a separate patch that changes swap_duplicate's
> > return and makes callers handle it.
> 
> Thanks for your help to take a deep look at this.  I want to try to fix
> all potential problems firstly, because the number of the caller is
> quite limited.  Do you agree?

Yes, makes sense to me.

Daniel