Re: [PATCH v3 1/7] mm: allow drivers to prevent new writable mappings

David Herrmann <dh.herrmann@xxxxxxxxx> · Sat, 19 Jul 2014 18:12:01 +0200

Hi Hugh

On Wed, Jul 9, 2014 at 10:55 AM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> The i_mmap_writable field counts existing writable mappings of an
>> address_space. To allow drivers to prevent new writable mappings, make
>> this counter signed and prevent new writable mappings if it is negative.
>> This is modelled after i_writecount and DENYWRITE.
>>
>> This will be required by the shmem-sealing infrastructure to prevent any
>> new writable mappings after the WRITE seal has been set. In case there
>> exists a writable mapping, this operation will fail with EBUSY.
>>
>> Note that we rely on the fact that iff you already own a writable mapping,
>> you can increase the counter without using the helpers. This is the same
>> that we do for i_writecount.
>>
>> Signed-off-by: David Herrmann <dh.herrmann@xxxxxxxxx>
>
> I'm very pleased with the way you chose to do this in the end, following
> the way VM_DENYWRITE does it: it works out nicer than I had imagined.
>
> I do have some comments below, but the only one where I end up asking
> for a change is to remove the void "return "; but please read through
> in case any of my wonderings lead you to change your mind on something.
>
> I am very tempted to suggest that you add VM_WRITE into those VM_SHARED
> tests, and put the then necessary additional tests into mprotect.c: so
> that you would no longer have the odd case that a VM_SHARED !VM_WRITE
> area has to be unmapped before sealing the fd.
>
> But that would involve an audit of flush_dcache_page() architectures,
> and perhaps some mods to keep them safe across the mprotect(): let's
> put it down as a desirable future enhancement, just to remove an odd
> restriction.
>
> With the "return " gone,
> Acked-by: Hugh Dickins <hughd@xxxxxxxxxx>

Thanks for the review! Yes, allowing VM_SHARED without VM_WRITE would
be nice, but I think it would make this patchset more complex than it
already is. I agree that it's better done separately.

>> ---
>>  fs/inode.c         |  1 +
>>  include/linux/fs.h | 29 +++++++++++++++++++++++++++--
>>  kernel/fork.c      |  2 +-
>>  mm/mmap.c          | 24 ++++++++++++++++++------
>>  mm/swap_state.c    |  1 +
>>  5 files changed, 48 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 6eecb7f..9945b11 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -165,6 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>>       mapping->a_ops = &empty_aops;
>>       mapping->host = inode;
>>       mapping->flags = 0;
>> +     atomic_set(&mapping->i_mmap_writable, 0);
>>       mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>>       mapping->private_data = NULL;
>>       mapping->backing_dev_info = &default_backing_dev_info;
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 338e6f7..71d17c9 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -387,7 +387,7 @@ struct address_space {
>>       struct inode            *host;          /* owner: inode, block_device */
>>       struct radix_tree_root  page_tree;      /* radix tree of all pages */
>>       spinlock_t              tree_lock;      /* and lock protecting it */
>> -     unsigned int            i_mmap_writable;/* count VM_SHARED mappings */
>> +     atomic_t                i_mmap_writable;/* count VM_SHARED mappings */
>>       struct rb_root          i_mmap;         /* tree of private and shared mappings */
>>       struct list_head        i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
>>       struct mutex            i_mmap_mutex;   /* protect tree, count, list */
>> @@ -470,10 +470,35 @@ static inline int mapping_mapped(struct address_space *mapping)
>>   * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff
>>   * marks vma as VM_SHARED if it is shared, and the file was opened for
>>   * writing i.e. vma may be mprotected writable even if now readonly.
>> + *
>> + * If i_mmap_writable is negative, no new writable mappings are allowed. You
>> + * can only deny writable mappings, if none exists right now.
>>   */
>>  static inline int mapping_writably_mapped(struct address_space *mapping)
>>  {
>> -     return mapping->i_mmap_writable != 0;
>> +     return atomic_read(&mapping->i_mmap_writable) > 0;
>
> I have a slight anxiety that you're halving the writable range here:
> but I think that if we can overflow with this change, we could overflow
> before; and there are int counts elsewhere which would overflow easier.

Currently you need a new vm_area_struct to increase that counter.
vm_area_struct is about 144 bytes, times INT_MAX is something around
300GB. This does indeed look like a limit that can be reached not far
from now on common machines. I somehow doubt it's the only atomic that
might overflow soon, though. And there's always RLIMIT_AS you can set
to avoid such attacks.

I wouldn't oppose to using atomic_long_t, though. Many VM counters
already use atomic_long_t instead of atomic_t.

>> +}
>> +
>> +static inline int mapping_map_writable(struct address_space *mapping)
>> +{
>> +     return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
>> +             0 : -EPERM;
>> +}
>> +
>> +static inline void mapping_unmap_writable(struct address_space *mapping)
>> +{
>> +     return atomic_dec(&mapping->i_mmap_writable);
>
> Better just
>
>         atomic_dec(&mapping->i_mmap_writable);
>
> I didn't realize before that the compiler allows you to return a void
> function from a void function without warning, but better not rely on that.

Fixed, thanks.

>> +}
>> +
>> +static inline int mapping_deny_writable(struct address_space *mapping)
>> +{
>> +     return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
>> +             0 : -EBUSY;
>> +}
>> +
>> +static inline void mapping_allow_writable(struct address_space *mapping)
>> +{
>> +     atomic_inc(&mapping->i_mmap_writable);
>>  }
>>
>>  /*
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index d2799d1..f1f127e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -421,7 +421,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>>                               atomic_dec(&inode->i_writecount);
>>                       mutex_lock(&mapping->i_mmap_mutex);
>>                       if (tmp->vm_flags & VM_SHARED)
>> -                             mapping->i_mmap_writable++;
>> +                             atomic_inc(&mapping->i_mmap_writable);
>>                       flush_dcache_mmap_lock(mapping);
>>                       /* insert tmp into the share list, just after mpnt */
>>                       if (unlikely(tmp->vm_flags & VM_NONLINEAR))
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 129b847..19b6562 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -216,7 +216,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
>>       if (vma->vm_flags & VM_DENYWRITE)
>>               atomic_inc(&file_inode(file)->i_writecount);
>>       if (vma->vm_flags & VM_SHARED)
>> -             mapping->i_mmap_writable--;
>> +             mapping_unmap_writable(mapping);
>
> Okay.  I waste ridiculous time trying to decide whether it's better to say
>
>                 mapping_unmap_writable(mapping)
> or
>                 atomic_dec(&mapping->i_mmap_writable);
>
> here: there are consistency arguments both ways.  I usually solve this
> problem by not giving wrapper names to such operations, but in this case
> I think the answer is just to accept whatever you decided was best.
>
>>
>>       flush_dcache_mmap_lock(mapping);
>>       if (unlikely(vma->vm_flags & VM_NONLINEAR))
>> @@ -617,7 +617,7 @@ static void __vma_link_file(struct vm_area_struct *vma)
>>               if (vma->vm_flags & VM_DENYWRITE)
>>                       atomic_dec(&file_inode(file)->i_writecount);
>>               if (vma->vm_flags & VM_SHARED)
>> -                     mapping->i_mmap_writable++;
>> +                     atomic_inc(&mapping->i_mmap_writable);
>>
>>               flush_dcache_mmap_lock(mapping);
>>               if (unlikely(vma->vm_flags & VM_NONLINEAR))
>> @@ -1572,6 +1572,11 @@ munmap_back:
>>                       if (error)
>>                               goto free_vma;
>>               }
>> +             if (vm_flags & VM_SHARED) {
>> +                     error = mapping_map_writable(file->f_mapping);
>> +                     if (error)
>> +                             goto allow_write_and_free_vma;
>> +             }
>>               vma->vm_file = get_file(file);
>>               error = file->f_op->mmap(file, vma);
>>               if (error)
>> @@ -1611,8 +1616,12 @@ munmap_back:
>>
>>       vma_link(mm, vma, prev, rb_link, rb_parent);
>>       /* Once vma denies write, undo our temporary denial count */
>> -     if (vm_flags & VM_DENYWRITE)
>> -             allow_write_access(file);
>> +     if (file) {
>> +             if (vm_flags & VM_SHARED)
>> +                     mapping_unmap_writable(file->f_mapping);
>> +             if (vm_flags & VM_DENYWRITE)
>> +                     allow_write_access(file);
>> +     }
>>       file = vma->vm_file;
>
> Very good.  A bit subtle, and for a while I was preparing to warn you
> of the odd shmem_zero_setup() case (which materializes a file when
> none came in), and the danger of the count going wrong on that.
>
> But you have it handled, with the "if (file)" block immediately before
> the new assignment of file, which should force anyone to think about it.
> And the ordering here, with respect to file and error exits, has always
> been tricky - I don't think you are making it any worse.

I tried to keep it as similar to VM_DENYWRITE as possible. It's not
really obvious and I had to read it several times, too. But looks all
fine to me now.

> Oh, more subtle than I realized above: I've just gone through a last
> minute "hey, it's wrong; oh, no, it's okay" waver here, remembering
> how mmap /dev/zero (and some others, I suppose) gives you a new file.
>
> That is okay: we don't even have to assume that sealing is limited
> to your memfd_create(): it's safe to assume that any device which
> forks you a new file, is safe to assert mapping_map_writable() upon
> temporarily; and the new file given could never come already sealed.
>
> But maybe a brief comment to reassure us in future?

I added a comment explaining that ->mmap has to make sure vma_link()
succeeds in raising these counters in case it modified vma->vm_file.
All users I found return files that haven't been exposed to user-space
at that point, therefore they're safe. They also don't do initial
sealing or VM_DENYWRITE, so they are safe in both regards.

>>  out:
>>       perf_event_mmap(vma);
>> @@ -1641,14 +1650,17 @@ out:
>>       return addr;
>>
>>  unmap_and_free_vma:
>> -     if (vm_flags & VM_DENYWRITE)
>> -             allow_write_access(file);
>>       vma->vm_file = NULL;
>>       fput(file);
>>
>>       /* Undo any partial mapping done by a device driver. */
>>       unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
>>       charged = 0;
>> +     if (vm_flags & VM_SHARED)
>> +             mapping_unmap_writable(file->f_mapping);
>> +allow_write_and_free_vma:
>> +     if (vm_flags & VM_DENYWRITE)
>> +             allow_write_access(file);
>
> Okay.  I don't enjoy that rearrangement, such that those two uses of
> file come after the fput(file); but I believe there are good reasons
> why it can never be the final fput(file), so I think you're okay.

Actually, the fput(file) should rather be "put(vma->vm_file). The
reference we free here is not the reference that we got passed by the
syscall-context, but it's the reference we took for vma->vm_file.
Therefore, I think it's fine to access "file" afterwards, as this is
an independent reference. The vma->vm_file reference is also taken
_after_ doing VM_DENYWRITE and VM_SHARED accounting. Therefore, I put
the cleanup of vma->vm_file before reverting the counters for
symmetry's sake.

Note that we cannot turn fput(file) into fput(vma->vm_file), though.
The vma->vm_file field might be left set to something else after
->mmap() failed.

Thanks
David

>>  free_vma:
>>       kmem_cache_free(vm_area_cachep, vma);
>>  unacct_error:
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 2972eee..31321fa 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -39,6 +39,7 @@ static struct backing_dev_info swap_backing_dev_info = {
>>  struct address_space swapper_spaces[MAX_SWAPFILES] = {
>>       [0 ... MAX_SWAPFILES - 1] = {
>>               .page_tree      = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
>> +             .i_mmap_writable = ATOMIC_INIT(0),
>>               .a_ops          = &swap_aops,
>>               .backing_dev_info = &swap_backing_dev_info,
>>       }
>> --
>> 2.0.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>