On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
<kernel-team@xxxxxxxxxxx> wrote:
>
> On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > +locking maintainers
> > > >
> > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > > [...]
> > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > +{
> > > > > +       up_read(&vma->lock);
> > > > > +}
> > > >
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else.
> > >
> > > Yes, I think you are right. From a look into the code it seems that
> > > the UAF is quite unlikely as there is a ton of work to be done between
> > > vma_write_lock used to prepare vma for removal and actual removal.
> > > That doesn't make it less of a problem though.
> > >
> > > > So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > > >
> > > > But I'm not entirely sure about that, the locking folks might know better.
> > >
> > > I am not a locking expert but to me it looks like this should work
> > > because the final cleanup would have to happen after rcu_read_unlock.
> > >
> > > Thanks, I have completely missed this aspect of the locking when looking
> > > into the code.
> > >
> > > Btw. looking at this again I have fully realized how hard it is actually
> > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > vma manipulation functions like __vma_adjust make my head spin. Would it
> > > make more sense to have an RCU-style synchronization point in
> > > vm_area_free directly before call_rcu? This would add the overhead of an
> > > uncontended down_write of course.
> >
> > Something along those lines might be a good idea, but I think that
> > rather than synchronizing the removal, it should maybe be something
> > that splats (and bails out?) if it detects pending readers. If we get
> > to vm_area_free() on a VMA that has pending readers, we might already
> > be in a lot of trouble because the concurrent readers might have been
> > traversing page tables while we were tearing them down or fun stuff
> > like that.
> >
> > I think maybe Suren was already talking about something like that in
> > another part of this patch series but I don't remember...
>
> This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@xxxxxxxxxx?
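For reference, the scheme from the quoted commit message boils down to
something like the sketch below. This is a paraphrase of the description
rather than the exact code from the series: the vma->lock, vma->lock_seq
and mm->lock_seq names come from the quoted text, while the helper bodies
are my assumption of how the description translates into code.

        /*
         * Sketch of the per-VMA locking scheme described above; a
         * paraphrase of the quoted commit message, not the series code.
         */
        static inline void vma_write_lock(struct vm_area_struct *vma)
        {
                /* Writers hold mmap_lock, so mm->lock_seq is stable here. */
                mmap_assert_write_locked(vma->vm_mm);

                /* Already marked locked in this mmap_write_lock cycle. */
                if (vma->lock_seq == READ_ONCE(vma->vm_mm->lock_seq))
                        return;

                /*
                 * Briefly take the per-VMA lock exclusively to wait out
                 * current readers, then stamp the VMA with mm->lock_seq.
                 */
                down_write(&vma->lock);
                vma->lock_seq = READ_ONCE(vma->vm_mm->lock_seq);
                up_write(&vma->lock);
        }

        static inline bool vma_read_trylock(struct vm_area_struct *vma)
        {
                /* Write-locked for this cycle: fall back to mmap_lock. */
                if (vma->lock_seq == READ_ONCE(vma->vm_mm->lock_seq))
                        return false;

                if (!down_read_trylock(&vma->lock))
                        return false;

                /* Recheck in case a writer stamped the VMA concurrently. */
                if (vma->lock_seq == READ_ONCE(vma->vm_mm->lock_seq)) {
                        up_read(&vma->lock);
                        return false;
                }
                return true;
        }

The point of the lock_seq stamp is that mmap_write_unlock() can release
every VMA stamped during the cycle with a single increment of
mm->lock_seq, instead of walking and unlocking each of them.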
Yes, I spent a lot of time ensuring that __vma_adjust locks the right
VMAs and that VMAs are freed or isolated under VMA write lock
protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
Michal mentioned gets hit then it's a bug in my design and I'll have
to fix it. But please, let's not add synchronize_rcu() in
vm_area_free(). That will slow down any path that frees a VMA,
especially the exit path, which might be freeing thousands of them. I
had an SPF version with synchronize_rcu() in vm_area_free() and phone
vendors started yelling at me the very next day. call_rcu() with
CONFIG_RCU_NOCB_CPU (which Android uses for power-saving purposes) is
already bad enough to show up in the benchmarks, which is why I had to
add call_rcu() batching in
https://lore.kernel.org/all/20230109205336.3665937-40-surenb@xxxxxxxxxx.
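The batching linked above amounts to something like the following
hypothetical sketch. The vma_free_lock/vma_free_list/vma_free_count
fields in mm_struct, the vm_free_list node in vm_area_struct, the
vma_free_batch container and the VMA_FREE_BATCH threshold are all
illustrative names, not what the series actually uses; the idea it
shows is that a single call_rcu() covers a whole batch of VMAs.

        /* Drain threshold: an assumption for illustration. */
        #define VMA_FREE_BATCH  32

        struct vma_free_batch {
                struct rcu_head rcu;
                struct list_head vmas;
        };

        static void vma_free_batch_rcu(struct rcu_head *rcu)
        {
                struct vma_free_batch *batch =
                        container_of(rcu, struct vma_free_batch, rcu);
                struct vm_area_struct *vma, *next;

                /* One grace period has elapsed for the whole batch. */
                list_for_each_entry_safe(vma, next, &batch->vmas, vm_free_list)
                        kmem_cache_free(vm_area_cachep, vma);
                kfree(batch);
        }

        void vm_area_free(struct vm_area_struct *vma)
        {
                struct mm_struct *mm = vma->vm_mm;
                struct vma_free_batch *batch = NULL;

                spin_lock(&mm->vma_free_lock);
                list_add(&vma->vm_free_list, &mm->vma_free_list);
                if (++mm->vma_free_count >= VMA_FREE_BATCH) {
                        /*
                         * Allocation failure handling elided; on failure
                         * the list simply stays queued for a later drain.
                         */
                        batch = kmalloc(sizeof(*batch), GFP_ATOMIC);
                        if (batch) {
                                INIT_LIST_HEAD(&batch->vmas);
                                list_splice_init(&mm->vma_free_list,
                                                 &batch->vmas);
                                mm->vma_free_count = 0;
                        }
                }
                spin_unlock(&mm->vma_free_lock);

                /* One call_rcu() for up to VMA_FREE_BATCH freed VMAs. */
                if (batch)
                        call_rcu(&batch->rcu, vma_free_batch_rcu);
        }

The win presumably comes from issuing one callback per batch instead of
one per VMA, which matters on NOCB CPUs where each call_rcu() may have
to interact with the offloaded callback kthread.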