On Thu, Jan 18, 2024 at 9:58 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > On Tue, Jan 16, 2024 at 9:57 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > > On Tue, Jan 16, 2024 at 6:46 AM Vlastimil Babka <vbabka@xxxxxxx> wrote: > > > > > > On 1/16/24 15:42, Vlastimil Babka wrote: > > > > On 1/15/24 19:38, Suren Baghdasaryan wrote: > > > > > > > > Hi, > > > > > > > >> The issue this patchset is trying to address is mmap_lock contention when > > > >> a low priority task (monitoring, data collecting, etc.) blocks a higher > > > >> priority task from making updated to the address space. The contention is > > > >> due to the mmap_lock being held for read when reading proc/pid/maps. > > > >> With maple_tree introduction, VMA tree traversals are RCU-safe and per-vma > > > >> locks make VMA access RCU-safe. this provides an opportunity for lock-less > > > >> reading of proc/pid/maps. We still need to overcome a couple obstacles: > > > >> 1. Make all VMA pointer fields used for proc/pid/maps content generation > > > >> RCU-safe; > > > >> 2. Ensure that proc/pid/maps data tearing, which is currently possible at > > > >> page boundaries only, does not get worse. > > > > > > > > Hm I thought we were to only choose this more complicated in case additional > > > > tearing becomes a problem, and at first assume that if software can deal > > > > with page boundary tearing, it can deal with sub-page tearing too? > > > > Hi Vlastimil, > > Thanks for the feedback! > > Yes, originally I thought we wouldn't be able to avoid additional > > tearing without a big change but then realized it's not that hard, so > > I tried to keep the change in behavior transparent to the userspace. > > In the absence of other feedback I'm going to implement and post the > originally envisioned approach: remove validation step and avoid any > possibility of blocking but allowing for sub-page tearing. Will use > Matthew's rwsem_wait() to deal with possible inconsistent maple_tree > state. I posted v1 at https://lore.kernel.org/all/20240122071324.2099712-1-surenb@xxxxxxxxxx/ In the RFC I used mm_struct.mm_lock_seq to detect if mm is being changed but then I realized that won't work. mm_struct.mm_lock_seq is incremented after mm is changed and right before mmap_lock is write-unlocked. Instead I need a counter that changes once we write-lock mmap_lock and before any mm changes. So the new patchset introduces a separate counter to detect possible mm changes. In addition, I could not use rwsem_wait() and instead had to take mmap_lock for read to wait for the writer to finish and then record the new counter while holding mmap_lock for read. That prevents concurrent mm changes while we are recording the new counter value. > Thanks, > Suren. > > > > > > > > > > >> The patchset deals with these issues but there is a downside which I would > > > >> like to get input on: > > > >> This change introduces unfairness towards the reader of proc/pid/maps, > > > >> which can be blocked by an overly active/malicious address space modifyer. > > > > > > > > So this is a consequence of the validate() operation, right? We could avoid > > > > this if we allowed sub-page tearing. > > > > Yes, if we don't care about sub-page tearing then we could get rid of > > validate step and this issue with updaters blocking the reader would > > go away. If we choose that direction there will be one more issue to > > fix, namely the maple_tree temporary inconsistent state when a VMA is > > replaced with another one and we might observe NULL there. We might be > > able to use Matthew's rwsem_wait() to deal with that issue. > > > > > > > > > >> A couple of ways I though we can address this issue are: > > > >> 1. After several lock-less retries (or some time limit) to fall back to > > > >> taking mmap_lock. > > > >> 2. Employ lock-less reading only if the reader has low priority, > > > >> indicating that blocking it is not critical. > > > >> 3. Introducing a separate procfs file which publishes the same data in > > > >> lock-less manner. > > > > > > Oh and if this option 3 becomes necessary, then such new file shouldn't > > > validate() either, and whoever wants to avoid the reader contention and > > > converts their monitoring to the new file will have to account for this > > > possible extra tearing from the start. So I would suggest trying to change > > > the existing file with no validate() first, and if existing userspace gets > > > broken, employ option 3. This would mean no validate() in either case? > > > > Yes but I was trying to avoid introducing additional file which > > publishes the same content in a slightly different way. We will have > > to explain when userspace should use one vs the other and that would > > require going into low level implementation details, I think. Don't > > know if that's acceptable/preferable. > > Thanks, > > Suren. > > > > > > > > >> I imagine a combination of these approaches can also be employed. > > > >> I would like to get feedback on this from the Linux community. > > > >> > > > >> Note: mmap_read_lock/mmap_read_unlock sequence inside validate_map() > > > >> can be replaced with more efficiend rwsem_wait() proposed by Matthew > > > >> in [1]. > > > >> > > > >> [1] https://lore.kernel.org/all/ZZ1+ZicgN8dZ3zj3@xxxxxxxxxxxxxxxxxxxx/ > > > >> > > > >> Suren Baghdasaryan (3): > > > >> mm: make vm_area_struct anon_name field RCU-safe > > > >> seq_file: add validate() operation to seq_operations > > > >> mm/maps: read proc/pid/maps under RCU > > > >> > > > >> fs/proc/internal.h | 3 + > > > >> fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++---- > > > >> fs/seq_file.c | 24 ++++++- > > > >> include/linux/mm_inline.h | 10 ++- > > > >> include/linux/mm_types.h | 3 +- > > > >> include/linux/seq_file.h | 1 + > > > >> mm/madvise.c | 30 +++++++-- > > > >> 7 files changed, 181 insertions(+), 20 deletions(-) > > > >> > > > > > > >