Le 10/07/2024 à 14:45, Lorenzo Stoakes a écrit : > On Wed, Jul 10, 2024 at 10:28:01PM GMT, Michael Ellerman wrote: >> Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> writes: >>> On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote: >>>> >> ... >>>> The functionality here has changed >>>> --- from --- >>>> may_expand_vm() check >>>> can_modify_mm() check >>>> arch_unmap() >>>> vms_gather_munmap_vmas() >>>> ... >>>> >>>> --- to --- >>>> can_modify_mm() check >>>> arch_unmap() >>>> vms_gather_munmap_vmas() >>>> may_expand_vm() check >>>> ... >>>> >>>> vms_gather_munmap_vmas() does nothing but figures out what to do later, >>>> but could use memory and can fail. >>>> >>>> The user implications are: >>>> >>>> 1. The return type on the error may change to -EPERM from -ENOMEM, if >>>> you are not allowed to expand and are trying to overwrite mseal()'ed >>>> VMAs. That seems so very rare that I'm not sure it's worth mentioning. >>>> >>>> >>>> 2. arch_unmap() called prior to may_expand_vm(). >>>> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is >>>> within the unmap range. User implication of this means that an >>>> application my set the vdso to NULL prior to hitting the -ENOMEM case in >>>> may_expand_vm() due to the address space limit. >>>> >>>> Assuming the removal of the vdso does not cause the application to seg >>>> fault, then the user visible change is that any vdso call after a failed >>>> mmap(MAP_FIXED) call would result in a seg fault. The only reason it >>>> would fail is if the mapping process was attempting to map a large >>>> enough area over the vdso (which is accounted and in the vma tree, >>>> afaict) and ran out of memory. Note that this situation could arise >>>> already since we could run out of memory (not accounting) after the >>>> arch_unmap() call within the kernel. >>>> >>>> The code today can suffer the same fate, but not by the accounting >>>> failure. It can happen due to failure to allocate a new vma, >>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other >>>> failure scenarios later in the mmap_region() function. >>>> >>>> At the very least, this requires an expanded change log. >>> >>> Indeed, also (as mentioned on IRC) I feel like we need to look at whether >>> we _truly_ need this arch_unmap() call for a single, rather antiquated, >>> architecture. >> >> You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is >> factually wrong :) Power10 came out of the fab just a few years ago at >> 7nm. > > Fair point ;) perhaps we could go with "rarified"? :>) > >> >>> I mean why are they unmapping the VDSO, why is that valid, why does it need >>> that field to be set to NULL, is it possible to signify that in some other >>> way etc.? >> >> It was originally for CRIU. So a niche workload on a niche architecture. >> >> But from the commit that added it, it sounds like CRIU was using mremap, >> which should be handled these days by vdso_mremap(). So it could be that >> arch_unmap() is not actually needed for CRIU anymore. > > Oh that's interesting! > >> >> Then I guess we have to decide if removing our arch_unmap() would be an >> ABI break, regardless of whether CRIU needs it or not. > > Seems to me like an internal implementation detail that should hopefully > not result in anything that should have visible ABI impact? > > I guess this is something we ought to assess. It would be useful to > eliminate hooks where we can so we can better control VMA behaviour without > having to worry about an arch being able to do arbitrary things at > unexpected times, especially pertinent where we change the order of things. > I see you are talking about arch_unmap(). I didn't follow the entire discussion but we have some related stuff here: https://github.com/linuxppc/issues/issues/241 If I remember correctly arch_unmap() should have gone away we Dmitry's series https://lore.kernel.org/lkml/20210611180242.711399-1-dima@xxxxxxxxxx/#r but it hasn't been applied yet. Christophe