On Tue, Jan 14, 2025 at 06:19:32PM +0000, Lorenzo Stoakes wrote: > On Tue, Jan 14, 2025 at 06:14:57PM +0000, Lorenzo Stoakes wrote: > > This is getting into realms of discussion so to risk sounding rude - to be > > clear - NACK. > > > > The user-visible change to /proc/$pid/[s]maps kills this patch dead. This > > is regardless of any other discussed issue. > > > > But more importantly, I hadn't realise mmap_zero() was on the .mmap() > > callback (sorry my mistake) - you're simply not permitted to change > > vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and > > it's broken. > > I see shmem_zero_page() does change vma->vm_page, this is broken... ugh. I > will audit this code (and have a look through _all_ mmap() callbacks I > guess). Duly added to TODO. But definitely can't have _another_ case of > doing this. * vma->vm_file... it is late here :) > > > > > To me the alternative would be to have a custom fault handler that hands > > back the zero page, but I"m not sure that's workable, you'd have to install > > a special mapping etc. and huge pages are weird and... > > > > I do appreciate you raising this especially as I was blissfully unaware, > > but I don't see how this patch can possibly work, sorry :( > > > > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > > > > > > > > > > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > > > + Willy for the fs/weirdness elements of this. > > > > > > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > > > When creating private mapping for /dev/zero, the driver makes it an > > > > > anonymous mapping by calling set_vma_anonymous(). But it just sets > > > > > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > > > Hm yikes. > > > > > > > > > This is a special case and the VMA doesn't look like either anonymous VMA > > > > > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > > > > > > > > > It seems pointless to keep such special case. Making private /dev/zero > > > > > mapping a full anonymous mapping doesn't change the semantic of > > > > > /dev/zero either. > > > > My concern is that ostensibly there _is_ a file right? Are we certain that by > > > > not setting this we are not breaking something somewhere else? > > > > > > > > Are we not creating a sort of other type of 'non-such-beast' here? > > > > > > But the file is /dev/zero. I don't see this could break the semantic of > > > /dev/zero. The shared mapping of /dev/zero is not affected by this change, > > > kernel already treated private mapping of /dev/zero as anonymous mapping, > > > but with some weird settings in VMA. When reading the mapping, it returns 0 > > > with zero page, when writing the mapping, a new anonymous folio is > > > allocated. > > > > You're creating a new concept of an anon but not anon but also now with > > anon vm_pgoff and missing vm_file even though it does reference a file > > and... yeah. > > > > This is not usual :) > > > > > > > > > > > > > I mean already setting it anon and setting vm_file non-NULL is really strange. > > > > > > > > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > > > > > and /proc/<PID>/maps. > > > > > > > > > > Before the change: > > > > > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > > > > > > > > > After the change: > > > > > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > > > > > > > > > Yeah this seems like it might break somebody to be honest, it's really > > > > really really strange to map a file then for it not to be mapped. > > > > > > Yes, it is possible if someone really care whether the anonymous-like > > > mapping is mapped by /dev/zero or just created by malloc(). But I don't know > > > who really do... > > > > > > > > > > > But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a > > > > file but for it to be marked anonymous. > > > > > > > > God what a mess. > > > > > > > > > [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@xxxxxxxxxx/ > > > > I kind of hate that we have to mitigate like this for a case that should > > > > never ever happen so I'm inclined towards your solution but a lot more > > > > inclined towards us totally rethinking this. > > > > > > > > Do we _have_ to make this anonymous?? Why can't we just reference the zero > > > > page as if it were in the page cache (Willy - feel free to correct naive > > > > misapprehension here). > > > > > > TBH, I don't see why page cache has to be involved. When reading, 0 is > > > returned by zero page. When writing a CoW is triggered if page cache is > > > involved, but the content of the page cache should be just 0, so we copy 0 > > > to the new folio then write to it. It doesn't make too much sense. I think > > > this is why private /dev/zero mapping is treated as anonymous mapping in the > > > first place. > > > > I'm obviously not suggesting allocating a bunch of extra folios, I was > > thinking there would be some means of handing back the actual zero > > page. But I am not sure this is workable. > > > > > > > > > > > > > > Signed-off-by: Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx> > > > > > --- > > > > > drivers/char/mem.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > > > > > index 169eed162a7f..dae113f7fc1b 100644 > > > > > --- a/drivers/char/mem.c > > > > > +++ b/drivers/char/mem.c > > > > > @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) > > > > > if (vma->vm_flags & VM_SHARED) > > > > > return shmem_zero_setup(vma); > > > > > vma_set_anonymous(vma); > > > > > + fput(vma->vm_file); > > > > > + vma->vm_file = NULL; > > > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > > > This is just not permitted. We maintain mmap state which contains the file > > and pgoff state which gets threaded through the mapping operation, and > > simply do not expect you to change these fields. > > > > In future we will assert on this or preferably, restrict users to only > > changing VMA flags, the private field and vm_ops. > > > > > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > > > this will be wrong? But then we'd have no way of tracking it correctly... > > > > > > I'm not quite familiar with the subtle details and corner cases of > > > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > > > been visible to user yet at this point IIUC. How come mremap() could move > > > it? > > > > Ah OK, in that case fine on that front. > > > > But you are not permitted to touch this field (we need to enforce this...) > > > > > > > > > > > > > I've not checked the function but do we mark this as a special mapping of > > > > some kind? > > > > > > > > > + > > > > > return 0; > > > > > } > > > > > > > > > > -- > > > > > 2.47.0 > > > > > > > >