On 26.01.21 10:49, Michal Hocko wrote:
> On Tue 26-01-21 11:20:11, Mike Rapoport wrote:
>> On Tue, Jan 26, 2021 at 10:00:13AM +0100, Michal Hocko wrote:
>>> On Tue 26-01-21 10:33:11, Mike Rapoport wrote:
>>>> On Tue, Jan 26, 2021 at 08:16:14AM +0100, Michal Hocko wrote:
>>>>> On Mon 25-01-21 23:36:18, Mike Rapoport wrote:
>>>>>> On Mon, Jan 25, 2021 at 06:01:22PM +0100, Michal Hocko wrote:
>>>>>>> On Thu 21-01-21 14:27:18, Mike Rapoport wrote:
>>>>>>>> From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
>>>>>>>>
>>>>>>>> Introduce the "memfd_secret" system call with the ability to create memory
>>>>>>>> areas visible only in the context of the owning process and mapped neither
>>>>>>>> to other processes nor in the kernel page tables.
>>>>>>>>
>>>>>>>> The user will create a file descriptor using the memfd_secret() system
>>>>>>>> call. The memory areas created by mmap() calls from this file descriptor
>>>>>>>> will be unmapped from the kernel direct map and they will be only mapped in
>>>>>>>> the page table of the owning mm.
>>>>>>>>
>>>>>>>> The secret memory remains accessible in the process context using uaccess
>>>>>>>> primitives, but it is not accessible using direct/linear map addresses.
>>>>>>>>
>>>>>>>> Functions in the follow_page()/get_user_page() family will refuse to return
>>>>>>>> a page that belongs to the secret memory area.
>>>>>>>>
>>>>>>>> A page that was a part of the secret memory area is cleared when it is
>>>>>>>> freed.
>>>>>>>>
>>>>>>>> The following example demonstrates creation of a secret mapping (error
>>>>>>>> handling is omitted):
>>>>>>>>
>>>>>>>>   fd = memfd_secret(0);
>>>>>>>>   ftruncate(fd, MAP_SIZE);
>>>>>>>>   ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>>>>>>
>>>>>>> I do not see any access control or permission model for this feature.
>>>>>>> Is this feature generally safe for anybody?
>>>>>>
>>>>>> The mappings obey memlock limit.
>>>>>> Besides, this feature should be enabled explicitly at boot with the
>>>>>> kernel parameter that says what is the maximal memory size secretmem
>>>>>> can consume.
>>>>>
>>>>> Why is such a model sufficient and future proof? I mean even when it has
>>>>> to be enabled by an admin it is still an all-or-nothing approach. Mlock
>>>>> limit is not really useful because it is per mm rather than per user.
>>>>>
>>>>> Is there any reason why this is allowed for non-privileged processes?
>>>>> Maybe this has been discussed in the past but is there any reason why
>>>>> this cannot be done by a special device which would allow providing at
>>>>> least some permission policy?
>>>>
>>>> Why should this not be allowed for non-privileged processes? This behaves
>>>> similarly to mlocked memory, so I don't see a reason why secretmem should
>>>> have a different permissions model.
>>>
>>> Because apart from the reclaim aspect it fragments the direct mapping
>>> IIUC. That might have an impact on all others, right?
>>
>> It does fragment the direct map, but first it only splits 1G pages to 2M
>> pages and as was discussed several times already it's not that clear which
>> page size in the direct map is the best and this is very much workload
>> dependent.
>
> I do appreciate this has been discussed but this changelog is not
> specific on any of that reasoning and I am pretty sure nobody will
> remember the details in a few years. Also some numbers would be
> appropriate.
>
>> These are the results of the benchmarks I've run with the default direct
>> mapping covered with 1G pages, with disabled 1G pages using "nogbpages" in
>> the kernel command line and with the entire direct map forced to use 4K
>> pages using a simple patch to arch/x86/mm/init.c.
>>
>> https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing
>
> A good start for the data I am asking for above.
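
For reference, the three-call example from the quoted changelog could be
fleshed out roughly as below, with the omitted error handling added. This is
only a sketch under assumptions that do not come from this thread: libc is
assumed to provide no wrapper, so the call goes through syscall(2), and the
fallback syscall number 447 is the one used by mainline kernels that ship
memfd_secret(), which may differ from the series under review. The helper
names memfd_secret_call() and secret_roundtrip() are made up for
illustration. On a kernel without the feature, or with it disabled at boot,
the first call is expected to fail and the helper simply reports the error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 447 is the mainline syscall number for memfd_secret().
 * Only used when the installed headers do not define it. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/* Hypothetical helper: raw syscall, since libc may not wrap it. */
static int memfd_secret_call(unsigned int flags)
{
    return (int)syscall(SYS_memfd_secret, flags);
}

/*
 * Hypothetical helper: create a secret mapping of `size` bytes, fill it
 * with a pattern, read the pattern back through the user mapping, then
 * unmap and close. Returns 0 on success, -1 with errno set on failure.
 */
int secret_roundtrip(size_t size)
{
    unsigned char *ptr;
    int fd, ok, saved;

    fd = memfd_secret_call(0);
    if (fd < 0)
        return -1;              /* e.g. ENOSYS if unsupported or disabled */

    if (ftruncate(fd, (off_t)size) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* Per the changelog, the mapping obeys the memlock limit. */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* Ordinary user-space access keeps working on the secret pages. */
    memset(ptr, 0xA5, size);
    ok = (ptr[0] == 0xA5 && ptr[size - 1] == 0xA5);

    munmap(ptr, size);
    close(fd);

    if (!ok) {
        errno = EIO;
        return -1;
    }
    return 0;
}
```

Calling secret_roundtrip(4096) from main() and printing strerror(errno) on
failure is enough to see whether a given kernel accepts the call at all.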

I assume you've seen the benchmark results provided by Xing Zhengjun

https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@xxxxxxxxxxxxxxx/

-- 
Thanks,

David / dhildenb