Re: [PATCH v16 06/11] mm: introduce memfd_secret system call to create "secret" memory areas

David Hildenbrand <david@xxxxxxxxxx> · Tue, 26 Jan 2021 10:53:08 +0100



On 26.01.21 10:49, Michal Hocko wrote:
> On Tue 26-01-21 11:20:11, Mike Rapoport wrote:
>> On Tue, Jan 26, 2021 at 10:00:13AM +0100, Michal Hocko wrote:
>>> On Tue 26-01-21 10:33:11, Mike Rapoport wrote:
>>>> On Tue, Jan 26, 2021 at 08:16:14AM +0100, Michal Hocko wrote:
>>>>> On Mon 25-01-21 23:36:18, Mike Rapoport wrote:
>>>>>> On Mon, Jan 25, 2021 at 06:01:22PM +0100, Michal Hocko wrote:
>>>>>>> On Thu 21-01-21 14:27:18, Mike Rapoport wrote:
>>>>>>>> From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
>>>>>>>>
>>>>>>>> Introduce "memfd_secret" system call with the ability to create memory
>>>>>>>> areas visible only in the context of the owning process and mapped
>>>>>>>> neither into other processes nor into the kernel page tables.
>>>>>>>>
>>>>>>>> The user will create a file descriptor using the memfd_secret() system
>>>>>>>> call. The memory areas created by mmap() calls from this file descriptor
>>>>>>>> will be unmapped from the kernel direct map and they will be only mapped in
>>>>>>>> the page table of the owning mm.
>>>>>>>>
>>>>>>>> The secret memory remains accessible in the process context using uaccess
>>>>>>>> primitives, but it is not accessible using direct/linear map addresses.
>>>>>>>>
>>>>>>>> Functions in the follow_page()/get_user_page() family will refuse to return
>>>>>>>> a page that belongs to the secret memory area.
>>>>>>>>
>>>>>>>> A page that was a part of the secret memory area is cleared when it is
>>>>>>>> freed.
>>>>>>>>
>>>>>>>> The following example demonstrates creation of a secret mapping (error
>>>>>>>> handling is omitted):
>>>>>>>>
>>>>>>>> 	fd = memfd_secret(0);
>>>>>>>> 	ftruncate(fd, MAP_SIZE);
>>>>>>>> 	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>>>>>>
>>>>>>> I do not see any access control or permission model for this feature.
>>>>>>> Is this feature generally safe for anybody to use?
>>>>>>
>>>>>> The mappings obey the memlock limit. Besides, this feature must be
>>>>>> enabled explicitly at boot with a kernel parameter that sets the maximal
>>>>>> memory size secretmem can consume.
>>>>>
>>>>> Why is such a model sufficient and future proof? Even when it has to be
>>>>> enabled by an admin, it is still an all-or-nothing approach. The mlock
>>>>> limit is not really useful because it is per mm rather than per user.
>>>>>
>>>>> Is there any reason why this is allowed for non-privileged processes?
>>>>> Maybe this has been discussed in the past, but is there any reason why
>>>>> this cannot be done through a special device, which would allow for at
>>>>> least some permission policy?
>>>>
>>>> Why should this not be allowed for non-privileged processes? This behaves
>>>> similarly to mlocked memory, so I don't see a reason why secretmem should
>>>> have different permissions model.
>>>
>>> Because apart from the reclaim aspect it fragments the direct mapping
>>> IIUC. That might have an impact on all others, right?
>>
>> It does fragment the direct map, but, first, it only splits 1G pages into
>> 2M pages and, second, as was discussed several times already, it is not
>> clear which page size in the direct map is best; this is very much
>> workload dependent.
> 
> I do appreciate this has been discussed, but this changelog does not
> spell out any of that reasoning, and I am pretty sure nobody will
> remember the details a few years from now. Also some numbers would be
> appropriate.
> 
>> These are the results of the benchmarks I've run with the default direct
>> mapping covered with 1G pages, with disabled 1G pages using "nogbpages" in
>> the kernel command line and with the entire direct map forced to use 4K
>> pages using a simple patch to arch/x86/mm/init.c.
>>
>> https://docs.google.com/spreadsheets/d/1tdD-cu8e93vnfGsTFxZ5YdaEfs2E1GELlvWNOGkJV2U/edit?usp=sharing
> 
> A good start for the data I am asking for above.

I assume you've seen the benchmark results provided by Xing Zhengjun:

https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@xxxxxxxxxxxxxxx/
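
Separately, for anyone who wants to try the three-line example from the
quoted changelog, here is a rough sketch with the omitted error handling
filled in. It is only an illustration under a few assumptions: the helper
name secret_alloc() is made up, the call goes through syscall(2) because
libc may not provide a wrapper, and the fallback number 447 is the one
memfd_secret() got in mainline, so a tree carrying this series may well use
a different one.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumption: 447 is the syscall number memfd_secret() received in
 * mainline. A kernel built from this patch series may use another one.
 */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/*
 * Hypothetical helper (not part of the patch): create a secret mapping of
 * `size` bytes. Returns the mapping, or NULL with errno set on failure.
 */
void *secret_alloc(size_t size)
{
	void *ptr;
	int saved;
	int fd;

	/* Fails with ENOSYS if the kernel does not offer the call. */
	fd = (int)syscall(SYS_memfd_secret, 0U);
	if (fd < 0)
		return NULL;

	/* Size the backing file before mapping it. */
	if (ftruncate(fd, (off_t)size) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return NULL;
	}

	/* May fail, for example, when the memlock limit is exceeded. */
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);	/* an established mapping keeps the file alive */
	if (ptr == MAP_FAILED) {
		errno = saved;
		return NULL;
	}
	return ptr;
}
```

A caller would check the result for NULL, inspect errno to tell "feature
not available" apart from "limit exceeded", and release the area with
munmap() when done.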

-- 
Thanks,

David / dhildenb