On Mon 15-02-21 10:14:43, James Bottomley wrote:
> On Mon, 2021-02-15 at 10:13 +0100, Michal Hocko wrote:
> > On Sun 14-02-21 11:21:02, James Bottomley wrote:
> > > On Sun, 2021-02-14 at 10:58 +0100, David Hildenbrand wrote:
> > > [...]
> > > > > And here we come to the question "what are the differences that
> > > > > justify a new system call?" and the answer to this is very
> > > > > subjective. And as such we can continue bikeshedding forever.
> > > >
> > > > I think this fits into the existing memfd_create() syscall just
> > > > fine, and I heard no compelling argument why it shouldn't. That's
> > > > all I can say.
> > >
> > > OK, so let's review history. In the first two incarnations of the
> > > patch, it was an extension of memfd_create(). The specific
> > > objection by Kirill Shutemov was that it doesn't share any code in
> > > common with memfd and so should be a separate system call:
> > >
> > > https://lore.kernel.org/linux-api/20200713105812.dnwtdhsuyj3xbh4f@box/
> >
> > Thanks for the pointer. But this argument hasn't been challenged at
> > all. It hasn't been brought up that the overlap would be considerably
> > higher by the hugetlb/sealing support. And so far nobody has claimed
> > those combinations as unviable.
>
> Kirill is actually interested in the sealing path for his KVM code so
> we took a look. There might be a two line overlap in memfd_create for
> the seal case, but there's no real overlap in memfd_add_seals which is
> the bulk of the code. So the best way would seem to lift the inode ...
> -> seals helpers to be non-static so they can be reused and roll our
> own add_seals.

These are implementation details which are not really relevant to the
API IMHO.

> I can't see a use case at all for hugetlb support, so it seems to be a
> bit of an angels on pin head discussion. However, if one were to come
> along handling it in the same way seems reasonable.

Those angels have made their way to mmap, System V shm, memfd_create and
other MM interfaces which never envisioned them when introduced. Using
hugetlb pages to back guest memory is quite a common use case, so why do
you think those guests wouldn't like to see their memory be "secret"?

As I've said in my last response (YCZEGuLK94szKZDf@xxxxxxxxxxxxxx), I am
not going to argue all of this again. I have made my point and you are
free to take it or leave it.

> > > The other objection raised offlist is that if we do use
> > > memfd_create, then we have to add all the secret memory flags as an
> > > additional ioctl, whereas they can be specified on open if we do a
> > > separate system call. The container people violently objected to
> > > the ioctl because it can't be properly analysed by seccomp and much
> > > preferred the syscall version.
> > >
> > > Since we're dumping the uncached variant, the ioctl problem
> > > disappears but so does the possibility of ever adding it back if we
> > > take on the container peoples' objection. This argues for a
> > > separate syscall because we can add additional features and extend
> > > the API with flags without causing anti-ioctl riots.
> >
> > I am sorry but I do not understand this argument.
>
> You don't understand why container guarding technology doesn't like
> ioctls?

No, I did not see where the ioctl argument came from.

[...]
> > What kind of flags are we talking about and why would that be a
> > problem with memfd_create interface? Could you be more specific
> > please?
>
> You mean what were the ioctl flags in the patch series linked above?
> They were SECRETMEM_EXCLUSIVE and SECRETMEM_UNCACHED in patch 3/5.

OK I see. How many potential modes are we talking about? A few or
potentially many?

> They were eventually dropped after v10, because of problems with
> architectural semantics, with the idea that it could be added back
> again if a compelling need arose:
>
> https://lore.kernel.org/linux-api/20201123095432.5860-1-rppt@xxxxxxxxxx/
>
> In theory the extra flags could be multiplexed into the memfd_create
> flags like hugetlbfs is but with 32 flags and a lot already taken it
> gets messy for expansion. When we run out of flags the first question
> people will ask is "why didn't you do separate system calls?".

OK, I do not necessarily see a lack of flag space as a problem. I can be
wrong here but I do not see how that would be solved by a separate
syscall when it sounds rather foreseeable that many modes supported by
memfd_create will eventually find their way to secret memory as well.
If for no other reason, secret memory is nothing really special. It is
just memory which is not mapped into the kernel via the 1:1 mapping.
That's it. And that can be applied to any memory provided to userspace.
But I am repeating myself again here so I better stop.
-- 
Michal Hocko
SUSE Labs
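
For readers following the flag-space point above, here is a minimal sketch of how memfd_create() already multiplexes the hugetlb page size into its 32-bit flags word. The numeric values are meant to mirror linux/memfd.h and are redefined locally so the example compiles anywhere; every SKETCH_ name and both helpers are hypothetical and exist only for illustration, they are not kernel or libc code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: these values mirror the Linux UAPI header linux/memfd.h.
 * They are redefined under hypothetical SKETCH_ names for portability. */
#define SKETCH_MFD_CLOEXEC        0x0001U
#define SKETCH_MFD_ALLOW_SEALING  0x0002U
#define SKETCH_MFD_HUGETLB        0x0004U
#define SKETCH_MFD_HUGE_SHIFT     26      /* size field starts at bit 26 */
#define SKETCH_MFD_HUGE_MASK      0x3fU   /* six bits hold log2(page size) */

/* Hypothetical helper: pack log2 of the huge page size into the top six
 * bits of the flags word, the way MFD_HUGE_2MB and MFD_HUGE_1GB are built. */
static inline uint32_t sketch_encode_huge(unsigned int log2_size)
{
    assert(log2_size <= SKETCH_MFD_HUGE_MASK);
    return (uint32_t)log2_size << SKETCH_MFD_HUGE_SHIFT;
}

/* Hypothetical helper: recover log2 of the huge page size from flags. */
static inline unsigned int sketch_decode_huge(uint32_t flags)
{
    return (flags >> SKETCH_MFD_HUGE_SHIFT) & SKETCH_MFD_HUGE_MASK;
}
```

With the real header, the equivalent call would look like memfd_create("guest", MFD_CLOEXEC | MFD_HUGETLB | MFD_HUGE_2MB). Six of the 32 bits go to the size field alone, which is the kind of pressure on the flags word that the quoted text refers to.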