Hello Yang Xu,

On 5/12/21 10:53 PM, Yang Xu wrote:
> hugetlb_shm_group contains group id that is allowed to create SysV shared
> memory segment using hugetlb page. To meet EPERM error, we also
> need to make group id be not in this proc file.
>
> Signed-off-by: Yang Xu <xuyang2018.jy@xxxxxxxxxxx>
> ---
>  man2/shmget.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/man2/shmget.2 b/man2/shmget.2
> index 757b7b7f1..29799b9b8 100644
> --- a/man2/shmget.2
> +++ b/man2/shmget.2
> @@ -273,7 +273,7 @@ The
>  .B SHM_HUGETLB
>  flag was specified, but the caller was not privileged (did not have the
>  .B CAP_IPC_LOCK
> -capability).
> +capability and group id doesn't be contained in hugetlb_shm_group proc file).
>  .SH CONFORMING TO
>  POSIX.1-2001, POSIX.1-2008, SVr4.
>  .\" SVr4 documents an additional error condition EEXIST.

Thanks for spotting this. The story is more complex, as far as I can tell.
For example, the same error also occurs for mmap(2) and memfd_create(2).

Instead of your patch, I applied the diff below (not yet pushed), based
on my reading of fs/hugetlbfs/inode.c, in particular:

    static int can_do_hugetlb_shm(void)
    {
        kgid_t shm_group;
        shm_group = make_kgid(&init_user_ns, sysctl_hugetlb_shm_group);
        return capable(CAP_IPC_LOCK) || in_group_p(shm_group);
    }

    ...

    struct file *hugetlb_file_setup(const char *name, size_t size,
                                    vm_flags_t acctflag,
                                    struct user_struct **user,
                                    int creat_flags, int page_size_log)
    {
        ...
        if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
            *user = current_user();
            if (user_shm_lock(size, *user)) {
                task_lock(current);
                pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
                             current->comm, current->pid);
                task_unlock(current);
            } else {
                *user = NULL;
                return ERR_PTR(-EPERM);
            }
        }
        ...
    }

As a deprecated feature, it appears that RLIMIT_MEMLOCK can also be
used to permit huge page allocation, but I have chosen not to document
that for now.

Please let me know if the patch makes sense to you.

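In case it helps with reviewing the wording, here is a rough user-space
model of the decision made by the kernel code quoted above. It is only a
sketch, not kernel code: the function names hugetlb_shm_permitted() and
gid_in_set() and all of the parameters are hypothetical, and the
memlock_ok flag simply stands in for a successful user_shm_lock() call
(the deprecated RLIMIT_MEMLOCK path).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: true if 'target' matches the filesystem GID or
 * any supplementary GID, which is the membership test the proposed
 * proc(5) text describes.
 */
static bool gid_in_set(gid_t fsgid, const gid_t *supp, size_t nsupp,
                       gid_t target)
{
    if (fsgid == target)
        return true;
    for (size_t i = 0; i < nsupp; i++) {
        if (supp[i] == target)
            return true;
    }
    return false;
}

/*
 * Hypothetical model of the SHM_HUGETLB permission decision:
 * returns 0 if the allocation would be allowed, -EPERM otherwise.
 *
 * has_cap_ipc_lock: caller holds CAP_IPC_LOCK
 * shm_group:        group ID configured in the hugetlb shm group sysctl
 * memlock_ok:       assumption standing in for user_shm_lock() succeeding
 */
static int hugetlb_shm_permitted(bool has_cap_ipc_lock, gid_t fsgid,
                                 const gid_t *supp, size_t nsupp,
                                 gid_t shm_group, bool memlock_ok)
{
    /* Mirrors can_do_hugetlb_shm(): capability OR group membership. */
    if (has_cap_ipc_lock || gid_in_set(fsgid, supp, nsupp, shm_group))
        return 0;

    /* Deprecated fallback: allowed, but the kernel warns once. */
    if (memlock_ok)
        return 0;

    return -EPERM;
}
```

The point of the model is that EPERM results only when the capability
check, the group check, and the deprecated fallback all fail; the man
page text deliberately describes just the first two.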
With best regards,

Michael

--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -201,6 +201,19 @@ The
 .BR memfd_create ()
 system call first appeared in Linux 3.17;
 glibc support was added in version 2.27.
+.TP
+.B EPERM
+The
+.B MFD_HUGETLB
+flag was specified, but the caller was not privileged (did not have the
+.B CAP_IPC_LOCK
+capability)
+and is not a member of the
+.I sysctl_hugetlb_shm_group
+group; see the description of
+.I /proc/sys/vm/sysctl_hugetlb_shm_group
+in
+.BR proc (5).
 .SH CONFORMING TO
 The
 .BR memfd_create ()
diff --git a/man2/mmap.2 b/man2/mmap.2
index 03f2eeb2c..4ee2f4f96 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -628,6 +628,18 @@ was mounted no-exec.
 The operation was prevented by a file seal; see
 .BR fcntl (2).
 .TP
+.B EPERM
+The
+.B MAP_HUGETLB
+flag was specified, but the caller was not privileged (did not have the
+.B CAP_IPC_LOCK
+capability)
+and is not a member of the
+.I sysctl_hugetlb_shm_group
+group; see the description of
+.I /proc/sys/vm/sysctl_hugetlb_shm_group
+in
+.TP
 .B ETXTBSY
 .B MAP_DENYWRITE
 was set but the object specified by
diff --git a/man2/shmget.2 b/man2/shmget.2
index 757b7b7f1..6e9995e81 100644
--- a/man2/shmget.2
+++ b/man2/shmget.2
@@ -273,7 +273,13 @@ The
 .B SHM_HUGETLB
 flag was specified, but the caller was not privileged (did not have the
 .B CAP_IPC_LOCK
-capability).
+capability)
+and is not a member of the
+.I sysctl_hugetlb_shm_group
+group; see the description of
+.I /proc/sys/vm/sysctl_hugetlb_shm_group
+in
+.BR proc (5).
 .SH CONFORMING TO
 POSIX.1-2001, POSIX.1-2008, SVr4.
 .\" SVr4 documents an additional error condition EEXIST.
diff --git a/man5/proc.5 b/man5/proc.5
index a28dbdcc7..888535449 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -5603,6 +5603,19 @@ user should run
 .BR sync (1)
 first.
 .TP
+.IR /proc/sys/vm/sysctl_hugetlb_shm_group " (since Linux 2.6.7)"
+This writable file contains a group ID that is allowed
+to allocate memory using huge pages.
+If a process has a filesystem group ID or any supplementary group ID that
+matches this group ID,
+then it can make huge-page allocations without holding the
+.BR CAP_IPC_LOCK
+capability; see
+.BR memfd_create (2),
+.BR mmap (2),
+and
+.BR shmget (2).
+.TP
 .IR /proc/sys/vm/legacy_va_layout " (since Linux 2.6.9)"
 .\" The following is from Documentation/filesystems/proc.txt
 If nonzero, this disables the new 32-bit memory-mapping layout;
diff --git a/man7/capabilities.7 b/man7/capabilities.7
index 7e79b2fb6..cf9dc190f 100644
--- a/man7/capabilities.7
+++ b/man7/capabilities.7
@@ -205,11 +205,21 @@ the filesystem or any of the supplementary GIDs of the calling process.
 .B CAP_IPC_LOCK
 .\" FIXME . As at Linux 3.2, there are some strange uses of this capability
 .\" in other places; they probably should be replaced with something else.
+.PD 0
+.RS
+.IP * 2
 Lock memory
 .RB ( mlock (2),
 .BR mlockall (2),
 .BR mmap (2),
+.BR shmctl (2));
+.IP *
+Allocate memory using huge pages
+.RB ( memfd_create (2)
+.BR mmap (2),
 .BR shmctl (2)).
+.PD 0
+.RE
 .TP
 .B CAP_IPC_OWNER
 Bypass permission checks for operations on System V IPC objects.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/