Re: [PATCH bpf] bpf: respect CAP_IPC_LOCK in RLIMIT_MEMLOCK check

On 9/11/19 8:18 PM, Christian Barcenas wrote:
A process can lock memory addresses into physical RAM explicitly
(via mlock, mlockall, shmctl, etc.) or implicitly (via VFIO,
perf ring-buffers, bpf maps, etc.), subject to RLIMIT_MEMLOCK limits.

CAP_IPC_LOCK allows a process to exceed these limits, and throughout
the kernel this capability is checked before allowing/denying an attempt
to lock memory regions into RAM.

Because bpf locks its programs and maps into RAM, it should respect
CAP_IPC_LOCK. Previously, bpf would return EPERM when RLIMIT_MEMLOCK was
exceeded by a privileged process, which is contrary to documented
RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior.

Do you have a link/pointer where this is /clearly/ documented?

I admit that after submitting this patch, I did re-think the description and thought maybe I should have described the CAP_IPC_LOCK behavior as "expected" rather than "documented". :)

... but my best guess is you are referring to `man 2 mlock`:

    Limits and permissions

       In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK)
       in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines
       a limit on how much memory the process may lock.

       Since Linux 2.6.9, no limits are placed on the amount of memory that a
       privileged process can lock and the RLIMIT_MEMLOCK soft resource limit
       instead defines a limit on how much memory an unprivileged process may lock.

Yes; this is what I was referring to by "documented RLIMIT_MEMLOCK+CAP_IPC_LOCK behavior."
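
For reference, the soft limit that the man page excerpt talks about can be read from userspace with getrlimit(2). The following is only a rough sketch and not part of the patch; the helper names memlock_soft_limit() and print_memlock_limit() are made up for illustration.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: fetch the soft RLIMIT_MEMLOCK value in bytes.
 * Returns 0 on success, -1 on failure (errno is set by getrlimit).
 */
int memlock_soft_limit(rlim_t *out)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
                return -1;
        *out = rl.rlim_cur;     /* soft limit; rl.rlim_max is the hard limit */
        return 0;
}

/* Hypothetical helper: print the soft limit in a readable form. */
void print_memlock_limit(void)
{
        rlim_t lim;

        if (memlock_soft_limit(&lim) != 0) {
                perror("getrlimit");
                return;
        }
        if (lim == RLIM_INFINITY)
                printf("RLIMIT_MEMLOCK soft limit: unlimited\n");
        else
                printf("RLIMIT_MEMLOCK soft limit: %llu bytes\n",
                       (unsigned long long)lim);
}
```

Per the man page text above, this value is expected to constrain only unprivileged processes; a process holding CAP_IPC_LOCK should not be bound by it.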

Unfortunately - AFAICT - this is the most explicit documentation about CAP_IPC_LOCK's permission set, but it is incomplete.

I believe it can be understood from other references to RLIMIT_MEMLOCK and CAP_IPC_LOCK throughout the kernel that "locking memory" refers not only to the mlock/shmctl syscalls, but also to other code sites where /physical/ memory is pinned or allocated on behalf of userspace.

After identifying RLIMIT_MEMLOCK checks with

    git grep -C3 '[^(get|set)]rlimit(RLIMIT_MEMLOCK'

we find that RLIMIT_MEMLOCK is bypassed - if CAP_IPC_LOCK is held - in many locations that have nothing to do with the mlock or shm family of syscalls. From what I can tell, every time RLIMIT_MEMLOCK is referenced there is a neighboring check for CAP_IPC_LOCK that bypasses the rlimit, or in some cases bypasses memory accounting entirely!

bpf() is currently the only exception to the above, i.e. as far as I can tell it is the only code that enforces RLIMIT_MEMLOCK but does not honor CAP_IPC_LOCK.
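
Boiled down, the pattern shared by those call sites looks roughly like the predicate below. This is a userspace model under my own assumptions, not kernel code: memlock_charge_allowed() and its parameters are hypothetical, and the capability is passed in as a plain flag instead of calling capable().

```c
#include <stdbool.h>

/*
 * Hypothetical model of the common check: may a task that already has
 * locked_pages accounted lock new_pages more, given a limit of
 * limit_pages (i.e. rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT)?
 *
 * has_cap_ipc_lock stands in for capable(CAP_IPC_LOCK).
 */
bool memlock_charge_allowed(unsigned long locked_pages,
                            unsigned long new_pages,
                            unsigned long limit_pages,
                            bool has_cap_ipc_lock)
{
        /* Privileged: the rlimit is not consulted at all. */
        if (has_cap_ipc_lock)
                return true;

        /* Unprivileged: locked + new must not exceed the limit.
         * Written this way to avoid overflow in the addition. */
        if (new_pages > limit_pages)
                return false;
        return locked_pages <= limit_pages - new_pages;
}
```

With a limit of 16 pages and 10 already locked, an unprivileged request for 7 more is refused, while the same request with the capability flag set is allowed.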

Selected examples follow:

In net/core/skbuff.c:

    if (capable(CAP_IPC_LOCK) || !size)
            return 0;

    num_pg = (size >> PAGE_SHIFT) + 2;      /* worst case */
    max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    user = mmp->user ? : current_user();

    do {
            old_pg = atomic_long_read(&user->locked_vm);
            new_pg = old_pg + num_pg;
            if (new_pg > max_pg)
                    return -ENOBUFS;
    } while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
             old_pg);
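
For anyone who wants to experiment with this accounting loop outside the kernel, here is a minimal userspace analogue using C11 atomics. It is a sketch only: charge_locked_pages() is a hypothetical name, the capability is again a plain flag, and atomic_compare_exchange_weak() is swapped in for the kernel's atomic_long_cmpxchg().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace analogue of the skbuff.c accounting above.
 * Returns 0 on success, -ENOBUFS if the charge would exceed max_pg.
 */
int charge_locked_pages(atomic_long *locked_vm, long num_pg, long max_pg,
                        bool has_cap_ipc_lock)
{
        long old_pg, new_pg;

        /* Privileged callers skip the limit and the accounting entirely. */
        if (has_cap_ipc_lock || num_pg == 0)
                return 0;

        old_pg = atomic_load(locked_vm);
        do {
                new_pg = old_pg + num_pg;
                if (new_pg > max_pg)
                        return -ENOBUFS;
                /* On failure, old_pg is refreshed with the current value
                 * and the limit check is repeated. */
        } while (!atomic_compare_exchange_weak(locked_vm, &old_pg, new_pg));

        return 0;
}
```

Note that, as in the kernel snippet, the privileged path leaves the counter untouched, so nothing needs to be uncharged later for such callers.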

In net/xdp/xdp_umem.c:

    if (capable(CAP_IPC_LOCK))
            return 0;

    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    umem->user = get_uid(current_user());

    do {
            old_npgs = atomic_long_read(&umem->user->locked_vm);
            new_npgs = old_npgs + umem->npgs;
            if (new_npgs > lock_limit) {
                    free_uid(umem->user);
                    umem->user = NULL;
                    return -ENOBUFS;
            }
    } while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
                                 new_npgs) != old_npgs);
    return 0;

In arch/x86/kvm/svm.c:

    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
            pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n",
                   locked, lock_limit);
            return NULL;
    }

In drivers/infiniband/core/umem.c (and other sites in Infiniband code):

    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

    new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
    if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
            atomic64_sub(npages, &mm->pinned_vm);
            ret = -ENOMEM;
            goto out;
    }

In drivers/vfio/vfio_iommu_type1.c, albeit in an indirect way:

    struct vfio_dma {
        bool                 lock_cap;       /* capable(CAP_IPC_LOCK) */
    };

    // ...

    for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
         pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
            // ...

            if (!rsvd && !vfio_find_vpfn(dma, iova)) {
                    if (!dma->lock_cap &&
                        current->mm->locked_vm + lock_acct + 1 > limit) {
                            put_pfn(pfn, dma->prot);
                            pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
                                    __func__, limit << PAGE_SHIFT);
                            ret = -ENOMEM;
                            goto unpin_out;
                    }
                    lock_acct++;
            }
    }

Best,
Christian


