On 5/21/2023 7:53 PM, Tianjia Zhang wrote:
> Hi Casey,
>
> On 5/18/23 8:01 AM, Casey Schaufler wrote:
>> On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
>>> Hi Casey,
>>>
>>> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>>>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>>>> Separated fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is included
>>>>> within CAP_SYS_ADMIN.
>>>>>
>>>>> Some database products rely on shared storage to complete the
>>>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>>>> When HA occurs, they rely on the PR (Persistent Reservations) protocol
>>>>> provided by the storage layer to manage block device permissions to
>>>>> ensure data correctness.
>>>>>
>>>>> CAP_SYS_ADMIN is required in the PR protocol implementation of existing
>>>>> block devices in the Linux kernel, which has too many sensitive
>>>>> permissions, which may lead to risks such as container escape. The
>>>>> kernel needs to provide more fine-grained permission management like
>>>>> CAP_NET_ADMIN to avoid online products directly relying on root to run.
>>>>>
>>>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>>>> ensuring that applications run with least privilege.
>>>>
>>>> Can you demonstrate that there are cases where a program that needs
>>>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other operations?
>>>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>>>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>>>> justify.
>>>>
>>>
>>> For the previous non-container scenarios, the block device is a shared
>>> device, because the business-system generally operates the file system
>>> on the block. Therefore, directly operating the block device has a high
>>> probability of affecting other processes on the same host, and it is a
>>> reasonable requirement to need the CAP_SYS_ADMIN capability.
>>>
>>> But for a database running in a container scenario, especially a
>>> container scenario on the cloud, it is likely that a container
>>> exclusively occupies a block device. That is to say, for a container,
>>> its access to the block device will not affect other process, there is
>>> no need to obtain a higher CAP_SYS_ADMIN capability.
>>
>> If I understand correctly, you're saying that the process that requires
>> CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
>> other operations.
>>
>> That's good, but it isn't clear how a process on bare metal would
>> require CAP_SYS_ADMIN while the same process in a container wouldn't.
>>
>>>
>>> For a file system similar to distributed write-once-read-many, it is
>>> necessary to ensure the correctness of recovery, then when recovery
>>> occurs, it is necessary to ensure that no inflighting-io is completed
>>> after recovery.
>>>
>>> This can be guaranteed by performing operations such as SCSI/NVME
>>> Persistent Reservations on block devices on the distributed file
>>> system.
>>
>> Does your cloud based system always run "real" devices? My
>> understanding is that cloud based deployment usually uses
>> virtual machines and virtio or other simulated devices.
>> A container deployment in the cloud seems unlikely to be able
>> to take advantage of block administration. But I can't say
>> I know the specifics of your environment.
>>
>>> Therefore, at present, it is only necessary to have the relevant
>>> permission support of the control command of such container-exclusive
>>> block devices.
>>
>> This looks like an extremely special case in which breaking out
>> block management would make sense.
>>
> Our scenario is like this. In simply terms, a distributed database has
> a read-write instance and one or more read-only instances. Each instance
> runs in an isolated container. All containers share the same block
> device.
>
> In addition to the database instance, there is also a control program
> running on the control plane in the container. The database ensures
> the correctness of the data through the PR (Persistent Reservations)
> of the block device. This operation is also the only operation in the
> container that requires CAP_SYS_ADMIN privileges.
>
> This system as a whole, whether it is running on VM or bare metal, the
> difference is not big.
>
> In order to support the PR of block devices, we need to grant
> CAP_SYS_ADMIN permissions to the container, which not only greatly
> increases the risk of container escape, but also makes us have to
> carefully configure the permissions of the container. Many container
> escapes that have occurred are also caused by these reasons.
>
> This is essentially a problem of permission isolation. We hope to
> share the smallest possible permissions from CAP_SYS_ADMIN to support
> necessary operations, and avoid providing CAP_SYS_ADMIN permissions
> to containers as much as possible.

Your use case is interesting, but not compelling. While you may have
come up with a specific case where you can completely break
CAP_BLOCK_ADMIN out from CAP_SYS_ADMIN, it's hardly general.

>
> Kind regards,
> Tianjia
>