On Fri, 2018-01-05 at 12:25 -0700, Jason Gunthorpe wrote:
> On Fri, Jan 05, 2018 at 01:06:58PM -0500, Doug Ledford wrote:
> > > Do the userspace daemons still manage the connection to SRP?
> > >
> > > If yes, then the networking information should be relative to the
> > > namespace of the thing that wrote to the sysfs file.
> >
> > Maybe, maybe not.  It depends on the implementation.  IIRC you get
> > one daemon per port, not one daemon per mount.
>
> I don't think it depends - if we expose this sysfs file to a container

Who says we have to do that?  We could make the sysfs file only visible
in the init namespace and let the init namespace daemon control what
namespaces have what views.  That was my point: the implementation can
be flexible.

And actually, if most containers mount sysfs ro as you say below, then
the init namespace daemon would need to create the namespace views
anyway.  We could just make that mandatory by refusing to create
devices from anything other than the init_net namespace.  Then even if
someone does mount sysfs rw in a container, we're still good.

> then anything less than using the contained net namespace sounds like
> it is a path to allow the container to escape its net namespace.

I'm a little concerned that this is a problem now regardless.

> The complication here is that sysfs creates a device, and that device
> is currently created in the host namespace.

Let's assume, for the sake of what I'm writing below, that we modify
the srp daemon so that every line in the srp_daemon.conf file can
optionally specify a namespace (a sketch of what such a line might look
like follows below).  When a namespace is present, the daemon passes it
to the kernel, and the kernel creates the *device* file for that device
in that specific namespace, which is really the only thing we care
about.  For filesystem-based access, as opposed to direct device
access, you want to create the device file in the init_net namespace,
mount the device in the init_net namespace, and then follow the typical
filesystem namespace rules to determine what the client namespaces can
see; in that situation the client need know nothing about SRP, it is
only using a filesystem in a namespace.

> So from a security perspective containers shouldn't even have access
> to this thing at all without more work to ensure that the created
> block device is also restricted inside the container.

This isn't sufficient.  The block device created must be constrained
within the container, but if we allow direct device access to the
underlying LUN on the target, then that target LUN must be exclusively
owned by the container.  No other container, nor the host, can be
allowed to have any access of any sort, or it becomes a message-passing
bypass around containerization.  It becomes easier, then, to allow the
init_net daemon to create all of the devices, and once it creates a
single mapping to any LUN, that LUN cannot be reused for any other
mapping.  So a LUN can be either A) a mounted filesystem in the
init_net namespace, with other namespaces carved out of the filesystem
as appropriate, or B) a direct-access device that is accessible in
exactly one namespace only.  We can't actually rely on srp_daemon to
enforce this; we have to do it at the kernel level, but I think that's
what we need to do (if we don't simply bar direct device access from a
container, period).
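Two sketches to make the above concrete.  First, the srp_daemon.conf
idea.  This is purely hypothetical - the ns= keyword does not exist
today, and I'm assuming the usual a/d rule format with match keys like
ioc_guid=:

    # Hypothetical only: ns= is not existing srp_daemon.conf syntax.
    # Create this target's device node in the named net namespace;
    # rules without ns= behave as today (device lands in init_net).
    a   ioc_guid=00a0b8ffff8340d1   ns=container1
    a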
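Second, a minimal sketch of the kind of kernel-level check I mean.
srp_lun_claimed() is an invented helper - nothing like it exists in
drivers/infiniband/ulp/srp today - the point is just the two rules:

    #include <net/net_namespace.h>

    /*
     * Minimal sketch only.  srp_lun_claimed() is a hypothetical helper
     * that remembers every (target, LUN) pair ever mapped.
     */
    static int srp_may_create_target(struct net *caller_net,
                                     u64 ioc_guid, u64 lun)
    {
            /* Rule 1: devices may only be created from init_net. */
            if (!net_eq(caller_net, &init_net))
                    return -EPERM;

            /* Rule 2: once a LUN is mapped, it is never mapped again. */
            if (srp_lun_claimed(ioc_guid, lun))
                    return -EBUSY;

            return 0;
    }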
The only difficulty I see here is multipath.  You still want to support
it, especially for the host OS, but at the same time, you can't allow
one container to get one path and a different container to get another
path to the same device.

> Since it is a sysfs file, and most container systems mount sysfs ro,
> we can probably get away with ignoring namespaces for now?
>
> But using the current process namespace is also a good choice.
>
> In principle there can be multiple srp_daemons if they can coordinate
> which ones do which.  For instance a container could run its own
> srp_daemon restricted to the pkeys the container has access to.  If
> the device stuff above was fixed then this would even make some
> sense...
>
> Otherwise srp_daemon has to run in the host namespace, where the
> created devices end up, and it rightly should not see the netdevices
> that are assigned to other namespaces.

This problem is made more difficult by the fact that there is
persistent storage at the other end of the connection.  It doesn't
really matter what netdevice we access a target through: if the
accesses go to the same physical media at the other end, then they
can't be shared across namespaces without creating a containerization
leak.  With netdevices we have a unique MAC/vlan/IP tuple, remote
systems only know us by that tuple, and our containerized code can't
reach beyond those boundaries.  But with disks, the issue is different.
If we allow direct device access in the container, then (as best we
can; there may be problems we simply can't solve) we need the container
bubble to extend all the way around the physical media we are allowing
access to on the remote target system.  We might just have to turn off
all direct device file access in containers for iser and srp and
nvmeof...
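One possible way to square multipath with the exclusivity rule above -
and this is my speculation, not existing code - is to key LUN ownership
by the media's identity rather than by the path to it, e.g. the SCSI
VPD page 0x83 designator that multipath-tools already uses as the WWID.
Then two different paths to the same media collide in the same claim,
and exactly one namespace can hold it; multipath in the host still
works because all paths carry the same WWID and the same init_net
owner.  A rough sketch, with the claim table and types invented for
illustration:

    #include <linux/list.h>
    #include <linux/string.h>
    #include <net/net_namespace.h>

    /*
     * Hypothetical: one claim per physical LUN, keyed by the VPD page
     * 0x83 designator so every path to the same media hits the same
     * claim.
     */
    struct srp_lun_claim {
            char wwid[128];         /* VPD 0x83 device identifier */
            struct net *owner;      /* sole namespace allowed access */
            struct list_head list;
    };

    static LIST_HEAD(srp_lun_claims);

    static bool srp_path_conflicts(const char *wwid, struct net *ns)
    {
            struct srp_lun_claim *c;

            list_for_each_entry(c, &srp_lun_claims, list)
                    if (!strcmp(c->wwid, wwid) && !net_eq(c->owner, ns))
                            return true;  /* same media, different owner */
            return false;
    }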
--
Doug Ledford <dledford@xxxxxxxxxx>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD