On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger <adilger@xxxxxxxxx> wrote: > On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@xxxxxxxxx> wrote: >> >> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote: >>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out] >> Yes, PIDs have traditionally been global, but today we have PID namespaces, and >> many other isolation features such as mount namespaces. Nothing is perfect, of >> course, and containers are a lot worse than VMs, but it seems weird to use that >> as an excuse to knowingly make things worse... >> Indeed. Not only PID namespaces -- we have hidepid and we can simply unmount /proc. "There are other info leaks" is a poor excuse. >>> >>>>> Fortunately, the days of timesharing seem to well behind us. For >>>>> those people who think that containers are as secure as VM's (hah, >>>>> hah, hah), it might be that best way to handle this is to have a mount >>>>> option that requires root access to this functionality. For those >>>>> people who really care about this, they can disable access. >>> >>> Or use separate filesystems for each container so that exploitable bugs >>> that shut down the filesystem can't be used to kill the other >>> containers. You could use a torrent of metadata-heavy operations >>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS >>> the other containers. >>> >>>> What would be the reason for not putting this behind >>>> capable(CAP_SYS_ADMIN)? >>>> >>>> What possible legitimate function could this functionality serve to >>>> users who don't own your filesystem? >>> >>> As I've said before, it's to enable dedupe tools to decide, given a set >>> of files with shareable blocks, roughly how many other times each of >>> those shareable blocks are shared so that they can make better decisions >>> about which file keeps its shareable blocks, and which file gets >>> remapped. Dedupe is not a privileged operation, nor are any of the >>> tools. >>> >> >> So why does the ioctl need to return all extent mappings for the entire >> filesystem, instead of just the share count of each block in the file that the >> ioctl is called on? > > One possibility is that the ioctl() can return the mapping for all inodes > owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE, > or CAP_FOWNER is set), and return an "filesystem aggregate inode" (or more > than one if there is a reason to do so) with all the other allocated blocks > for inodes the user doesn't have permission to access? Sounds like it could be reasonable. But you don't want "owned by the calling PID" precisely -- you also need to check kgid_has_mapping(current_user_ns(), inode->i_gid), I think.