On Fri, Nov 11, 2022 at 12:53:36PM +0100, Christian Brauner wrote: > A while ago Vasily reported that it is possible to set a large number of > xattrs on inodes of filesystems that make use of the simple xattr > infrastructure. This includes all kernfs-based filesystems that support > xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be > mounted by unprivileged users in unprivileged containers and root in an > unprivileged container can set an unrestricted number of security.* > xattrs and privileged users can also set unlimited trusted.* xattrs. As > there are apparently users that have a fairly large number of xattrs we > should scale a bit better. Other xattrs such as user.* are restricted > for kernfs-based instances to a fairly limited number. > > Using a simple linked list protected by a spinlock used for set, get, > and list operations doesn't scale well if users use a lot of xattrs even > if it's not a crazy number. There's no need to bring in the big guns > like rhashtables or rw semaphores for this. An rbtree with a rwlock, or > limited rcu semanics and seqlock is enough. > > It scales within the constraints we are working in. By far the most > common operation is getting an xattr. Setting xattrs should be a > moderately rare operation. And listxattr() often only happens when > copying xattrs between files or with the filey. Holding a lock across > listxattr() is unproblematic because it doesn't list the values of > xattrs. It can only be used to list the names of all xattrs set on a > file. And the number of xattr names that can be listed with listxattr() > is limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is > passed then vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr > names are found it will return -EFBIG. In short, the maximum amount of > memory that can be retrieved via listxattr() is limited. > > Of course, the API is broken as documented on xattr(7) already. In the > future we might want to address this but for now this is the world we > live in and have lived for a long time. But it does indeed mean that > once an application goes over XATTR_LIST_MAX limit of xattrs set on an > inode it isn't possible to copy the file and include its xattrs in the > copy unless the caller knows all xattrs or limits the copy of the xattrs > to important ones it knows by name (At least for tmpfs, and kernfs-based > filesystems. Other filesystems might provide ways of achieving this.). > > Bonus of this port to rbtree+rwlock is that we shrink the memory > consumption for users of the simple xattr infrastructure. > > Also add proper kernel documentation to all the functions. > A big thanks to Paul for his comments. > > Cc: Vasily Averin <vvs@xxxxxxxxxx> > Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx> > Signed-off-by: Christian Brauner (Microsoft) <brauner@xxxxxxxxxx> > --- > > Notes: > In the v1 and v2 of this patch we used an rbtree which was protected by an > rcu+seqlock combination. During the discussion it became clear that there's > some valid concern how safe it is to combine rcu with rbtrees. While most of > the issues are highly unlikely to matter in practice the code here can be > reached by unprivileged users rather directly so we should not be adventurous. > Instead of the rcu+seqlock combination simply use an rwlock. This will scale > sufficiently as well and had been one of the implementations I considered and > wrote a little while ago. Thanks to Paul for some deeper insights into issues > associated with rcu and rbtrees! > > In addition to this patch I would like to propose that we restrict the number > of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In > other words, we restrict the number of xattrs for simple xattr filesystems to > the number of xattrs names that can be retrieved via listxattr(). That should > be about 2000 to 3000 xattrs per inode which is more than enough. We should > risk this and see if we get any regression reports from userswith this > approach. > > This should be as simple as adding a max_list member to struct simple_xattrs > and initialize it with XATTR_MAX_LIST. Set operations would then check against > this field whether the new xattr they are trying to set will fit and return > -EFBIG otherwise. I think that might be a good approach to get rid of the in > principle unbounded number of xattrs that can be set via the simple xattr > infrastructure. I think this is a regression risk worth taking. > > /* v2 */ > Christian Brauner <brauner@xxxxxxxxxx>: > - Fix kernel doc. > - Remove accidental leftover union from previous iteration. > > /* v3 */ > Port the whole thing to use a simple rwlock instead of rcu+seqlock. > > "Paul E. McKenney" <paulmck@xxxxxxxxxx>: > - Fix simple_xattr_add() by searching the correct slot in the rbtree first. > > Roman Gushchin <roman.gushchin@xxxxxxxxx>: > - Avoid calling strcmp() multiple times. > > fs/xattr.c | 317 +++++++++++++++++++++++++++++++++--------- > include/linux/xattr.h | 38 ++--- > mm/shmem.c | 2 +- > 3 files changed, 260 insertions(+), 97 deletions(-) Acked-by: Roman Gushchin <roman.gushchin@xxxxxxxxx> Thanks!