Re: [PATCH] shmem: add support for user extended attributes

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 15 Aug 2023 10:23:53 +0200

On Tue, Aug 15, 2023 at 09:46:22AM +0200, Franklin “Snaipe” Mathieu wrote:
> On Tue, Aug 15, 2023 at 5:52 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > Thanks for the encouragement.  At the time that I wrote that (20 July)
> > I did not expect to get around to it for months.  But there happens to
> > have been various VFS-involving works going on in mm/shmem.c recently,
> > targeting v6.6: which caused me to rearrange priorities, and join the
> > party with tmpfs user xattrs, see
> >
> > https://lore.kernel.org/linux-fsdevel/e92a4d33-f97-7c84-95ad-4fed8e84608c@xxxxxxxxxx/
> >
> > Which Christian Brauner quickly put into his vfs.git (vfs.tmpfs branch):
> > so unless something goes horribly wrong, you can expect them in v6.6.
> 
> That's great to hear, thanks!
> 
> > There's a lot that you wrote above which I have no understanding of
> > whatsoever (why would user xattrs stop rmdir failing?? it's okay, don't
> > try to educate me, I don't need to know, I'm just glad if they help you).
> >
> > Though your mention of "unprivileged" does make me shiver a little:
> > Christian will understand the implications when I do not, but I wonder
> > if my effort to limit the memory usage of user xattrs to "inode space"
> > can be undermined by unprivileged mounts with unlimited (or default,
> > that's bad enough) nr_inodes.
> >
> > If so, that won't endanger the tmpfs user xattrs implementation, since
> > the problem would already go beyond those: can an unprivileged mount of
> > tmpfs allow its user to gobble up much more memory than is good for the
> > rest of the system?
> 
> I don't actually know; I'm no expert in that area. That said, these
> tmpfses are themselves attached to an unprivileged mount namespace, so
> it would certainly be my assumption that in the case of an OOM
> condition, the OOM killer would keep trying to kill processes in that
> mount namespace until nothing else references it and all tmpfs mounts
> can be reclaimed, but then again, that's only my assumption and not
> necessarily reality.
> 
> That said, I got curious and decided to experiment; I booted a kernel
> in a VM with 1GiB of memory and ran the following commands:
> 
>     $ unshare -Umfr bash
>     # mount -t tmpfs tmp /mnt -o size=1g
>     # dd if=/dev/urandom of=/mnt/oversize bs=1M count=1000
> 
> After about a second, the OOM killer woke up and killed bash then dd,
> causing the mount namespace to be collected (and with it the tmpfs).
> So far, so good.
> 
> I got suspicious that what I was seeing was that these were the only
> reasonable candidates for the OOM killer, because there were no other
> processes in that VM besides them & init, so I modified the
> experiment slightly:
> 
>     $ dd if=/dev/zero of=/dev/null bs=10M count=10000000000 &
>     $ unshare -Umfr bash
>     # mount -t tmpfs tmp /mnt -o size=1g
>     # dd if=/dev/urandom of=/mnt/oversize bs=1M count=1000
> 
> The intent being that the first dd would have a larger footprint than
> the second because of the large block size, yet it shouldn't be killed
> if the tmpfs usage was accounted for in processes in the mount
> namespace. What happened however is that both the outer dd and the
> outer shell got terminated, causing init to exit and with it the VM.
> 
> So, it's likely that there's some more work to do in that area; I'd
> certainly expect the OOM killer to take the overall memory footprint
> of mount namespaces into account when selecting which processes to
> kill. It's also possible my experiment was flawed and not
> representative of a real-life scenario, as I clearly have interacted
> with misbehaving containers before, which got killed when they wrote
> too much to tmpfs. But then again, my experiment also didn't take
> memory cgroups into account.

So mount namespaces are orthogonal to memory accounting and they would
be the wrong layer to handle this.

Note that an unprivileged user (regular or via containers) on the system
can just exhaust all memory in various ways. Ultimately the container or
user would likely be taken down by in-kernel OOM or systemd-oomd or
similar tools under memory pressure.

Of course, all that means is that untrusted workloads need to have
cgroup memory limits. That also limits tmpfs instances and prevents an
unprivileged user from using all memory.
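
As a rough illustration of the point about cgroup memory limits, the
experiment quoted above could be rerun inside a transient systemd scope
with a memory cap. This is only a sketch under assumptions: it presumes
cgroup v2 and a systemd user session with the memory controller
delegated, the 512M default is arbitrary, and limited_unshare_cmd is a
hypothetical helper name, not an existing tool. It only prints the
command so that it can be inspected before being run.

```shell
#!/bin/sh
# Sketch: build the command that starts the unprivileged shell from the
# experiment inside a memory-limited transient scope.
# Assumptions: cgroup v2, systemd user instance with the memory
# controller delegated. limited_unshare_cmd is a hypothetical helper.

limited_unshare_cmd() {
    # First argument is the memory cap; 512M is an arbitrary default.
    limit=${1:-512M}
    # MemoryMax= caps the scope's memory usage; MemorySwapMax=0 keeps
    # the scope from using swap.
    printf 'systemd-run --user --scope -p MemoryMax=%s -p MemorySwapMax=0 unshare -Umfr bash\n' "$limit"
}

# Dry run: print the command instead of executing it.
limited_unshare_cmd "${1:-512M}"
```

Running the printed command and then repeating the mount and dd steps
inside the resulting shell should charge the tmpfs pages to that scope's
cgroup, so the limit would be expected to stop the writer before the
rest of the system comes under memory pressure.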

If you don't set a memory limit then yes, the container might be able to
exhaust all memory but that's a bug in the container runtime. Also, at
some point the OOM killer or related userspace tools will select the
container init process for termination at which point all the namespaces
and mounts go away. That's probably what you experience as misbehaving
containers. The real bug there is probably that they're allowed to run
without memory limits in the first place.