On Mon, Sep 23, 2024 at 12:30:17 +0200, Lennart Poettering wrote:

> /run/ is only mounted by systemd if it is not pre-mounted already by
> the container manager. We generally assume the container manager does
> that (for example systemd-nspawn does it that way), already because
> /run/host/ is the mechanism to pass outside info/resources into the
> container in a systemd world, hence it really needs to be premounted.

Just for the record, as I've been investigating a similar issue:
systemd-nspawn does premount several tmpfses, but exhibits behaviour
similar to what the OP reported. According to the values specified in
https://github.com/systemd/systemd/blob/main/src/basic/mountpoint-util.h
containers end up with:

  /dev/shm, /tmp and others, using NESTED_TMPFS_LIMITS: size=10% of the HOST RAM
  /run, using TMPFS_LIMITS_RUN: size=20%

As there's no user quota applied, and (at least for PrivateUsers=
containers) systemd-remount-fs cannot remount these mountpoints, all
such containers are vulnerable to an unprivileged-user DoS (OOM). Only
/dev is protected against root mistakes (like cat /dev/zero > /dev/nul).
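
To get a feel for what those percentages mean on a given host, here is
a rough sketch that resolves them against MemTotal from /proc/meminfo.
The helper name pct_of_bytes is made up for illustration, and MemTotal
is only an approximation of the RAM figure the kernel uses for a
percentage tmpfs size:

```shell
#!/bin/sh
# Hypothetical helper (not part of systemd): integer percentage of a byte count.
pct_of_bytes() {
    # $1 = total bytes, $2 = percent
    echo $(( $1 * $2 / 100 ))
}

# Host RAM in bytes, taken from the MemTotal line of /proc/meminfo (given in kB).
if [ -r /proc/meminfo ]; then
    mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    host_bytes=$(( mem_kib * 1024 ))
    echo "size=10% (NESTED_TMPFS_LIMITS): $(pct_of_bytes "$host_bytes" 10) bytes"
    echo "size=20% (TMPFS_LIMITS_RUN):    $(pct_of_bytes "$host_bytes" 20) bytes"
fi
```

On a 16 GiB host this works out to roughly 1.6 GiB for each nested tmpfs
and 3.2 GiB for /run, and that is per container.
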
It would be nice to have these percent values resolved against the
container-restricted memory (like manually recalculating the sizes from
the MemoryMax= value), but as a band-aid solution I've come up with the
following service template, wanted by and ordered after the nspawn unit:

[Unit]
Description=Remount sanely tmpfs fses inside systemd-nspawn@%i
After=systemd-nspawn@%i.service

[Service]
Type=oneshot
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=1G,noexec /dev/shm 2>/dev/null'
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=2G /tmp 2>/dev/null'
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=2G /run 2>/dev/null'
SyslogIdentifier=nspawn-remount-tmpfses@%i

The above commands return EPERM from mount_setattr(); fortunately
fsconfig(4, FSCONFIG_SET_STRING, "size", "1G", 0) is called before that
and apparently works.

I use this method (nsenter) to alter nspawn configuration that has no
appropriate option in nspawn itself and is forbidden inside the
container (when unprivileged, despite being namespaced), e.g.:

  nsenter -t [...] -U -F sysctl -w user.max_user_namespaces=0

to reduce the kernel attack surface from within not-so-trusted
containers.

-- 
Tomasz Pala <gotar@xxxxxxxxxxxxx>