On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote:
> On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote:
> > Hey,
> >
> > I'm not sure this even needs a full LSFMM discussion but since I
> > currently don't have time to work on the patch I may as well submit it.
> >
> > Gnome recently got awarded 1M Euro by the Sovereign Tech Fund (STF). The
> > STF was created by the German government to fund public infrastructure:
> >
> > "The Sovereign Tech Fund supports the development, improvement and
> > maintenance of open digital infrastructure. Our goal is to sustainably
> > strengthen the open source ecosystem. We focus on security, resilience,
> > technological diversity, and the people behind the code." (cf. [1])
> >
> > Gnome has proposed various specific projects including integrating
> > systemd-homed with Gnome. Systemd-homed provides various features and if
> > you're interested in details then you might find it useful to read [2].
> > It makes use of various new VFS and fs specific developments over the
> > last years.
> >
> > One feature is encrypting the home directory via LUKS. An appropriate
> > image or device must contain a GPT partition table. Currently there's
> > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
> > a Linux filesystem. Currently supported are btrfs (see [4] though),
> > ext4, and xfs.
> >
> > The following issue isn't specific to systemd-homed. Gnome wants to be
> > able to support locking encrypted home directories. For example, when
> > the laptop is suspended. To do this the luksSuspend command can be used.
> >
> > The luksSuspend call is nothing else than a device mapper ioctl to
> > suspend the block device and its owning superblock/filesystem. Which in
> > turn is nothing but a freeze initiated from the block layer:
> >
> > dm_suspend()
> > -> __dm_suspend()
> >    -> lock_fs()
> >       -> bdev_freeze()
> >
> > So when we say luksSuspend we really mean block layer initiated freeze.
> >
> > The overall goal or expectation of userspace is that after a luksSuspend
> > call all sensitive material has been evicted from relevant caches to
> > harden against various attacks. And luksSuspend does wipe the encryption
> > key and suspend the block device. However, the encryption key can still
> > be available clear-text in the page cache.
>
> The wiping of secrets is completely orthogonal to the freezing of
> the device and filesystem - the freeze does not need to occur to
> allow the encryption keys and decrypted data to be purged. They
> should not be conflated; purging needs to be a completely separate
> operation that can be run regardless of device/fs freeze status.

Yes, I'm aware. I didn't mean to imply that these things are in any way
necessarily connected. Just that there are use-cases where they are. And
the encrypted home directory case is one. Once one has frozen the block
device and filesystem one would now also like to drop the page cache
which has most of the interesting data.

The fact that after a block layer initiated freeze - again mostly a
device mapper problem - one may or may not be able to successfully read
from the filesystem is annoying. Of course one can't write, that will
hang one immediately. But if one still has some data in the page cache
one can still dump the contents of that file. That's at least odd
behavior from a user's POV even if for us it's clear why that's the
case.

And a freeze does do a sync_filesystem() and a sync_blockdev() to flush
out any dirty data for that specific filesystem. So it would be fitting
to give users an api that allows them to also drop the page cache
contents.

For some use-cases like the Gnome use-case one wants to do a freeze and
drop everything that one can from the page cache for that specific
filesystem.

And drop_caches is a big hammer simply because there are workloads where
that isn't feasible. Even on a modern boring laptop system one may have
lots of services.
On a large scale system one may have thousands of services and
they may all use separate images (and the border between isolated
services and containers is fuzzy at best). And here invoking drop_caches
penalizes every service.

One may want to drop the contents of _some_ services but not all of
them. Especially during suspend where one cares about dropping the page
cache of the home directory that gets suspended - encrypted or
unencrypted.

Even ignoring the security aspect itself: having frozen the block device
and the owning filesystem, one may want to go and drop the page cache as
well without impacting every other filesystem on the system. Which may
be thousands. One doesn't want to penalize them all.

Ignoring the specific use-case I know that David has been interested in
a way to drop the page cache for afs. So this is not just for the home
directory case. I mostly wanted to make it clear that there are users of
an interface like this; even if it were just best effort.

>
> FWIW, focussing on purging the page cache omits the fact that
> having access to the directory structure is a problem - one can
> still retrieve other user information that is stored in metadata
> (e.g. xattrs) that isn't part of the page cache. Even the directory
> structure that is cached in dentries could reveal secrets someone
> wants to keep hidden (e.g. code names for operations/products).

Yes, of course but that's fine. The most sensitive data and the biggest
chunks of data will be the contents of files. We don't necessarily need
to cater to the paranoid with this.

>
> So if we want luksSuspend to actually protect user information when
> it runs, then it effectively needs to bring the filesystem right
> back to its "just mounted" state where the only thing in memory is
> the root directory dentry and inode and nothing else.

Yes, which we know isn't feasible.
>
> And, of course, this is largely impossible to do because anything
> with an open file on the filesystem will prevent this robust cache
> purge from occurring....
>
> Which brings us back to "best effort" only, and at this point we
> already have drop-caches....
>
> Mind you, I do wonder if drop caches is fast enough for this sort of
> use case. It is single threaded, and if the filesystem/system has
> millions of cached inodes it can take minutes to run. Unmount has
> the same problem - purging large dentry/inode caches takes a *lot*
> of CPU time and these operations are single threaded.
>
> So it may not be practical in the luks context to purge caches e.g.
> suspending a laptop shouldn't take minutes. However laptops are
> getting to the hundreds of GB of RAM these days and so they can
> cache millions of inodes, so cache purge runtime is definitely a
> consideration here.

I'm really trying to look for a practical api that doesn't require users
to drop the caches for every mounted image on the system.

FYI, I've tried to get some users to reply here so they could speak to
the fact that they don't expect this to be an optimal solution but none
of them know how to reply to lore mboxes so I can just relay
information.
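
To make "best effort, scoped to one filesystem" a bit more concrete,
here is a rough userspace sketch of what can be approximated today
without touching drop_caches. It is not the interface being proposed,
only an illustration under stated assumptions: it walks a mount point
without crossing onto other filesystems and calls posix_fadvise() with
POSIX_FADV_DONTNEED on every regular file. The helper names
(drop_file_cache, drop_tree_cache) are made up for this example. The
advice only discards clean, unmapped pages, and the walk itself
populates the dentry and inode caches, so this does nothing for the
metadata concerns raised above.

```python
import os
import stat


def _on_device(path, dev):
    """Return True if `path` exists and lives on device `dev`."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def drop_file_cache(path):
    """Advise the kernel to drop cached pages of one file (best effort)."""
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        # offset=0 and len=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def drop_tree_cache(root):
    """Walk `root` on a single filesystem; return (advised, skipped)."""
    root_dev = os.lstat(root).st_dev
    advised = skipped = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that sit on a different filesystem so that
        # other mounts below `root` are left alone.
        dirnames[:] = [d for d in dirnames
                       if _on_device(os.path.join(dirpath, d), root_dev)]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode) or st.st_dev != root_dev:
                    skipped += 1  # symlinks, fifos, sockets, other mounts
                    continue
                drop_file_cache(path)
                advised += 1
            except OSError:
                skipped += 1  # vanished or unreadable: best effort only
    return advised, skipped
```

Something like drop_tree_cache("/home/user") would be run once the
filesystem's dirty data has been written back. It has to visit every
file from userspace, which is exactly the kind of cost and incompleteness
that makes a proper per-filesystem interface attractive.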