[LSF/MM/BPF TOPIC] Dropping page cache of individual fs

Hey,

I'm not sure this even needs a full LSFMM discussion but since I
currently don't have time to work on the patch I may as well submit it.

Gnome recently got awarded 1M Euro by the Sovereign Tech Fund (STF). The
STF was created by the German government to fund public infrastructure:

"The Sovereign Tech Fund supports the development, improvement and
 maintenance of open digital infrastructure. Our goal is to sustainably
 strengthen the open source ecosystem. We focus on security, resilience,
 technological diversity, and the people behind the code." (cf. [1])

Gnome has proposed various specific projects including integrating
systemd-homed with Gnome. Systemd-homed provides various features and if
you're interested in details then you might find it useful to read [2].
It makes use of various new VFS and fs specific developments over the
last years.

One feature is encrypting the home directory via LUKS. An appropriate
image or device must contain a GPT partition table. Currently there's
only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
a Linux filesystem. Currently supported are btrfs (see [4] though),
ext4, and xfs.

The following issue isn't specific to systemd-homed. Gnome wants to be
able to support locking encrypted home directories, for example when
the laptop is suspended. To do this the luksSuspend command can be used.

The luksSuspend call is nothing more than a device mapper ioctl to
suspend the block device and its owning superblock/filesystem, which in
turn is nothing but a freeze initiated from the block layer:

dm_suspend()
-> __dm_suspend()
   -> lock_fs()
      -> bdev_freeze()

So when we say luksSuspend we really mean block layer initiated freeze.
The overall goal or expectation of userspace is that after a luksSuspend
call all sensitive material has been evicted from relevant caches to
harden against various attacks. And luksSuspend does wipe the encryption
key and suspend the block device. However, the encryption key can still
be available in clear text in the page cache. To illustrate this problem
more simply:

truncate -s 500M /tmp/img
echo password | cryptsetup luksFormat /tmp/img --force-password
echo password | cryptsetup open /tmp/img test
mkfs.xfs /dev/mapper/test
mount /dev/mapper/test /mnt
echo "secrets" > /mnt/data
cryptsetup luksSuspend test
cat /mnt/data

This will still happily print the contents of /mnt/data even though the
block device and the owning filesystem are frozen because the data is
still in the page cache.

To my knowledge, the only current way to get the contents of /mnt/data
or the encryption key out of the page cache is via
/proc/sys/vm/drop_caches which is a big hammer.
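For reference, a minimal sketch of that big hammer. It assumes root and
the documented drop_caches values; the helper name drop_all_caches and
the optional path argument are made up for illustration and are not part
of any existing tool:

```shell
#!/bin/sh
# Sketch only: drop_all_caches is a made-up helper name.
# The knob path can be overridden (e.g. with a scratch file) for a dry run.
drop_all_caches() {
    knob=${1:-/proc/sys/vm/drop_caches}
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (need root?)" >&2
        return 1
    fi
    # 1 = page cache, 2 = dentries and inodes, 3 = both.
    # Only clean pages are dropped; dirty pages stay until written back.
    echo 1 > "$knob"
}
```

Run as root after the luksSuspend in the example above, this should
evict the cached contents of /mnt/data, but it also throws away the
clean page cache of every other filesystem on the machine.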

My initial reaction is to give userspace an API to drop the page cache
of a specific filesystem, which may have additional uses. I had
initially started drafting an ioctl() and then got swayed towards a
posix_fadvise() flag. I found out that this was already proposed a few
years ago but got rejected as it was suspected this might just be
someone toying around without a real-world use-case. I think this here
might qualify as a real-world use-case.
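
For comparison, what exists today is per-file: POSIX_FADV_DONTNEED asks
the kernel to drop the cached pages of a single file. A rough userspace
approximation of "drop the page cache of one filesystem" would be to
walk the mount and issue that advice for every regular file. The sketch
below assumes GNU find and GNU dd (whose nocache flag issues that
advice); the helper name drop_fs_cache is made up. It is only advisory,
it does not cover dirty or mapped pages or filesystem metadata, and it
would have to run before the suspend because lookups that need I/O
would block afterwards:

```shell
#!/bin/sh
# Sketch only: drop_fs_cache is a made-up helper name.
# Assumes GNU find and GNU dd.
drop_fs_cache() {
    mnt=$1
    # -xdev keeps find on the filesystem that $mnt belongs to.
    # count=0 copies nothing; iflag=nocache makes dd advise the kernel
    # (POSIX_FADV_DONTNEED) to drop the cache for the whole input file.
    find "$mnt" -xdev -type f \
        -exec dd if={} iflag=nocache count=0 status=none \;
}
```

For the example above this would be called as "drop_fs_cache /mnt". The
gaps listed above are part of why a proper kernel interface seems
preferable to a userspace walk.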

This may at least help secure users with a regular dm-crypt setup
where dm-crypt is the top layer. Users that stack additional layers on
top of dm-crypt may still leak plaintext of course if they introduce
additional caching. But that's on them.

Of course, other ideas are welcome.

[1]: https://www.sovereigntechfund.de/en
[2]: https://systemd.io/HOME_DIRECTORY
[3]: https://lore.kernel.org/linux-btrfs/20230908-merklich-bebauen-11914a630db4@brauner/
[4]: A bdev_freeze() call ideally does the following:

     (1) Freeze the block device @bdev
     (2) Find the owning superblock of the block device @bdev and freeze the
         filesystem as well.

     Especially (2) wasn't true for a long time. Filesystems could only be
     frozen when the request came via the main block device. For example, an
     xfs filesystem using an external log device could not be frozen if the
     block layer request came via the external log device. This has been
     fixed since v6.8 for all filesystems using appropriate holder
     operations.

     Except for btrfs where block device initiated freezes don't work at all;
     not even for the main block device. I've pointed this out months ago in [3].

     Which is why we currently can't use btrfs with LUKS2 encryption as a
     luksSuspend call will leave the filesystem unfrozen.
[5]: https://gitlab.com/cryptsetup/cryptsetup/-/issues/855
     https://gitlab.gnome.org/Teams/STF/homed/-/issues/23