Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs

Adrian Vovk <adrianvovk@xxxxxxxxx> · Thu, 15 Feb 2024 20:14:46 -0500

On 2/15/24 18:17, Dave Chinner wrote:
On Thu, Feb 15, 2024 at 02:46:52PM -0500, Adrian Vovk wrote:
On 2/15/24 08:57, Jan Kara wrote:
On Mon 29-01-24 19:13:17, Adrian Vovk wrote:
Hello! I'm the "GNOME people" who Christian is referring to
Got back to thinking about this after a while...

On 1/17/24 09:52, Matthew Wilcox wrote:
I feel like we're in an XY trap [1].  What Christian actually wants is
to not be able to access the contents of a file while the device it's
on is suspended, and we've gone from there to "must drop the page cache".
What we really want is for the plaintext contents of the files to be gone
from memory while the dm-crypt device backing them is suspended.

Ultimately my goal is to limit the chance that an attacker with access to a
user's suspended laptop will be able to access the user's encrypted data. I
need to achieve this without forcing the user to completely log out/power
off/etc their system; it must be invisible to the user. The key word here is
limit; if we can remove _most_ files from memory _most_ of the time I think
luksSuspend would be a lot more useful against cold boot than it is today.
Well, but if your attack vector is cold-boot attacks, then how does
freeing pages from the page cache help you? I mean sure the page allocator
will start tracking those pages with potentially sensitive content as free
but unless you also zero all of them, this doesn't help anything against
cold-boot attacks? The sensitive memory content is still there...

So you would also have to enable something like zero-on-page-free and
generally the cost of this is going to be pretty big?
Yes you are right. Just marking pages as free isn't enough.

I'm sure it's reasonable enough to zero out the pages that are getting
free'd at our request. But the difficulty here is to try and clear pages
that were freed previously for other reasons, unless we're zeroing out all
pages on free. So I suppose that leaves me with a couple questions:

- As far as I know, the kernel only naturally frees pages from the page
cache when they're about to be given to some program for imminent use. But
then in that case the page isn't only free'd, but also zero'd out before it's
handed over to the program (because giving a program access to a page filled
with potentially sensitive data is a bad idea!). Is this correct?
Memory pressure does cause cache reclaim. Not just page cache, but
also slab caches and anything else various subsystems can clean up
to free memory.

Memory exposed to userspace is zeroed before userspace can access
it.  Kernel memory is not zeroed unless the caller specifically asks
for it to be zeroed.

- Are there other situations (aside from drop_caches) where the kernel frees
pages from the page cache? Especially without having to zero them anyway? In
other words, what situations would turning on some zero-pages-on-free
setting actually hurt performance?
truncate(), fallocate(), direct IO, fadvise(), madvise(), etc. IOWs,
there are lots of runtime vectors that cause page cache to be freed.

Lots.  Page contents are typically cold when the page is freed so
the zeroing is typically memory latency and bandwidth bound. And
doing it on free means there isn't any sort of "cache priming"
performance benefits that we get with zeroing at allocation because
the page contents are not going to be immediately accessed by the
kernel or userspace.

- Does dismounting a filesystem completely zero out the removed fs's pages
from the page cache?
No. It just frees them. No explicit zeroing.
I see. So even dismounting a filesystem and removing the device 
completely doesn't fully protect from a cold-boot attack. Good to know.

- I remember hearing somewhere of some Linux support for zeroing out all
pages in memory if they're free'd from the page cache. However, I spent a
while trying to find this (how to turn it on, benchmarks) and I couldn't
find it. Do you know if such a thing exists, and if so how to turn it on?
I'm curious of the actual performance impact of it.
You can test it for yourself: the init_on_free kernel command line
option controls whether the kernel zeroes on free.

Typical distro configuration is:

$ sudo dmesg |grep auto-init
[    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
$

So this kernel zeroes all stack memory, page and heap memory on
allocation, and does nothing on free...
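
As a rough sketch (not from the thread), the same check can be scripted. The helper name heap_free_state is made up for illustration, and it assumes the boot log line keeps the "heap free:on" / "heap free:off" wording shown above. Booting with init_on_free=1 on the kernel command line is what would flip the last field to "on".

```shell
#!/bin/sh
# Hypothetical helper: report the zero-on-free state from a
# "mem auto-init" boot log line passed as the first argument.
# Prints "on" or "off"; returns non-zero if the field is missing.
heap_free_state() {
    case "$1" in
        *"heap free:on"*)  echo on ;;
        *"heap free:off"*) echo off ;;
        *) return 1 ;;
    esac
}

# Usage on a live system (reading the kernel log may need root):
#   heap_free_state "$(dmesg | grep 'mem auto-init')"
```

Matching on the literal field keeps the helper independent of the timestamp prefix and of the stack and heap alloc fields.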

I see. Thank you for all the information.

So a ~5% performance penalty isn't trivial, especially to protect against 
something rare/unlikely like a cold-boot attack, but it would be quite 
nice if we could have some semblance of effort put into making sure the 
data is actually out of memory if we claim that we've done our best to 
harden the system against this scenario. Again, I'm all for best-effort 
solutions here; doing 90% is better than doing 0%...

I've got an alternative idea. How feasible would a second API be that 
just goes through free regions of memory and zeroes them out? This would 
be something we call immediately after we tell the kernel to drop 
everything it can relating to a given filesystem. So the flow would be 
something like the following:

1. user puts systemd-homed into this "locked" mode, homed wipes the 
dm-crypt key out of memory and suspends the block device (this already 
exists)
2. homed asks the kernel to drop whatever caches it can relating to that 
filesystem (the topic of this email thread)
3. homed asks the kernel to zero out all unallocated memory to make sure 
that the data is really gone (the second call I'm proposing now).
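
The three steps above can be sketched as a dry run that only prints what each step would do. This is an illustration under stated assumptions, not how homed implements anything: the function name lock_flow_plan and the account name are hypothetical, the global drop_caches write is only a system-wide stand-in for the per-filesystem drop discussed in this thread, and step 3 has no existing interface at all.

```shell
#!/bin/sh
# Dry run only: prints the command for each step instead of running it.
lock_flow_plan() {
    user="$1"   # hypothetical account name supplied by the caller
    # Step 1: exists today; homed drops the key and suspends the device.
    echo "homectl lock $user"
    # Step 2: stand-in only. drop_caches is system-wide, not per filesystem.
    echo "sync; echo 3 > /proc/sys/vm/drop_caches"
    # Step 3: no interface exists; this is the call proposed in this mail.
    echo "# TODO: zero all free memory (proposed, not implemented)"
}

# Usage: lock_flow_plan alice
```

The ordering matters: zeroing free memory only helps once the caches holding plaintext have actually been freed.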

Sure this operation can take a while, but for our use-cases it's 
probably fine. We would do this only in response to a direct user action 
(and we can show a nice little progress spinner on screen), or right 
before suspend. A couple of extra seconds of work while entering suspend 
isn't going to be noticed by the user. If the hardware supports 
something faster/better to mitigate cold-boot attacks, like memory 
encryption / SEV, then we'd prefer to use that instead of course, but 
for unsupported hardware I think just zeroing out all the memory that 
has been marked free should do the trick just fine.

By the way, something like cryptsetup might want to use this second API 
too to ensure data is purged from memory after it closes a LUKS volume, 
for instance. So for example if you have an encrypted USB stick you use 
on your computer, the data really gets wiped after you unplug it.

-Dave.