On Tue, Sep 19, 2023 at 06:17:21AM +0100, Matthew Wilcox wrote:
> On Tue, Sep 19, 2023 at 11:15:54AM +1000, Dave Chinner wrote:
> > This was easy to do with iomap based filesystems because they don't
> > carry per-block filesystem structures for every folio cached in page
> > cache - we carry a single object per folio that holds the 2 bits of
> > per-filesystem-block state we need for each block the folio maps.
> > Compare that to a bufferhead - it uses 56 bytes of memory per
> > filesystem block that is cached.
>
> 56?!  What kind of config do you have?  It's 104 bytes on Debian:
>
> buffer_head   936  1092  104  39  1 : tunables 0 0 0 : slabdata 28 28 0
>
> Maybe you were looking at a 32-bit system; most of the elements are
> word-sized (pointers, size_t or long).

Perhaps so, it's been years since I actually paid attention to the
exact size of a bufferhead (XFS completely moved away from them back
in 2018). Regardless, underestimating the size of the bufferhead
doesn't materially change the reasons iomap is a better choice for
filesystems running on modern storage hardware...

> > So we have to consider that maybe it is less work to make high-order
> > folios work with bufferheads. And that's where we start to get into
> > the maintenance problems with old filesystems using bufferheads -
> > how do we ensure that the changes for high-order folio support in
> > bufferheads do not break one of these old filesystems that use
> > bufferheads?
>
> I don't think we can do it.  Regardless of the question you're
> proposing here, the model where we complete a BIO, then walk every
> buffer_head attached to the folio to determine if we can now mark the
> folio as being (uptodate / not-under-writeback) just doesn't scale
> when you attach more than tens of BHs to the folio.  It's one bit per
> BH rather than having a summary bitmap like iomap has.

*nod*

I said as much earlier in the email:

"The pointer chasing model that per-block bufferhead iteration requires
to update state and retrieve mapping information just does not scale to
marshalling millions of objects a second through the page cache."

(There's a rough sketch of the summary bitmap model further down.)

> I have been thinking about splitting the BH into two pieces, something
> like this:
>
> struct buffer_head_head {
> 	spinlock_t b_lock;
> 	struct buffer_head *buffers;
> 	unsigned long state[];
> };
>
> and remove BH_Uptodate and BH_Dirty in favour of setting bits in state
> like iomap does.

Yes, that would make it similar to the way iomap works, but I think it
then creates more problems in that bufferhead state is also used for
per-block locking and blocking waits. I don't really want to think much
more about how complex stuff like __block_write_full_folio() becomes
with this model...

> But, as you say, there are a lot of filesystems that would need to be
> audited and probably modified.

Yes, this is the common problem all these "modernise old API" ideas end
up at - this is the primary issue that needs to be sorted out, and we're
no closer to that now than when the thread started.

We can deal with this problem for filesystems that we can test. For
stuff we can't test and verify, we really have to start considering the
larger picture around shipping unverified code to users.
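To make the summary bitmap point a bit more concrete, here's a rough
userspace sketch of the per-folio state model. The names
(folio_block_state and friends) are made up purely for illustration -
this is the shape of the idea, not the actual iomap code:

/*
 * One small object per folio, tracking per-block state in a bitmap,
 * instead of a full buffer_head per block. Illustrative only.
 */
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCKS_PER_FOLIO	512	/* e.g. 2MB folio, 4kB blocks */
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS	((BLOCKS_PER_FOLIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct folio_block_state {
	/*
	 * Bit N set means block N is uptodate. A second bitmap of the
	 * same shape would track per-block dirty state.
	 */
	unsigned long uptodate[BITMAP_LONGS];
};

static void set_block_uptodate(struct folio_block_state *fbs,
			       unsigned int block)
{
	fbs->uptodate[block / BITS_PER_LONG] |= 1UL << (block % BITS_PER_LONG);
}

/* Model of read IO completion for a contiguous range of blocks. */
static void read_completion(struct folio_block_state *fbs,
			    unsigned int first, unsigned int nr)
{
	for (unsigned int i = first; i < first + nr; i++)
		set_block_uptodate(fbs, i);
}

/*
 * Folio-level check: a handful of word compares on the summary bitmap.
 * (Assumes BLOCKS_PER_FOLIO is a multiple of BITS_PER_LONG.)
 */
static bool folio_uptodate(const struct folio_block_state *fbs)
{
	for (size_t i = 0; i < BITMAP_LONGS; i++)
		if (~fbs->uptodate[i])
			return false;
	return true;
}

int main(void)
{
	struct folio_block_state *fbs = calloc(1, sizeof(*fbs));

	read_completion(fbs, 0, BLOCKS_PER_FOLIO / 2);
	printf("half done:  uptodate=%d\n", folio_uptodate(fbs));
	read_completion(fbs, BLOCKS_PER_FOLIO / 2, BLOCKS_PER_FOLIO / 2);
	printf("fully done: uptodate=%d\n", folio_uptodate(fbs));
	free(fbs);
	return 0;
}

Answering "is the folio uptodate?" is a few word-sized compares on the
bitmap. With bufferheads, the same question means walking a list of
104-byte objects hanging off the folio and testing a flag bit in each
one - that's the pointer chasing that doesn't scale. Anyway, back to
the bigger problem of code we can't verify.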
Go read this article on LWN about new EU laws for software development
that aren't that far off being passed into law:

https://lwn.net/Articles/944300/

And it's clear that there are also current policy discussions going
through the US federal government that are, most likely, going to end
up in a similar place with respect to secure development practices for
critical software infrastructure like the Linux kernel.

Now combine that with this one about the problem of bogus CVEs (which
could have been written about syzbot and filesystems!):

https://lwn.net/Articles/944209/

And it's pretty clear that the current issues with unmaintained code
will only get worse from here. All it will take is a CVE to be issued
against one of these unmaintained filesystems, and the safest thing for
us to do will be to remove the code to remove all potential liability
for it.

The basic message is that we aren't going to be able to ignore code
that we can't substantially verify for much longer. We simply won't
have a choice about the code we ship: if it is not testable and
verified to the best of our abilities, then nobody will risk shipping
it regardless of whether it has users or not. That's the model the
cybersecurity-industrial complex is pushing us towards whether we like
it or not.

If this is the future in which we develop software, then it has
substantial impact on any discussion about how to manage old,
unmaintained, untestable code in any project we work on, not just the
Linux kernel...

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx