On Tue, Oct 15, 2024 at 12:04:04PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 15, 2024 at 05:55:23PM +1100, Dave Chinner wrote:
> > On Thu, Oct 10, 2024 at 05:49:40PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Create a xfs_metafile_iget function for metadata inodes to ensure that
> > > when we try to iget a metadata file, the inobt thinks a metadata inode
> > > is in use and that the metadata type matches what we are expecting.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> > > ---
> > .....
> > 
> > > +/*
> > > + * Get a metadata inode. The metafile @type must match the inode exactly.
> > > + * Caller must supply a transaction (even if empty) to avoid livelocking if the
> > > + * inobt has a cycle.
> > 
> > Is this something we have to be concerned with for normal operation?
> 
> I think we should be more concerned about this everywhere.  Remember
> that guy who fuzzed a btree so that the lower levels pointed "down" to
> an upper level btree block?

Yes, but that's kinda my point: it requires malicious tampering to
create these loops. Once someone can tamper with the data held on the
device, it's game over. It's far easier to trojan and setuid a file
that the user will run (or be tricked into running) and break into
the system that way than it is to exploit a flaw in filesystem
metadata. There's every chance that the metadata exploit will be
detected and prevented, but no filesystem code can check user data
for trojans or modified-but-valid inode ownership and permissions.

In all my time working on XFS, I've never seen an unintentionally
corrupted filesystem with a cycle in btree metadata. I guess it could
happen if an XFS bug slipped through testing, but that seems like a
very unlikely occurrence.

Hence, based on my experience, btree cycles are mainly an academic
concern, not an actual security or reliability issue that we really
need to be concerned about.

> Detection and recovery from that is what
> running with xfs_trans_alloc_empty() buys us.  It's not a general cycle
> detector as you point out below...

It's not a detection strategy at all - it's a workaround.

When we traverse the btree, we know all the higher level btree blocks
we have locked already - they are held in the cursor. However, we
don't check that the btree ptr we are about to follow is already
locked at a higher level in the cursor before we try to read the
buffer from disk.

Hence the workaround: rely on the transaction allowing nested access
to locked buffers so that we can skip the second buffer lookup, then
decode the btree block again and notice that the level in the btree
block header is wrong.

If we descend to a non-ancestor higher layer btree block, then there
is no deadlock potential - we get the locked buffer, decode it and
then find it is from the wrong level and abort.

IOWs, we *should* be checking the full btree path cursor for
recursion before we follow *any* btree pointer rather than relying on
the transaction structure to prevent deadlock before we can detect we
descended to the wrong layer.
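Something like the completely untested sketch below, run before we
read the target buffer in the descent path. The helper name is made
up, and I'm assuming the current bc_levels[] cursor layout and that
it lives in xfs_btree.c where it can use xfs_btree_ptr_to_daddr():

static int
xfs_btree_check_path_cycle(
	struct xfs_btree_cur		*cur,
	const union xfs_btree_ptr	*ptr)
{
	xfs_daddr_t			daddr;
	int				level;
	int				error;

	error = xfs_btree_ptr_to_daddr(cur, ptr, &daddr);
	if (error)
		return error;

	/*
	 * If the block we are about to follow is already locked in the
	 * path cursor, the btree contains a cycle. No IO, no new
	 * locks, and unwinding is just releasing the path cursor.
	 */
	for (level = 0; level < cur->bc_nlevels; level++) {
		struct xfs_buf	*bp = cur->bc_levels[level].bp;

		/* Root blocks can live in the inode fork, not a buffer. */
		if (bp && xfs_buf_daddr(bp) == daddr)
			return -EFSCORRUPTED;
	}
	return 0;
}

That's a handful of pointer compares per block we descend through -
no transaction context required at all.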
> > We don't care about needing to detect inobt cycles like this when
> > doing lookups from the VFS during normal operation, so why is this
> > an issue we have to be explicitly concerned about during metadir
> > traversals?
> > 
> > Additionally, I can see how a corrupt btree ptr loop could cause a
> > *deadlock* without a transaction context (i.e. end up waiting on a
> > lock we already hold) but I don't see what livelock the transaction
> > context prevents. It appears to me that it would only turn the
> > deadlock into a livelock because the second buffer lookup will find
> > the locked buffer already linked to the transaction and simply take
> > another reference to it. Hence it will just run around the loop of
> > buffers taking references forever (i.e. a livelock) instead of
> > deadlocking.
> 
> ...because we don't detect loops within a particular level of a btree.
> In theory this is possible if you are willing to enhance (for example)
> a XFS_BB_RIGHTSIB move by comparing the rightmost record in the old
> block against the leftmost record in the new block, but right now the
> codebase (except for scrub) doesn't do that.

So add that detection to the btree increment and decrement
operations, as per the sketch below.

We also need to perform the same pointer following checks as for
descending - the sibling pointer traversal could ascend or descend
the tree to a locked buffer in the path cursor. Descent needs to be
checked because we could be walking siblings at an interior level
(e.g. inserting new pointers back up the path).
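For the level cycle case, the ordering information we need is already
in the two blocks we hold locked across the sibling move, so there's
no transaction requirement there either. A rough, untested sketch for
the record-level case - made-up helper name, leaning on the existing
->recs_inorder() method that the DEBUG checks and scrub already use;
interior levels would do the same comparison on keys with
->keys_inorder():

static int
xfs_btree_check_sibling_order(
	struct xfs_btree_cur	*cur,
	struct xfs_btree_block	*old_block,
	struct xfs_btree_block	*new_block)
{
	union xfs_btree_rec	*last_rec;
	union xfs_btree_rec	*first_rec;

	/* Non-root blocks must always have at least minrecs records. */
	if (!be16_to_cpu(old_block->bb_numrecs) ||
	    !be16_to_cpu(new_block->bb_numrecs))
		return -EFSCORRUPTED;

	last_rec = xfs_btree_rec_addr(cur,
			be16_to_cpu(old_block->bb_numrecs), old_block);
	first_rec = xfs_btree_rec_addr(cur, 1, new_block);

	/*
	 * Records must keep ascending across the sibling link. If the
	 * new block sorts at or below the old one, we've walked into a
	 * same-level cycle or a misordered tree.
	 */
	if (!cur->bc_ops->recs_inorder(cur, last_rec, first_rec))
		return -EFSCORRUPTED;
	return 0;
}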
Fundamentally, cycle detection on search traversal does not require
transaction contexts to perform or unwind. The btree search code
tracks all the btree blocks the traversal has locked at any time,
hence we can detect direct recursion without any risk of deadlocks by
walking the path cursor. We can also unwind correctly if we find a
cycle simply by releasing the path cursor.

All the more complex cases with dangling locked btree buffers occur
when modifications are in progress. In these cases we always have a
transaction context, so tracking all the non-path locked buffers for
buffer-lock-based recursion deadlock protection happens naturally.

> Maybe someone'll do that some day; it would close a pretty big hole in
> runtime problem detection.

I'd much prefer that we either:

1) fix the runtime detection for cycles properly, because that avoids
the need for transaction contexts for pure lookups completely; or

2) stop adding complexity and overhead for random one-off operations,
because it's only a partial workaround and doesn't provide us any
real protection against malicious cycle-inducing attacks.

> > Another question: why are we only concerned about cycles in the
> > inobt? If we've got a cycle in any other btree the metadir code
> > might interact with (e.g. directories, free space, etc), we're
> > going to have the same problems with deadlocks and/or livelocks on
> > tree traversal.
> 
> We /do/ hold empty transactions when calling xfs_metadir_load so that we
> can error out on wrong-level types of dabtree cycles instead of hanging
> the kernel.

The dabtree code has a full path cursor and so can unwind correctly
when level cycles are detected on lookup. So, again, we should be
fixing this by using the path cursor to detect direct path cycles
that would deadlock, along with level+sibling order checks to find
other non-path based cycles.

> > > -	sbp = &mp->m_sb;
> > > -	error = xfs_iget(mp, NULL, sbp->sb_rbmino, 0, 0, &mp->m_rbmip);
> > 
> > .... and it's clear that we currently treat the superblock inodes as
> > trusted inode numbers. We don't validate that they are allocated
> > when we look them up, we trust that they have been correctly
> > allocated and properly referenced on disk.
> > 
> > It's not clear to me why there's an undocumented change of
> > trust for internal, trusted access only on-disk metadata being made
> > here...
> 
> I feel it prudent to check for inconsistencies between the inobt state
> and the inode records themselves, because bad space metadata can cause
> disproportionately more problems in a filesystem than a regular file's
> metadata.

Yet we still implicitly trust the contents of AGFs, AGIs, the free
space btrees, etc despite the fact the same arguments can be made
about them....

Regardless of this inconsistency, let's follow this "we need more
validation because the metadata is more important" argument through
to its logical conclusion.

Essentially, the argument is that we can't trust the metadir
directory contents because there might be some new inconsistency in
the allocated inode btree that might have marked the metadir inode as
free. i.e. we are trying to protect against XFS implementation bugs,
storage media failure or malicious corruption.

Let's address the XFS implementation bug aspect, which is essentially
the same as having a CRC error in an inobt block: detecting an inobt
corruption at runtime will result in filesystem shutdown regardless
of whether metadir is in use or not. Mounting the filesystem again
will either succeed or fail depending on whether the inobt corruption
is tripped over during the mount. Currently that only depends on
unlinked inode recovery; metadir would add another inobt traversal
path before we perform unlinked inode recovery. Either way, the
traversals may not encounter the corruption that shut the filesystem
down, so the mount can still succeed (or fail) regardless of whether
metadir exists or whether it does an inobt lookup for its internal
inodes. Repair, however, is still necessary to resolve the
corruption.

IOWs, doing an inobt check via an untrusted lookup doesn't actually
change anything material w.r.t. inobt corruption. We're entirely
unlikely to detect an otherwise unknown inobt corruption with this
check at mount time, and if we already have an inobt corruption then
the untrusted lookup isn't guaranteed to encounter it and prevent
metadir instantiation or mount. It's kinda insane to say we don't
trust the inode number from a verified directory structure, but we
will absolutely trust a known corrupted inobt to validate that inode
number!

i.e. if we are going to perform untrusted inode lookups under the
guise of "this metadata is very important to verify", then we first
need to verify that the -entire- inobt, including the records we are
verifying the inode number against, is not corrupt. i.e. if we don't
trust our own inode numbers, then we've got to scrub a good chunk of
the filesystem metadata before we can use the inobt as a source of
trust to validate those inode numbers.
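For reference, as far as I recall all XFS_IGET_UNTRUSTED changes is
the inode mapping step. Paraphrasing xfs_imap() from memory - this is
a sketch of the logic, not the exact kernel code:

	if (flags & XFS_IGET_UNTRUSTED) {
		/*
		 * "Validate" the inode number with a single lookup of
		 * the inobt record covering it. Nothing validates the
		 * inobt itself here beyond the normal per-block
		 * verifiers.
		 */
		error = xfs_imap_lookup(pag, tp, agino, agbno,
				&chunk_agbno, &offset_agbno, flags);
		if (error)
			return error;
	} else {
		/*
		 * Trusted: compute the inode cluster location directly
		 * from the inode number without touching the inobt.
		 */
	}

That single record lookup is the entire extent of the extra checking
the untrusted path provides.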
As for the malicious tampering case, we can't trust any metadata in
the filesystem to be consistent. IOWs, if we can't trust that the
metadir inode pointers and the inode btree are consistent because
someone might have laid a cycle-based deadlock trap in the inobt,
then we can't trust that the block mapping btree in a metadir inode
is consistent for the same reasons. Not only that, it could point to
any attacker controlled block in the filesystem and not to the
metadata it is supposed to contain, right? So now we have to validate
the BMBT is consistent before we use the metadata the metadir points
to.

To do that, we have to check that the free space and rmap btree
records are consistent with what the BMBT says, and then we have to
check that the refcount btree is consistent with what both the BMBT
and RMAPBT say about that block.

So if these all check out, we're good, right? Well, no, we aren't.
The attacker could have tampered with all of these btrees, making the
runtime checks look consistent when in fact they hide that the
metadir contains attacker controlled data.

Of course, we haven't validated any of the non-metadir metadata in
the filesystem, but the attacker has control of that, too. They could
have manipulated that metadata to cross-link metadir contents with
user files and directories. This could give userspace direct control
of metadir file contents.

And that leads us to the point of needing a full filesystem metadata
scrub at mount time to validate that the metadir structure has not
been tampered with in any way and that none of the external metadata
has illegal references into the metadir structure.

-----

Regardless of the source of metadir/inobt corruption, if we are going
to consider this "more important metadata", then we need to ensure
that our verification model reflects that.

There's a fundamental principle involved here. The moment we say "we
cannot trust this bit of our own metadata", then we have to
acknowledge that we cannot trust *any* of our own metadata that we
store in the same context. The metadir inodes, directory blocks and
inobt blocks we are talking about here can be -physically located- in
nearby blocks on the storage device. The LBAs might even be close
enough together that they get placed in the same erase block in flash
storage. Hence we cannot make any assertions that there are
differences in trust, reliability, recoverability or robustness
between the different types of metadata we store in the filesystem.

Hence we have a simple choice: either we trust our internal metadata
to be consistent with other internal metadata, or we don't trust
anything. The XFS verification architecture is firmly planted in the
"we trust internal metadata to be consistent" corner. This, at the
most basic level, means all filesystem metadata is as trustworthy as
any other filesystem metadata until proven otherwise....

Hence from an architectural point of view, arbitrary implementation
declarations that "we don't trust this metadata but we will trust
that metadata" make little sense.

> Unlike regular files that can be unloaded and reloaded at will, we
> only do this once, so the overhead of the inobt walk is amortized
> over the entire mount-lifetime of the filesystem because we don't
> irele metadata inodes.

I suspect the need for dynamic metadir file behaviour won't be that
far away. At minimum, we're going to have to be able to dynamically
unlink and free rtg inodes if we want to shrink RT volumes.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx