On 2024/10/15 0:31, Jan Kara wrote:
On Fri 11-10-24 10:18:04, Baokun Li wrote:
On 2024/10/9 23:50, Jan Kara wrote:
Or go one step further and add a mechanism like xfs Reverse-Mapping, which
makes sure that allocated blocks do point to the target inode, which could
replace the current block_validity, and could also be used in future online
fscks.
Well, that is a rather big change. It requires significant on-disk format
change and also performance cost when to maintain. Furthermore for xattr
blocks which can be shared by many inodes it is not even clear how to
implement this... So I'm not sure we really want to do this either.
Yes, there can be a lot of work involved.
* Perhaps we could create an rmap file to store the rmap tree to avoid
on-disk format changes.
* The performance impact of maintaining rmap really needs to be evaluated,
perhaps by writing a simple DEMO to test it.
* XFS supports shared blocks(A.K.A. reflink.), so even if the physical
blocks are the same, but the inodes are different or the logical blocks
are different, they will be recorded multiple times in the tree. So the
shared xattr block can be handled similarly.
We have plans to support online fsck in the future, and implementing rmap
is one of the steps. Perhaps one can wait until rmap is implemented to
assess whether it is worth a strict check here.
Yes, we could implement something like this be as you wrote, it's going to
be a lot of work. We've briefly discussed this with Ted on ext4 call and we
came to a conclusion that this is a type of corruption ext4 may never
protect agaist. You simply should not mount arbitrarily corrupted
filesystems...
Thank you for discussing this with Ted.
This kind of problem is really tricky.
But if you want to try, sure go ahead :)
As server clusters get larger and larger, server maintenance becomes very
difficult. Therefore, timely detection of problems (i.e. online scanning,
similar to e2fsck -fn) and timely and non-stop fixing of problems (i.e.
online fsck, similar to e2fsck -a) have always been the requirements of
our customers. Thus online fsck has been on our TODO list, and it's really
time to start doing it. 😀
One relatively easy solution to similar class of problems would be to store
the type of metadata buffer inside the buffer_head when we are verifying
checksum, clear the info when freeing the block in __ext4_forget(), and
fail with EFSCORRUPTED error when one type -> another type transition would
happen.
This solution looks great. If we save checksum further, we can sense not
only the change of block type, but also the change of block data.
But now journal_head takes up bh->b_private, so to record information,
we need to add new fields to buffer_head (e.g. b_info), or modify the logic
of journal_head (e.g. make bh->b_private hold ext4_bufdata, and include the
journal_head in ext4_bufdata. ocfs2 needs a similar adaptation).
Implementing rmap may take some time, until then we can avoid the problem
as much as possible by checking the magic and xattr block csum.
Maybe something like this?
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 7647e9f6e190..cd3ae1e3371c 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1676,6 +1676,13 @@ static int ext4_xattr_set_entry(struct
ext4_xattr_info *i,
}
}
+ if (WARN_ON_ONCE(last < here)) {
+ EXT4_ERROR_INODE(inode, "corrupted xattr entries in %s",
+ is_block ? "block" : "ibody");
+ ret = -EFSCORRUPTED;
+ goto out;
+ }
+
/* Check whether we have enough space. */
if (i->value) {
size_t free;
@@ -1923,6 +1930,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode
*inode,
}
if (s->base) {
+ struct ext4_xattr_header *hdr;
int offset = (char *)s->here - bs->bh->b_data;
BUFFER_TRACE(bs->bh, "get_write_access");
@@ -1932,6 +1940,16 @@ ext4_xattr_block_set(handle_t *handle, struct inode
*inode,
goto cleanup;
lock_buffer(bs->bh);
+ hdr = header(s->base);
+
+ if (hdr->h_magic != cpu_to_le32(EXT4_XATTR_MAGIC) ||
+ (ext4_has_metadata_csum(inode->i_sb) &&
+ (ext4_xattr_block_csum(inode, bs->bh->b_blocknr, hdr)
!=
+ hdr->h_checksum))) {
+ unlock_buffer(bs->bh);
+ error = -EFSCORRUPTED;
+ goto bad_block;
+ }
if (header(s->base)->h_refcount == cpu_to_le32(1)) {
__u32 hash = le32_to_cpu(BHDR(bs->bh)->h_hash);
Hum, there are more places in xattr code that access a buffer that could
have been modified. So why do you add check into this place? Is it somehow
special?
Honza
The out-of-bounds access occurs here because the last dentry obtained
in ext4_xattr_set_entry() is not the same as the last dentry obtained
in ext4_xattr_block_find().
When we modify an xattr, we always hold the inode lock, so the ibody
case is not considered. Check_xattrs() is called to do a full check
when looking up an xattr, so it's not necessary to consider that either.
When inserting an xattr into an xattr block, there are three cases:
* xattr block is newly allocated, and the block is allocated only
after the entry is set in memory.
* xattr block is unshared, insert the new xattr directly.
* xattr block is shared, copy the xattr block and insert the new xattr.
Only in the last two cases can a multiply claimed xattr block result in
out-of-bounds access, so the buffer lock is held to verify that the data
is correct.
(It looks like kmemdup should be moved to the buffer lock here as well.🤔)
Thanks again!
Regards,
Baokun