On Mon, Dec 19, 2022 at 11:41:11AM -0500, Theodore Ts'o wrote: > On Mon, Dec 19, 2022 at 05:23:18PM +0800, Jun Nie wrote: > > > > Do you mean we have a chance to expand ea_inode in place for some > > cases? If so, a new ea_inode with larger space should be created > > to hold expanded ea_inode data, thus data have to be copied and written > > out through memory in my mind. Or anything other than CPU/memory can > > utilized for this to avoid memory usage, such as DMA? > > There are two inodes in question here. The first is the base inode, > which in this case is /file0. The second is the ea_inode which stores > the value of one of the extended attributes. In the syzkaller fuzzed > file system, there is an ea_inode field which is already created; it > contains a value which is too large to fit in the inode or the > extended attribute block; but that's OK, because we can put it in a > ea_inode. Unfortunately, we are unnecessarily created and deleting > the ea_inode (which contains the xattr *value*) when we move the xattr > from in-inode storage to the external xattr block. > > Extended attributes can be stored either in the on-disk inode, or in > an extended attribute block. The storage in the on-disk inode is > limited, but extended attributes stored don't require a random access > 4k read as in the case of the extended attribute block. So we try to > store extended attributes in the inode if possible --- especially the > ones which might be accessed frequently, such as a POSIX ACL or a > SELinux security id. > > The ext4 inode is composed of two portions. The base 128 byte inode, > which is always present, and which is what was originally used in > ext2. And the "extra inode fields", which are these fields as > currently defined at the end of struct ext4_inode: > > __le16 i_extra_isize; > __le16 i_checksum_hi; /* crc32c(uuid+inum+inode) BE */ > __le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */ > __le32 i_mtime_extra; /* extra Modification time(nsec << 2 | epoch) */ > __le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */ > __le32 i_crtime; /* File Creation time */ > __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */ > __le32 i_version_hi; /* high 32 bits for 64-bit version */ > __le32 i_projid; /* Project ID */ > > The i_extra_isize is the field that is always present for inodes > larger than 128 bytes and for which the ext4 file system feature > EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE is enabled. The i_extra_isize > field tells us how many of the fields beyond the first 128 are > present. These fields are necessary for various "advanced" (newer > than ext2) ext4 features, including metadata checksums, support for > dates beyond 2038 and sub-second timestamp granularity, file creation > time, 64-bit i_version, and the project id for project quotas. > Everything beyond i_extra_isize is used for in-inode extended > attributes. > > Now, what if we need to add extra space for new ext4 features? Well, > ext4 has a way of expanding the extra inode fields, and one of the > ways to trigger this is via the debugging mount option, > debug_want_extra_isize. In this particular syzbot reproducer, the > mount option, "debug_want_extra_isize=128" sets the i_extra_isize > field to maximum allowable size for a 256 byte inode size, and this > means that all extended attributes should be ejected out from in-inode > storage to the external extended attribute block. We do this on a > best efforts bases, when a modified inode is written back to the disk. > > The lazytime mount option delays inode updates until the very last > minute. The reason for this is to avoid multiple writes to the inode > table blocks. This improves performance by reducing random 4k writes, > and for flash based storage, reducing flash wearout for flash-based > storage. For hard drives (HDD's), it reducing random 4k writes > reduces the need to perform Adjacent Track Interference (ATI) > mitigations. ATI mitigations can significantly increase the 99.9 > percentile tail latency on file system operations, and decreasing tail > latency can be worth $$$ for some use cases[1]. > > [1] https://research.google/pubs/pub44830 > > The downside of using lazytime updates is that on a crash, the inode > timestamps might not get updated --- but very often, this is not a big > deal. And normally, when some other inode in the same inode table > block is updated, we take that opportunity to update all of the > timestamps that were deferred. Or, in the worst case, this will get > delayed until the file system is unounted. > > Now, back to the extra space expansion. Eexpanding to allow extra > inode fields to be used in the future is a "nice to have" sort of > thing. It can fail for a number of reasons, including there not being > enough space in the extended attribute block to evict the extended > attributes in the inode; or if the file system is full, we might not > be able to allocate an external block for the extended attribute block > in the first place. > > So it's OK for us to simply pass on making space for the extra inode > fields if it turns out we happen to be in the process of unmounting > file system. However, that doesn't fix the performance problem of > unnecessarily deleting and creating the ea_inode when moving the xattr > from the inode to the exernal xattr block. So fixing that performance > issue is the ideal solution. Simply passing on the extra_isize > expansion is the second best issue. Backporting an unrelated fix[2] > which papers over the problem by disallowing the mount option > nouser_xattr is the worst option, since it doesn't actually fix the > underlying file system bug. > > [2] commit 2d544ec923dbe5 ("ext4: remove deprecated noacl/nouser_xattr > options") > > Backporting [2] will shut up the syzbot reproducer, yes. But that's > because the syzbot reproducer was inadequately minimized. *This* > reproducer, which is a easier for a human to understand and which is > appropriately minimized will trigger exact same issue, with or without > > #!/bin/bash -vx > # > # This reproduces an ext4 bug caused by an unfortunate interaction > # between lazytime updates happening when a file system is being > # unmounted and expand_extra_isize > # > # Initially discovered via syzkaller: > # https://syzkaller.appspot.com/bug?id=3613786cb88c93aa1c6a279b1df6a7b201347d08 > # > img=/tmp/foo.img > dir=/mnt > file=$dir/file0 > > rm -f $img > mke2fs -Fq -t ext4 -I 256 -O ea_inode -b 1024 $img 200k > mount $img $dir > v=$(dd if=/dev/zero bs=2000 count=1 2>/dev/null | tr '\0' =) > touch $file > attr -q -s test -V $v $file > umount $dir > mount -o debug_want_extra_isize=128,lazytime /tmp/foo.img $dir > cat $file > umount $dir > > This is why your proposal to backport commit 2d544ec923dbe5 is not the > right answer. > > > per general understanding of a subsystem uninitialization, a flag > > shall be marked to reject further operation on the sub-system and > > flush the pending operation, then free the resource. In such a > > general method, current handling to create a new ea_inode should not > > crash even it is stupid. sb->s_root seems to be a key global > > resource in ext4 subsystem per my understanding, and should not be > > set as NULL until the last step of unmount operation. > > That's true in general. And yes, simply bypassing the extra_isize > expansion when the file system is being unmounted is certainly better > that backporting the unrelated commit[2]. But the true correct fix is > to optimize how we migrate the xattr from the in-inode storage to the > external xattr block. > > This is also at *best* P2 bug, since (a) it's not real security issue; > just a null pointer derference, and there is no way this could be > leveraged into any kind of denial of service or privilege escalation > attack, and (b) it requires root access, and use of a debugging option > to enable a code path which is in practice never used in production. > It is a syzkaller report, and unfortunately, there seems to be this > assumption that all syzkaller issues are P0 or P1 issues that must be > remediated right away. Which is not the case in this instance. It's > a real bug, and so it should be fixed; but it's not a high priority > bug. > > That being said, if you'd like ot become more experienced in a portion > of ext4 internals, I'd certainly invite you to try to understand how > ext4 extended attributes are managed, and try your hand at fixing this > bug. > > Best regards, > > - Ted Thanks for the elabration of the logic here. I guess below change is similiar with what we are expecting. There are 2 question in the change I do not have anwser yet. But for the crash with NULL sb->s_root, below change does not impact anything because all functions are still called. So I guess the protection with rejecting further request during umount is still needed. Or I still missed something? fs/ext4/xattr.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 7decaaf27e82..546808dbbdd6 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -2551,9 +2551,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode, is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_NOFS); bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_NOFS); - buffer = kvmalloc(value_size, GFP_NOFS); b_entry_name = kmalloc(entry->e_name_len + 1, GFP_NOFS); - if (!is || !bs || !buffer || !b_entry_name) { + if (!is || !bs || !b_entry_name) { error = -ENOMEM; goto out; } @@ -2565,14 +2564,21 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode, /* Save the entry name and the entry value */ if (entry->e_value_inum) { + buffer = kvmalloc(value_size, GFP_NOFS); + if (!buffer) { + error = -ENOMEM; + goto out; + } + error = ext4_xattr_inode_get(inode, entry, buffer, value_size); if (error) goto out; } else { size_t value_offs = le16_to_cpu(entry->e_value_offs); - memcpy(buffer, (void *)IFIRST(header) + value_offs, value_size); + buffer = (void *)IFIRST(header) + value_offs; } + /* Can we reuse entry->e_name with assumption of \0 for all e_name? */ memcpy(b_entry_name, entry->e_name, entry->e_name_len); b_entry_name[entry->e_name_len] = '\0'; i.name = b_entry_name; @@ -2585,11 +2591,6 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode, if (error) goto out; - /* Remove the chosen entry from the inode */ - error = ext4_xattr_ibody_set(handle, inode, &i, is); - if (error) - goto out; - i.value = buffer; i.value_len = value_size; error = ext4_xattr_block_find(inode, &i, bs); @@ -2597,13 +2598,18 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode, goto out; /* Add entry which was removed from the inode into the block */ + /* Can this function remove in inode xattr automatically? */ error = ext4_xattr_block_set(handle, inode, &i, bs); if (error) goto out; - error = 0; + + /* Remove the chosen entry from the inode */ + error = ext4_xattr_ibody_set(handle, inode, &i, is); + out: kfree(b_entry_name); - kvfree(buffer); + if (entry->e_value_inum && buffer) + kvfree(buffer); if (is) brelse(is->iloc.bh); if (bs)