Re: mnt_list corruption triggered during btrfs/326

Qu Wenruo <wqu@xxxxxxxx> · Tue, 7 Jan 2025 08:30:05 +1030

在 2025/1/7 08:20, Daniel Vacek 写道:
On Sat, 4 Jan 2025 at 23:26, Qu Wenruo <wqu@xxxxxxxx> wrote:

在 2025/1/4 21:56, Christian Brauner 写道:
On Wed, Jan 01, 2025 at 07:05:10AM +1030, Qu Wenruo wrote:

在 2024/12/30 19:59, Qu Wenruo 写道:
Hi,

Although I know it's triggered from btrfs, but the mnt_list handling is
out of btrfs' control, so I'm here asking for some help.

Thanks for the report.

[BUG]
With CONFIG_DEBUG_LIST and CONFIG_BUG_ON_DATA_CORRUPTION, and an
upstream 6.13-rc kernel, which has commit 951a3f59d268 ("btrfs: fix
mount failure due to remount races"), I can hit the following crash,
with varied frequency (from 1/4 to hundreds runs no crash):

There is also another WARNING triggered, without btrfs callback involved
at all:

[  192.688671] ------------[ cut here ]------------
[  192.690016] WARNING: CPU: 3 PID: 59747 at fs/mount.h:150

This would indicate that move_from_ns() was called on a mount that isn't
attached to a mount namespace (anymore or never has).

Here's it's particularly peculiar because it looks like the warning is
caused by calling move_from_ns() when moving a mount from an anonymous
mount namespace in attach_recursive_mnt().

Can you please try and reproduce this with
commit 211364bef4301838b2e1 ("fs: kill MNT_ONRB")
from the vfs-6.14.mount branch in
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git ?

After the initial 1000 runs (with 951a3f59d268 ("btrfs: fix mount
failure due to remount races") cherry picked, or it won't pass that test
case), there is no crash nor warning so far.

It's already the best run so far, but I'll keep it running for another
day or so just to be extra safe.

So I guess the offending commit is 2eea9ce4310d ("mounts: keep list of
mounts in an rbtree")?

This one was merged in v6.8 - why would it cause crashes only now?

Because in v6.8 btrfs also migrated to the new mount API, which caused 
the ro/rw mount race which can fail the mount.

That's exactly why the test case is introduced.

Before the recent ro/rw mount fix, the test case won't go that far but 
error out early so we don't have enough loops to trigger the bug.

Putting a list and rb_tree into a union indeed seems a little dangerous,
sorry I didn't notice that earlier, but my vmcore indeed show a
seemingly valid mnt_node (color = 1, both left/right are NULL).

The union seems fine to me as long as the `MNT_ONRB` bit stays
consistent. The crashes (nor warnings) are simply caused by the flag
missing where it should have been set.

That also means the mnt_flag needs to be properly protected, at least 
with the same level of mnt_list/mnt_node.

But a lot of time such flag is atomically accessed using 
test/set/clear_bit(), without the same level of lock protection.
So my current uneducated guess is, there is a race window where the flag 
and member got de-synced.

Thus it's not as safe as a non-unioned member.

Thanks,
Qu

--nX

Thanks a lot for the fix, and it's really a huge relief that it's not
something inside btrfs causing the bug.

Thanks,
Qu

[...]

The only caller doesn't hold @mount_lock is iterate_mounts() but that's
only called from audit, and I'm not sure if audit is even involved in
this case.

This is fine as audit creates a private copy of the mount tree it is
interested in. The mount tree is not visible to other callers anymore.

So I ran out of ideas why this mnt_list can even happen.

Even if it's some btrfs' abuse, all mnt_list users are properly
protected thus it should not lead to such list corruption.

Any advice would be appreciated.

Thanks,
Qu