On Mon, May 01, 2023 at 11:18:28AM +0200, Aleksandr Nogikh wrote: > > > > There are two reproducers, one that mounts a sysv file system, and the > > other which mounts a udf file system. There is no mention of ext4 in > > the stack trace, and yet syzbot has assigned this to the ext4 > > subsystem for some unknown reason. > > In this particular case, there were two ext4-related crashes as well: > > https://syzkaller.appspot.com/text?tag=CrashReport&x=14c7dd1b480000 > https://syzkaller.appspot.com/text?tag=CrashReport&x=1153a07f480000 > > I think syzbot picked a too generic frame as a bug title and it just > got confused by crashes belonging to different filesystems. > Maybe we need to display subsystems for individual crashes as well, so > that it's easier for a human to understand what's going on.. Hmm.... the the two ext4-related crashes have very different stack traces. In the first, there was some ext4 functions on the stack trace, but those are not relevant, since we take an interrupt, and it was apparently an arm64 IPI, which then results in the call to invalidate_bh_lrus(), which then triggers the warning. So this *might* be related to ext4, but it could be any other file system that could have been mounted at the same time, since syzbot runs multiple streams of system calls in parallel. In the second stack trace, we were in the middle of unmounting the file system, and ext4_but_super() called invalidate_bdev(), which in turn called invalidate_bh_lrus(), and that's when the buffer head layer finds a buffer head whose refcount is zero, and that triggers the: VFS: brelse: Trying to free free buffer ... which then leads to the WARNING in invalidate_bh_lru(). So yes, the problem is that syzbot chose generic description based on the warning. An analogy might be a medical examiner in a police procedural TV show[1] having seen a dozen corpses that were shot in the torso, leading to them bleeding out and dying, erroneously concluding that all six homocides *must* be related. In fact, the size of the bullet might be different, and who might have been holding the gun might be very different. [1] https://www.youtube.com/watch?v=lMalvNeJFLk Worse, even if we did a better job disambiguating based on the stack trace, without a reliable reproducer, the stack trace is of very limited utility for trying to track down this kind of bug. Going back to the crime scene analogy, the stack trace basically tells us where the body was found. It doesn't tell us much about who might have been fired the bullet. That is, we need to know who called brelse() or put_bh() on the buffer_head one too many times, leading to the report that there was an attempt to free a free buffer. (Fortunately, such bugs a relatively rare, although as this syzbot report demonstrates, some still exist --- if I had to guess, the bugs are on some error handling path which is very rarely hit.) Perhaps syzbot could special case handling for those objects (like struct buffer_head) which have a refcount and where the refcount can go negative, and treat functions that manipulate such objects much like how syzbot handles KASAN errors? The tricky bit is the immediate caller of the invalidate_bh_lrus() is not necessarily the best one to use. For example, invalidate_bdev() calls invalidate_bh_lrus(), and it's the caller of invalidate_bdev() which is interesting in this case --- e.g., ext4_put_super(). In the first stack trace, nothing in the stack trace is helpful, so there may not much we can do debug this unless we can get a reproducer that reproduces this particular stack trace. In general, though, syzbot should disregard any functions after the functions indicating that an interrupt had taken place, since which code was interrupted when the IRQ came in is just an innocent victim. For now, given that we *do* have a two reproducers, one involving sysv and the other udf file systems, it's probably best that we try to let the maintainers of those file systems try to figure out what be going on. - Ted