On Tue, Apr 12, 2022 at 08:30:13PM +0800, Qu Wenruo wrote: > [BUG] > When running generic/475 with 64K page size and 4K sector size, it has a > very high chance (almost 100%) to hang, with mostly data page locked but > no one is going to unlock it. > > [CAUSE] > With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on > reads"), if we failed to lookup checksum due to metadata IO error, we > will return error for btrfs_submit_data_bio(). > > This will cause the page to be unlocked twice in btrfs_do_readpage(): > > btrfs_do_readpage() > |- submit_extent_page() > | |- submit_one_bio() > | |- btrfs_submit_data_bio() > | |- if (ret) { > | |- bio->bi_status = ret; > | |- bio_endio(bio); } > | In the endio function, we will call end_page_read() > | and unlock_extent() to cleanup the subpage range. > | > |- if (ret) { > |- unlock_extent(); end_page_read() } > Here we unlock the extent and cleanup the subpage range > again. > > For unlock_extent(), it's mostly double unlock safe. > > But for end_page_read(), it's not, especially for subpage case, > as for subpage case we will call btrfs_subpage_end_reader() to reduce > the reader number, and use that to number to determine if we need to > unlock the full page. > > If double accounted, it can underflow the number and leave the page > locked without anyone to unlock it. > > [FIX] > The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on > reads") itself is completely fine, it's our existing code not properly > handling the error from bio submission hook properly. > > This patch will make submit_one_bio() to return void so that the callers > will never be able to do cleanup when bio submission hook fails. > > CC: stable@xxxxxxxxxxxxxxx # 5.18+ BTW stable tags are only for released kernels, if it's still in some rc then Fixes: is appropriate.