On Wed, Jan 2, 2019 at 3:40 PM Jan Kara <jack@xxxxxxx> wrote: > > On Fri 28-12-18 22:34:13, Tetsuo Handa wrote: > > On 2018/08/06 19:09, Jan Kara wrote: > > > On Tue 31-07-18 00:07:22, Tetsuo Handa wrote: > > >> On 2018/07/21 5:06, Andrew Morton wrote: > > >>> On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote: > > >>> > > >>>>> > > >>>>> This report is stalling after mount() completed and process used remap_file_pages(). > > >>>>> I think that we might need to use debug printk(). But I don't know what to examine. > > >>>>> > > >>>> > > >>>> Andrew, can you pick up this debug printk() patch? > > >>>> I guess we can get the result within one week. > > >>> > > >>> Sure, let's toss it in -next for a while. > > >>> > > >>>> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001 > > >>>> From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> > > >>>> Date: Fri, 20 Jul 2018 19:29:06 +0900 > > >>>> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem > > >>>> > > >>>> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain > > >>>> __getblk_gfp() line in the backtrace. Since there is a comment block that > > >>>> says that __getblk_gfp() will lock up the machine if try_to_free_buffers() > > >>>> attempt from grow_dev_page() is failing, let's start from checking whether > > >>>> syzbot is hitting that case. This change will be removed after the bug is > > >>>> fixed. > > >>> > > >>> I'm not sure that grow_dev_page() is hanging. It has often been > > >>> suspected, but always is proven innocent. Lets see. > > >> > > >> syzbot reproduced this problem ( https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000 ) . > > >> It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever. Any idea? > > > > > > Looks like some kind of a race where device block size gets changed while > > > getblk() runs (and creates buffers for underlying page). I don't have time > > > to nail it down at this moment can have a look into it later unless someone > > > beats me to it. > > > > I feel that the frequency of hitting this problem was decreased > > by merging loop module's ioctl() serialization patches. But this > > problem is still there, and syzbot got a new line in > > https://syzkaller.appspot.com/text?tag=CrashLog&x=177f889f400000 . > > > > [ 615.881781] __loop_clr_fd: partition scan of loop5 failed (rc=-22) > > [ 619.059920] syz-executor4(2193): getblk(): executed=cd bh_count=0 bh_state=29 > > [ 622.069808] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0 > > [ 625.080013] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0 > > [ 628.089900] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0 > > > > I guess that loop module is somehow related to this problem. > > I had a look into this and the only good explanation for this I have is > that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits). > If that would happen, we'd get exactly the behavior syzkaller observes > because grow_buffers() would populate different page than > __find_get_block() then looks up. > > However I don't see how that's possible since the filesystem has the block > device open exclusively and blkdev_bszset() makes sure we also have > exclusive access to the block device before changing the block device size. > So changing block device block size after filesystem gets access to the > device should be impossible. If this is that critical and impossible to fire, maybe it makes sense to add a corresponding debug check to some code paths? syzkaller will immediately catch any violations if they happen. > Anyway, could you perhaps add to your debug patch a dump of 'size' passed > to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us > whether my theory is right or not. Thanks!