On 26/07/2016 11:38, Jan Kara wrote: > Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and > queue_io() so that inodes which have only dirty timestamps are properly > written on fsync(2) and sync(2). However there are other call sites - > most notably going through write_inode_now() - which expect inode to be > clean after WB_SYNC_ALL writeback. This is not currently true as we do > not clear I_DIRTY_TIME in __writeback_single_inode() even for > WB_SYNC_ALL writeback in all the cases. This then resulted in the > following oops because bdev_write_inode() did not clean the inode and > writeback code later stumbled over a dirty inode with detached wb. > > general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN > Modules linked in: > CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > Workqueue: writeback wb_workfn (flush-11:0) > task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 > RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] > locked_inode_to_wb_and_lock_list+0xa2/0x750 > RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 > RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 > RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 > R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 > R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 > FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 > DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > Stack: > ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 > ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 > ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e > Call Trace: > [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 > [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 > [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 > [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 > [< inline >] wb_do_writeback fs/fs-writeback.c:1844 > [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 > [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 > [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 > [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 > [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 > Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e > 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> > 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 > RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 > RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 > fs/fs-writeback.c:281 > RSP <ffff88006cdaf7d0> > ---[ end trace 986a4d314dcb2694 ]--- > > Fix the problem by making sure __writeback_single_inode() writes inode > only with dirty times in WB_SYNC_ALL mode. > > Reported-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx> > Tested-by: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx> > Signed-off-by: Jan Kara <jack@xxxxxxx> > --- > fs/fs-writeback.c | 1 + > 1 file changed, 1 insertion(+) > > Jens, can you please take this patch? Thanks! > Honza > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 989a2cef6b76..51efb644410a 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > dirty = inode->i_state & I_DIRTY; > if (inode->i_state & I_DIRTY_TIME) { > if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) || > + wbc->sync_mode == WB_SYNC_ALL || > unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || > unlikely(time_after(jiffies, > (inode->dirtied_time_when + > Hi Jan, I'm sorry to say that but this bug is surfacing again. We got it using the latest Ubuntu 16.04 kernel but I did some test using a 4.8 kernel and I was able to get it again. It's not easy to recreate, we have to let guest running for a while with several disks attached and a database test program which trigger this disk is run to get it. Here is the stack I got : [113031.075540] Unable to handle kernel paging request for data at address 0x00000000 [113031.075614] Faulting instruction address: 0xc0000000003692e0 0:mon> t [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590 [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150 [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450 [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580 [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590 [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660 [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130 [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c --- Exception: 0 at 0000000000000000 0:mon> e cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620] pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290 lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590 sp: c0000000fb65f8a0 msr: 800000010280b033 dar: 0 dsisr: 40000000 current = 0xc0000001d69be400 paca = 0xc000000003480000 softe: 0 irq_happened: 0x01 pid = 18689, comm = kworker/u16:10 Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016 0:mon> r R00 = c00000000036cb6c R16 = c0000001fc0312e8 R01 = c0000000fb65f8a0 R17 = c0000001fc031260 R02 = c000000001471600 R18 = c0000001fc031350 R03 = c0000001fc031260 R19 = 0000000000000000 R04 = c0000001d69beee0 R20 = 0000000000000000 R05 = 0000000000000000 R21 = c0000000fb65c000 R06 = 00000001fed50000 R22 = c000000014960b10 R07 = 00029313df052efd R23 = c000000014960af0 R08 = 0000000000000000 R24 = 0000000000000000 R09 = 0000000000000000 R25 = c0000001fc0312e8 R10 = 0000000080000000 R26 = 0000000000000000 R11 = 071c71c71c71c71c R27 = 0000000000000000 R12 = 0000000000000000 R28 = 0000000000000001 R13 = c000000003480000 R29 = c0000001fc031260 R14 = c0000000000fc3a8 R30 = c0000000fb65fba0 R15 = 0000000000000000 R31 = 0000000000000000 pc = c0000000003692e0 locked_inode_to_wb_and_lock_list+0x50/0x290 cfar= c0000000000a14d0 lparcfg_data+0xc10/0xda0 lr = c00000000036cb6c writeback_sb_inodes+0x30c/0x590 msr = 800000010280b033 cr = 24652882 ctr = c000000000126e20 xer = 0000000020000000 trap = 300 dar = 0000000000000000 dsisr = 40000000 The panic is occuring when entering locked_inode_to_wb_and_lock_list(), the inode->i_wb field is NULL. I'm almost sure that we didn't loop in locked_inode_to_wb_and_lock_list() because the LR register is still pointing to the caller. Note that on another CPU, 'systemd-udevd' was doing this: 2:mon> t [link register ] c00000000034db4c __destroy_inode+0xac/0x220 [c0000001dfb57c80] c00000000034db40 __destroy_inode+0xa0/0x220 (unreliable) [c0000001dfb57cb0] c00000000034e094 destroy_inode+0x44/0xc0 [c0000001dfb57ce0] c000000000346ae8 dentry_unlink_inode+0x138/0x220 [c0000001dfb57d10] c000000000346e58 __dentry_kill+0xe8/0x290 [c0000001dfb57d50] c0000000003289a8 __fput+0x1a8/0x310 [c0000001dfb57db0] c0000000000f90e0 task_work_run+0x130/0x190 [c0000001dfb57e00] c00000000001c124 do_notify_resume+0xc4/0xd0 [c0000001dfb57e30] c000000000009b44 ret_from_except_lite+0x70/0x74 --- Exception: c00 (System Call) at 00003fff87b568a8 I checked for the registers and the 2 CPUs are not dealing with the same inode so not sure there is a direct link. I'll try to recreate it on 4.9-rc8 but in the mean time if you have any idea on what's going wrong here... Thanks, Laurent. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html