Re: [PATCH] writeback: Write dirty times for WB_SYNC_ALL writeback

Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx> · Mon, 5 Dec 2016 15:00:45 +0100

On 26/07/2016 11:38, Jan Kara wrote:
> Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
> queue_io() so that inodes which have only dirty timestamps are properly
> written on fsync(2) and sync(2). However there are other call sites -
> most notably going through write_inode_now() - which expect inode to be
> clean after WB_SYNC_ALL writeback. This is not currently true as we do
> not clear I_DIRTY_TIME in __writeback_single_inode() even for
> WB_SYNC_ALL writeback in all the cases. This then resulted in the
> following oops because bdev_write_inode() did not clean the inode and
> writeback code later stumbled over a dirty inode with detached wb.
> 
>   general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>   Modules linked in:
>   CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>   Workqueue: writeback wb_workfn (flush-11:0)
>   task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
>   RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
>   locked_inode_to_wb_and_lock_list+0xa2/0x750
>   RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
>   RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>   RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>   RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>   R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>   R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>   FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>   DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>   Stack:
>    ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>    ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>    ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>   Call Trace:
>    [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>    [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>    [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>    [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>    [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>    [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>    [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>    [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>    [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>    [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>   Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>   00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>   80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>   RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
>   RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>   fs/fs-writeback.c:281
>    RSP <ffff88006cdaf7d0>
>   ---[ end trace 986a4d314dcb2694 ]---
> 
> Fix the problem by making sure __writeback_single_inode() writes inode
> only with dirty times in WB_SYNC_ALL mode.
> 
> Reported-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> Tested-by: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/fs-writeback.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> Jens, can you please take this patch? Thanks!
> 								Honza
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 989a2cef6b76..51efb644410a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  	dirty = inode->i_state & I_DIRTY;
>  	if (inode->i_state & I_DIRTY_TIME) {
>  		if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
> +		    wbc->sync_mode == WB_SYNC_ALL ||
>  		    unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) ||
>  		    unlikely(time_after(jiffies,
>  					(inode->dirtied_time_when +
> 

Hi Jan,

I'm sorry to say that but this bug is surfacing again.

We got it using the latest Ubuntu 16.04 kernel but I did some test using
a 4.8 kernel and I was able to get it again.
It's not easy to recreate, we have to let guest running for a while with
several disks attached and a database test program which trigger this
disk is run to get it.

Here is the stack I got :

[113031.075540] Unable to handle kernel paging request for data at
address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
--- Exception: 0  at 0000000000000000
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620]
    pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
    lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590
    sp: c0000000fb65f8a0
   msr: 800000010280b033
   dar: 0
 dsisr: 40000000
  current = 0xc0000001d69be400
  paca    = 0xc000000003480000	 softe: 0	 irq_happened: 0x01
    pid   = 18689, comm = kworker/u16:10
Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016
0:mon> r
R00 = c00000000036cb6c   R16 = c0000001fc0312e8
R01 = c0000000fb65f8a0   R17 = c0000001fc031260
R02 = c000000001471600   R18 = c0000001fc031350
R03 = c0000001fc031260   R19 = 0000000000000000
R04 = c0000001d69beee0   R20 = 0000000000000000
R05 = 0000000000000000   R21 = c0000000fb65c000
R06 = 00000001fed50000   R22 = c000000014960b10
R07 = 00029313df052efd   R23 = c000000014960af0
R08 = 0000000000000000   R24 = 0000000000000000
R09 = 0000000000000000   R25 = c0000001fc0312e8
R10 = 0000000080000000   R26 = 0000000000000000
R11 = 071c71c71c71c71c   R27 = 0000000000000000
R12 = 0000000000000000   R28 = 0000000000000001
R13 = c000000003480000   R29 = c0000001fc031260
R14 = c0000000000fc3a8   R30 = c0000000fb65fba0
R15 = 0000000000000000   R31 = 0000000000000000
pc  = c0000000003692e0 locked_inode_to_wb_and_lock_list+0x50/0x290
cfar= c0000000000a14d0 lparcfg_data+0xc10/0xda0
lr  = c00000000036cb6c writeback_sb_inodes+0x30c/0x590
msr = 800000010280b033   cr  = 24652882
ctr = c000000000126e20   xer = 0000000020000000   trap =  300
dar = 0000000000000000   dsisr = 40000000

The panic is occuring when entering locked_inode_to_wb_and_lock_list(),
the inode->i_wb field is NULL. I'm almost sure that we didn't loop in
locked_inode_to_wb_and_lock_list() because the LR register is still
pointing to the caller.

Note that on another CPU, 'systemd-udevd' was doing this:
2:mon> t
[link register   ] c00000000034db4c __destroy_inode+0xac/0x220
[c0000001dfb57c80] c00000000034db40 __destroy_inode+0xa0/0x220 (unreliable)
[c0000001dfb57cb0] c00000000034e094 destroy_inode+0x44/0xc0
[c0000001dfb57ce0] c000000000346ae8 dentry_unlink_inode+0x138/0x220
[c0000001dfb57d10] c000000000346e58 __dentry_kill+0xe8/0x290
[c0000001dfb57d50] c0000000003289a8 __fput+0x1a8/0x310
[c0000001dfb57db0] c0000000000f90e0 task_work_run+0x130/0x190
[c0000001dfb57e00] c00000000001c124 do_notify_resume+0xc4/0xd0
[c0000001dfb57e30] c000000000009b44 ret_from_except_lite+0x70/0x74
--- Exception: c00 (System Call) at 00003fff87b568a8

I checked for the registers and the 2 CPUs are not dealing with the same
inode so not sure there is a direct link.

I'll try to recreate it on 4.9-rc8 but in the mean time if you have any
idea on what's going wrong here...

Thanks,
Laurent.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html