Re: use-after-free on log replay failure

Brian Foster <bfoster@xxxxxxxxxx> · Mon, 11 Aug 2014 09:20:57 -0400

On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> Hi Brian,
> 
> On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> >> Hello Dave and Brian,
> >>
> >> Dave, I tried the patch you suggested, but it does not fix the issue. I did
> >> some further digging, and it appears that _xfs_buf_ioend(schedule=1) can be
> >> called from xfs_buf_iorequest(), which the patch fixes, but also from
> >> xfs_buf_bio_end_io() which is my case. I am reproducing the issue pretty
> >> easily. The flow that I have is like this:
> >> - xlog_recover() calls xlog_find_tail(). This works alright.
> >
> > What's the purpose of a sleep here?
> >
> >> - Now I add a small sleep before calling xlog_do_recover(), and meanwhile I
> >> instruct my block device to return ENOSPC for any WRITE from now on.
> >>
> >> What seems to happen is that several WRITE bios are submitted and they all
> >> fail. When they do, they reach xfs_buf_ioend() through a stack like this:
> >>
> >> Aug  6 15:23:07 dev kernel: [  304.410528] [56]xfs*[xfs_buf_ioend:1056]
> >> XFS(dm-19): Scheduling xfs_buf_iodone_work on error
> >> Aug  6 15:23:07 dev kernel: [  304.410534] Pid: 56, comm: kworker/u:1
> >> Tainted: G        W  O 3.8.13-557-generic #1382000791
> >> Aug  6 15:23:07 dev kernel: [  304.410537] Call Trace:
> >> Aug  6 15:23:07 dev kernel: [  304.410587]  [<ffffffffa04d6654>]
> >> xfs_buf_ioend+0x1a4/0x1b0 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.410621]  [<ffffffffa04d6685>]
> >> _xfs_buf_ioend+0x25/0x30 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.410643]  [<ffffffffa04d6b3d>]
> >> xfs_buf_bio_end_io+0x3d/0x50 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.410652]  [<ffffffff811c3d8d>]
> >> bio_endio+0x1d/0x40
> >> ...
> >>
> >> At this point, they are scheduled to run xlog_recover_iodone through
> >> xfslogd_workqueue.
> >> The first callback that gets called, calls xfs_do_force_shutdown in stack
> >> like this:
> >>
> >> Aug  6 15:23:07 dev kernel: [  304.411791] XFS (dm-19): metadata I/O error:
> >> block 0x3780001 ("xlog_recover_iodone") error 28 numblks 1
> >> Aug  6 15:23:07 dev kernel: [  304.413493] XFS (dm-19):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.413837]  [<ffffffffa04e0b60>]
> >> xfs_do_force_shutdown+0x40/0x180 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413870]  [<ffffffffa0526848>] ?
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413902]  [<ffffffffa0526848>]
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413923]  [<ffffffffa04d645d>]
> >> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.413930]  [<ffffffff81077a11>]
> >> process_one_work+0x141/0x4a0
> >> Aug  6 15:23:07 dev kernel: [  304.413937]  [<ffffffff810789e8>]
> >> worker_thread+0x168/0x410
> >> Aug  6 15:23:07 dev kernel: [  304.413943]  [<ffffffff81078880>] ?
> >> manage_workers+0x120/0x120
> >> Aug  6 15:23:07 dev kernel: [  304.413949]  [<ffffffff8107df10>]
> >> kthread+0xc0/0xd0
> >> Aug  6 15:23:07 dev kernel: [  304.413954]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.413976]  [<ffffffff816ab86c>]
> >> ret_from_fork+0x7c/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.413986]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.413990] ---[ end trace 988d698520e1fa81
> >> ]---
> >> Aug  6 15:23:07 dev kernel: [  304.414012] XFS (dm-19): I/O Error Detected.
> >> Shutting down filesystem
> >> Aug  6 15:23:07 dev kernel: [  304.415936] XFS (dm-19): Please umount the
> >> filesystem and rectify the problem(s)
> >>
> >> But the rest of the callbacks also arrive:
> >> Aug  6 15:23:07 dev kernel: [  304.417812] XFS (dm-19): metadata I/O error:
> >> block 0x3780002 ("xlog_recover_iodone") error 28 numblks 1
> >> Aug  6 15:23:07 dev kernel: [  304.420420] XFS (dm-19):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.420427] XFS (dm-19): metadata I/O error:
> >> block 0x3780008 ("xlog_recover_iodone") error 28 numblks 8
> >> Aug  6 15:23:07 dev kernel: [  304.422708] XFS (dm-19):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.422738] XFS (dm-19): metadata I/O error:
> >> block 0x3780010 ("xlog_recover_iodone") error 28 numblks 8
> >>
> >> The mount sequence fails and goes back to the caller:
> >> Aug  6 15:23:07 dev kernel: [  304.423438] XFS (dm-19): log mount/recovery
> >> failed: error 28
> >> Aug  6 15:23:07 dev kernel: [  304.423757] XFS (dm-19): log mount failed
> >>
> >> But there are still additional callbacks to deliver, which the mount
> >> sequence did not wait for!
> >> Aug  6 15:23:07 dev kernel: [  304.425717] XFS ( @dR):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.425723] XFS ( @dR): metadata I/O error:
> >> block 0x3780018 ("xlog_recover_iodone") error 28 numblks 8
> >> Aug  6 15:23:07 dev kernel: [  304.428239] XFS ( @dR):
> >> xfs_do_force_shutdown(0x1) called from line 377 of file
> >> /mnt/work/alex/xfs/fs/xfs/xfs_log_recover.c.  Return address =
> >> 0xffffffffa0526848
> >> Aug  6 15:23:07 dev kernel: [  304.428246] XFS ( @dR): metadata I/O error:
> >> block 0x37800a0 ("xlog_recover_iodone") error 28 numblks 16
> >>
> >> Notice the junk that they are printing! Naturally, because xfs_mount
> >> structure has been kfreed.
> >>
> >> Finally the kernel crashes (instead of printing junk), because the xfs_mount
> >> structure is gone, but the callback tries to access it (printing the name):
> >>
> >> Aug  6 15:23:07 dev kernel: [  304.430796] general protection fault: 0000
> >> [#1] SMP
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Modules linked in: xfrm_user
> >> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 iscsi_scst_tcp(O)
> >> scst_vdisk(O) scst(O) dm_zcache(O) dm_btrfs(O) xfs(O) btrfs(O) libcrc32c
> >> raid456(O) async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq
> >> async_tx raid1(O) md_mod deflate zlib_deflate ctr twofish_generic
> >> twofish_x86_64_3way glue_helper lrw xts gf128mul twofish_x86_64
> >> twofish_common camellia_generic serpent_generic blowfish_generic
> >> blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic xcbc
> >> rmd160 sha512_generic crypto_null af_key xfrm_algo dm_round_robin kvm vfat
> >> fat ppdev psmouse microcode nfsd nfs_acl dm_multipath(O) serio_raw
> >> parport_pc nfsv4 dm_iostat(O) mac_hid i2c_piix4 auth_rpcgss nfs fscache
> >> lockd sunrpc lp parport floppy
> >> Aug  6 15:23:07 dev kernel: [  304.432035] CPU 1
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Pid: 133, comm: kworker/1:1H
> >> Tainted: G        W  O 3.8.13-557-generic #1382000791 Bochs Bochs
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RIP: 0010:[<ffffffff8133c2cb>]
> >> [<ffffffff8133c2cb>] strnlen+0xb/0x30
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RSP: 0018:ffff880035461b08
> >> EFLAGS: 00010086
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RAX: 0000000000000000 RBX:
> >> ffffffff81e6a4e7 RCX: 0000000000000000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RDX: e4e8390a265c0000 RSI:
> >> ffffffffffffffff RDI: e4e8390a265c0000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RBP: ffff880035461b08 R08:
> >> 000000000000ffff R09: 000000000000ffff
> >> Aug  6 15:23:07 dev kernel: [  304.432035] R10: 0000000000000000 R11:
> >> 00000000000004cd R12: e4e8390a265c0000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] R13: ffffffff81e6a8c0 R14:
> >> 0000000000000000 R15: 000000000000ffff
> >> Aug  6 15:23:07 dev kernel: [  304.432035] FS:  0000000000000000(0000)
> >> GS:ffff88007fc80000(0000) knlGS:0000000000000000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] CS:  0010 DS: 0000 ES: 0000 CR0:
> >> 000000008005003b
> >> Aug  6 15:23:07 dev kernel: [  304.432035] CR2: 00007fc902ffbfd8 CR3:
> >> 000000007702a000 CR4: 00000000000006e0
> >> Aug  6 15:23:07 dev kernel: [  304.432035] DR0: 0000000000000000 DR1:
> >> 0000000000000000 DR2: 0000000000000000
> >> Aug  6 15:23:07 dev kernel: [  304.432035] DR3: 0000000000000000 DR6:
> >> 00000000ffff0ff0 DR7: 0000000000000400
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Process kworker/1:1H (pid: 133,
> >> threadinfo ffff880035460000, task ffff880035412e00)
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Stack:
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461b48
> >> ffffffff8133dd5e 0000000000000000 ffffffff81e6a4e7
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  ffffffffa0566cba
> >> ffff880035461c80 ffffffffa0566cba ffffffff81e6a8c0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  ffff880035461bc8
> >> ffffffff8133ef59 ffff880035461bc8 ffffffff81c84040
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Call Trace:
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133dd5e>]
> >> string.isra.4+0x3e/0xd0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133ef59>]
> >> vsnprintf+0x219/0x640
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8133f441>]
> >> vscnprintf+0x11/0x30
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105a971>]
> >> vprintk_emit+0xc1/0x490
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8105aa20>] ?
> >> vprintk_emit+0x170/0x490
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8168b992>]
> >> printk+0x61/0x63
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9bf1>]
> >> __xfs_printk+0x31/0x50 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e9e43>]
> >> xfs_notice+0x53/0x60 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04e0c15>]
> >> xfs_do_force_shutdown+0xf5/0x180 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>] ?
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa0526848>]
> >> xlog_recover_iodone+0x48/0x70 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffffa04d645d>]
> >> xfs_buf_iodone_work+0x4d/0xa0 [xfs]
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81077a11>]
> >> process_one_work+0x141/0x4a0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff810789e8>]
> >> worker_thread+0x168/0x410
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff81078880>] ?
> >> manage_workers+0x120/0x120
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107df10>]
> >> kthread+0xc0/0xd0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff816ab86c>]
> >> ret_from_fork+0x7c/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  [<ffffffff8107de50>] ?
> >> flush_kthread_worker+0xb0/0xb0
> >> Aug  6 15:23:07 dev kernel: [  304.432035] Code: 31 c0 80 3f 00 55 48 89 e5
> >> 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7 48 29 f8 5d c3 66 90 55 31
> >> c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8 eb 0c 0f 1f 00 48 83 ee
> >> 01 80 38 00 74
> >> Aug  6 15:23:07 dev kernel: [  304.432035] RIP  [<ffffffff8133c2cb>]
> >> strnlen+0xb/0x30
> >> Aug  6 15:23:07 dev kernel: [  304.432035]  RSP <ffff880035461b08>
> >>
> >>
> >> So previously you said: "So, something is corrupting memory and stamping all
> >> over the XFS structures." and also "given you have a bunch of out of tree
> >> modules loaded (and some which are experiemental) suggests that you have a
> >> problem with your storage...".
> >>
> >> But I believe, my analysis shows that during the mount sequence XFS does not
> >> wait properly for all the bios to complete, before failing the mount
> >> sequence back to the caller.
> >>
> >
> > As an experiment, what about the following? Compile tested only and not
> > safe for general use.
> >
> > What might help more is to see if you can create a reproducer on a
> > recent, clean kernel. Perhaps a metadump of your reproducer fs combined
> > with whatever block device ENOSPC hack you're using would do it.
> >
> > Brian
> >
> > ---8<---
> >
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index cd7b8ca..fbcf524 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> >   * case nothing will ever complete.  It returns the I/O error code, if any, or
> >   * 0 if there was no error.
> >   */
> > -int
> > -xfs_buf_iowait(
> > -       xfs_buf_t               *bp)
> > +static int
> > +__xfs_buf_iowait(
> > +       struct xfs_buf          *bp,
> > +       bool                    skip_error)
> >  {
> >         trace_xfs_buf_iowait(bp, _RET_IP_);
> >
> > -       if (!bp->b_error)
> > +       if (skip_error || !bp->b_error)
> >                 wait_for_completion(&bp->b_iowait);
> >
> >         trace_xfs_buf_iowait_done(bp, _RET_IP_);
> >         return bp->b_error;
> >  }
> >
> > +int
> > +xfs_buf_iowait(
> > +       struct xfs_buf          *bp)
> > +{
> > +       return __xfs_buf_iowait(bp, false);
> > +}
> > +
> >  xfs_caddr_t
> >  xfs_buf_offset(
> >         xfs_buf_t               *bp,
> > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> >                 bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> >
> >                 list_del_init(&bp->b_list);
> > -               error2 = xfs_buf_iowait(bp);
> > +               error2 = __xfs_buf_iowait(bp, true);
> >                 xfs_buf_relse(bp);
> >                 if (!error)
> >                         error = error2;
> >
> > ---
> I think that this patch fixes the problem. I tried reproducing it like
> 30 times, and it doesn't happen with this patch. Dropping this patch
> reproduces the problem within 1 or 2 tries. Thanks!
> What are next steps? How to make it "safe for general use"?
> 

Ok, thanks for testing. I think that implicates the caller bypassing the
expected blocking with the right sequence of log recovery I/Os and
device failure. TBH, I'd still like to see the specifics, if possible.
Could you come up with a generic reproducer for this? I think a metadump
of the fs with the dirty log plus whatever device failure simulation
hack you're using would suffice.

The ideal fix is not yet clear to me. Technically, we could always find
a way to customize this particular path to rely on b_iowait since that
appears safe, but that could just be a band aid over a larger problem.
I'll need to step back and stare at this code some more to try and
understand the layering better, then follow up with something when
things are more clear.

Brian

> Thanks,
> Alex.
> 
> 
> 
> 
> 
> >
> >> Thanks,
> >> Alex.
> >>
> >>
> >>
> >> -----Original Message----- From: Dave Chinner
> >> Sent: 05 August, 2014 2:07 AM
> >> To: Alex Lyakas
> >> Cc: xfs@xxxxxxxxxxx
> >> Subject: Re: use-after-free on log replay failure
> >>
> >> On Mon, Aug 04, 2014 at 02:00:05PM +0300, Alex Lyakas wrote:
> >> >Greetings,
> >> >
> >> >we had a log replay failure due to some errors that the underlying
> >> >block device returned:
> >> >[49133.801406] XFS (dm-95): metadata I/O error: block 0x270e8c180
> >> >("xlog_recover_iodone") error 28 numblks 16
> >> >[49133.802495] XFS (dm-95): log mount/recovery failed: error 28
> >> >[49133.802644] XFS (dm-95): log mount failed
> >>
> >> #define ENOSPC          28      /* No space left on device */
> >>
> >> You're getting an ENOSPC as a metadata IO error during log recovery?
> >> Thin provisioning problem, perhaps, and the error is occurring on
> >> submission rather than completion? If so:
> >>
> >> 8d6c121 xfs: fix buffer use after free on IO error
> >>
> >> Cheers,
> >>
> >> Dave.
> >> --
> >> Dave Chinner
> >> david@xxxxxxxxxxxxx
> >>
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs