On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote:
> > On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > > Hello Dave, Brian,
> > > I will describe a generic reproduction that you asked for.
> > >
> > > It was performed on pristine XFS code from 3.8.13, taken from here:
> > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> > ....
> > > I mounted XFS with the following options:
> > > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> > >
> > > I started a couple of processes writing files sequentially onto this
> > > mount point, and after a few seconds crashed the VM.
> > > When the VM came up, I took the metadump file and placed it in:
> > > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> > >
> > > Then I set up the following Device Mapper target onto /dev/vde:
> > > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > > I am attaching the code (and Makefile) of the dm-linear-custom target.
> > > It is an exact copy of dm-linear, except that it has a module
> > > parameter. With the parameter set to 0, this is an identity mapping
> > > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > > failed with ENOSPC. There is a workqueue to fail them in a different
> > > context (not sure if really needed, but that's what our "real" custom
> > > block device does).
> >
> > Well, there you go. That explains it - an asynchronous dispatch error
> > happening fast enough to race with the synchronous XFS dispatch
> > processing.
> >
> > dispatch thread                      device workqueue
> > xfs_buf_hold();
> > atomic_set(b_io_remaining, 1)
> > atomic_inc(b_io_remaining)
> > submit_bio(bio)
> >                                      queue_work(bio)
> > xfs_buf_ioend(bp, ....);
> >   atomic_dec(b_io_remaining)
> > xfs_buf_rele()
> >                                      bio error set to ENOSPC
> >                                      bio->end_io()
> >                                        xfs_buf_bio_endio()
> >                                          bp->b_error = ENOSPC
> >                                          _xfs_buf_ioend(bp, 1);
> >                                            atomic_dec(b_io_remaining)
> >                                            xfs_buf_ioend(bp, 1);
> >                                              queue_work(bp)
> > xfs_buf_iowait()
> >   if (bp->b_error) return error;
> > if (error)
> >   xfs_buf_relse()
> >     xfs_buf_rele()
> >       xfs_buf_free()
> >
> > And now we have a freed buffer that is queued on the io completion
> > queue. Basically, it requires the buffer error to be set
> > asynchronously *between* the dispatch path decrementing its I/O count
> > after dispatch and the point where we wait on the IO.
> >
> That's basically the theory I wanted to test with the experimental
> patch. E.g., the error check races with the iodone workqueue item.
> > Not sure what the right fix is yet - removing the bp->b_error check
> > from xfs_buf_iowait() doesn't solve the problem - it just prevents
> > this code path from being tripped over by the race condition.
> >
> Perhaps I'm missing some context... I don't follow how removing the
> error check doesn't solve the problem. It clearly closes the race and
> perhaps there are other means of doing the same thing, but what part of
> the problem does that leave unresolved?

Anything that does:

	xfs_buf_iorequest(bp);
	if (bp->b_error)
		xfs_buf_relse(bp);

is susceptible to the same race condition, because bp->b_error can be
set asynchronously before the buffer IO completion processing is
complete.
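To spell that out against the interleaving above (an annotated
illustration of the same window, not any particular call site):

	xfs_buf_iorequest(bp);
		/*
		 * bios have been submitted; the device's workqueue fails
		 * one asynchronously: the bio end_io path sets
		 * bp->b_error = ENOSPC, drops the last b_io_remaining
		 * count and queues the iodone work.
		 */
	if (bp->b_error) {
		/*
		 * We can observe the error before the queued completion
		 * work has had a chance to run...
		 */
		xfs_buf_relse(bp);
		/*
		 * ...and unlock and drop what may be the final
		 * reference, so xfs_buf_free() runs while the iodone
		 * work item is still pending and the workqueue then
		 * touches freed memory.
		 */
	}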
> E.g., we provide a synchronization mechanism for an async submission
> path and an object (xfs_buf) that is involved with potentially
> multiple such async (I/O) operations. The async callback side manages
> the counts of outstanding bios etc. to set the state of the buf object
> correctly and fires a completion when everything is done. The calling
> side simply waits on the completion before it can analyze the state of
> the object. Referring to anything inside that object that happens to
> be managed by the buffer I/O mechanism before the buffer is considered
> complete just seems generally racy.

The point is that the IO submitter holds the buffer lock and so has
"exclusive" access to the buffer, even after it is submitted. It is
allowed to check the internal state of the buffer at any time, and that
state is expected to be sane, including while IO completion processing
is running.

The real issue is that workqueue based IO completion processing is not
protected by a reference count of any kind for synchronous IO. It is
done with only the lock holder's reference held, and so if the lock
holder unlocks the buffer and drops that reference before the
completion work runs, the buffer is freed out from under the pending
work item. This issue doesn't exist with B_ASYNC IO submission, because
the B_ASYNC IO owns the reference and the buffer lock and drops them
from the workqueue when the IO completion processing actually completes
(roughly sketched at the end of this mail)...

> It looks like submit_bio() manages this by providing the error through
> the callback (always). It also doesn't look like the submission path
> is guaranteed to be synchronous either (consider md, which appears to
> use workqueues and kernel threads), so I'm not sure that '...;
> xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere
> unless you're explicitly looking for a write verifier error or
> something and do nothing further with the buf that is contingent on
> completion (e.g., freeing it or something it depends on).

My point remains that it *should be safe*, and the intent is that the
caller should be able to check for submission errors without being
exposed to a use after free situation. That's the bug we need to fix,
not say "you can't check for submission errors on synchronous IO" to
avoid the race condition.....

Cheers,

Dave
-- 
Dave Chinner
david@xxxxxxxxxxxxx
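The asymmetry described above, reduced to a sketch. This is
illustrative pseudo-code with made-up names - io_completion_work() and
run_iodone_callbacks() are not the real 3.8 completion functions, and
XBF_ASYNC is just the in-code spelling of B_ASYNC above:

	static void
	io_completion_work(
		struct xfs_buf	*bp)
	{
		if (bp->b_flags & XBF_ASYNC) {
			/*
			 * Async IO: the IO itself owns the buffer lock
			 * and a reference, and only drops them here,
			 * after completion processing has finished, so
			 * there is no use-after-free window.
			 */
			run_iodone_callbacks(bp);
			xfs_buf_relse(bp);
		} else {
			/*
			 * Sync IO: the only reference is the lock
			 * holder's. If the submitter sees bp->b_error,
			 * unlocks and drops that reference before this
			 * work item runs, we are working on freed
			 * memory.
			 */
			run_iodone_callbacks(bp);
			complete(&bp->b_iowait);
		}
	}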