On Fri, Dec 11, 2015 at 11:09:58AM +1100, NeilBrown wrote:
> On Fri, Dec 11 2015, Shaohua Li wrote:
>
> > On Wed, Dec 09, 2015 at 05:34:45PM +1100, NeilBrown wrote:
> >> On Wed, Dec 09 2015, Shaohua Li wrote:
> >>
> >> > On Wed, Dec 09, 2015 at 11:36:30AM +1100, NeilBrown wrote:
> >> >> On Thu, Dec 03 2015, Christoph Hellwig wrote:
> >> >>
> >> >> > Currently the raid5-cache code is heavily relying on GFP_NOFAIL allocations.
> >> >> >
> >> >> > I've looked into replacing these with mempools and biosets, and for the
> >> >> > bio and the meta_page that's pretty trivial as they have short lifetimes
> >> >> > and do make guaranteed progress. I'm massively struggling with the iounit
> >> >> > allocation, though. These can live on for a long time over log I/O, cache
> >> >> > flushing and last but not least RAID I/O, and every attempt at something
> >> >> > mempool-like results in reproducible deadlocks. I wonder if we need to
> >> >> > figure out some more efficient data structure to communicate the completion
> >> >> > status that doesn't rely on these fairly long-living allocations from
> >> >> > the I/O path.
> >> >>
> >> >> Presumably the root cause of these deadlocks is that the raid5d thread
> >> >> has called
> >> >>   handle_stripe -> ops_run_io -> r5l_write_stripe -> r5l_log_stripe
> >> >>   -> r5l_get_meta -> r5l_new_meta
> >> >>
> >> >> and r5l_new_meta is blocked on memory allocation, which won't complete
> >> >> until some raid5 stripes get written out, which requires raid5d to do
> >> >> something more useful than sitting and waiting.
> >> >>
> >> >> I suspect a good direction towards a solution would be to allow the
> >> >> memory allocation to fail, and to cleanly propagate that failure indication
> >> >> up through r5l_log_stripe to r5l_write_stripe, which falls back to adding
> >> >> the stripe_head to ->no_space_stripes.
> >> >>
> >> >> Then we only release stripes from no_space_stripes when a memory
> >> >> allocation might succeed.
> >> >>
> >> >> There are lots of missing details, and possibly we would need a separate
> >> >> list rather than re-using no_space_stripes.
> >> >> But the key idea is that raid5d should never block (except beneath
> >> >> submit_bio on some other device) and when it cannot make progress
> >> >> without blocking, it should queue the stripe_head for later handling.
> >> >>
> >> >> Does that make sense?
> >> >
> >> > It does remove the scary __GFP_NOFAIL, but the approach is essentially
> >> > identical to a 'retry after allocation failure'. Why not just let the mm
> >> > (with __GFP_NOFAIL) do the retry then?
> >> >
> >>
> >> Because deadlocks.
> >>
> >> If raid5d is waiting for the mm to allocate memory, then it cannot
> >> retire write requests which could free up memory.
> >
> > Ok, I missed the deadlock issue. It looks like raid5d can't sleep waiting
> > for memory. That means the mempool for the metadata page/bio doesn't work
> > either, as raid5d might block and not dispatch previous IO. Your proposal
> > looks reasonable.
>
> raid5d mustn't sleep waiting for something that only raid5d can provide.
>
> A bio allocated from a mempool will be returned to that mempool when the
> IO request completes (if the end_io routine calls bio_put, which it
> does).
> So when waiting on a mempool for a bio, you are only waiting for some
> underlying device to perform IO, you are not waiting for anything that
> raid5d might do. So that is safe.
>
> Similarly the meta_page is freed in the end_io function, so it is safe
> to wait for that.
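To make the bio/meta_page side concrete, here is a rough sketch of the
pattern described above. The names r5l_bs, r5l_meta_pool and the
_sketch helpers are made up for illustration and are not the actual
fields; the real conversion in Christoph's patches may of course look
quite different.

#include <linux/bio.h>
#include <linux/mempool.h>

/* Illustrative pools: set up once at log init time. */
static struct bio_set *r5l_bs;		/* bioset_create(BIO_POOL_SIZE, 0) */
static mempool_t *r5l_meta_pool;	/* mempool_create_page_pool(2, 0)  */

static void r5l_log_endio_sketch(struct bio *bio)
{
	struct page *meta_page = bio->bi_private;

	/* Both allocations are returned from the IO completion path,
	 * so the pools are refilled by the log device finishing IO,
	 * not by anything raid5d has to do. */
	mempool_free(meta_page, r5l_meta_pool);
	bio_put(bio);			/* returns the bio to r5l_bs */
}

static struct bio *r5l_new_meta_sketch(void)
{
	/* Safe to sleep here: see the comment in the end_io above. */
	struct bio *bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, r5l_bs);
	struct page *meta_page = mempool_alloc(r5l_meta_pool, GFP_NOIO);

	bio->bi_private = meta_page;
	bio->bi_end_io = r5l_log_endio_sketch;
	return bio;
}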
> Note that the important feature of a mempool is not so much that it
> pre-allocates some memory. The important point is that when that
> pre-allocated memory is freed it can *only* be used by the owner of the
> mempool.
> By contrast, memory allocated with NOFAIL, when freed, might be used by
> any other caller anywhere in the kernel.
> So waiting on a mempool means waiting for certain well-defined events.
> Waiting in NOFAIL could end up waiting for almost anything. So it is
> much harder to reason about deadlocks.
>
> The bio and the meta_page contrast with the r5l_io_unit, which isn't
> freed by the first end_io, but remains until raid5d has done some more
> work. So raid5d cannot wait for a r5l_io_unit to be freed.

I was confused; 2 bios are indeed enough to avoid the deadlock, sorry.
Can you pick up Christoph's patches? I'll work on a patch for
r5l_io_unit.

Thanks,
Shaohua
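For the r5l_io_unit side, a very rough sketch of the fallback Neil
suggests, just to show the shape of it. The GFP flags and the reuse of
no_space_stripes are illustrative choices; the actual patch may well
want a separate list and its own retry hook.

/*
 * Sketch only, loosely following the existing raid5-cache code: let
 * the io_unit allocation fail instead of using __GFP_NOFAIL, and park
 * the stripe so raid5d never blocks here.
 */
static int r5l_write_stripe_sketch(struct r5l_log *log, struct stripe_head *sh)
{
	struct r5l_io_unit *io;

	io = kmem_cache_zalloc(log->io_kc, GFP_NOWAIT | __GFP_NOWARN);
	if (!io) {
		/* raid5d must not sleep waiting for memory that only
		 * completed stripe writes can free; queue the stripe
		 * and retry it when an io_unit is released. */
		spin_lock(&log->no_space_stripes_lock);
		list_add_tail(&sh->log_list, &log->no_space_stripes);
		spin_unlock(&log->no_space_stripes_lock);
		return -ENOMEM;
	}

	/* ...otherwise fill in the io_unit and continue as today... */
	return 0;
}

The other half would be to run the parked stripes again when an io_unit
is freed, similar to how the existing no-space handling releases stripes
once log space has been reclaimed.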