On Fri, Dec 11, 2015 at 11:09:58AM +1100, NeilBrown wrote:
> On Fri, Dec 11 2015, Shaohua Li wrote:
>
> > On Wed, Dec 09, 2015 at 05:34:45PM +1100, NeilBrown wrote:
> >> On Wed, Dec 09 2015, Shaohua Li wrote:
> >>
> >> > On Wed, Dec 09, 2015 at 11:36:30AM +1100, NeilBrown wrote:
> >> >> On Thu, Dec 03 2015, Christoph Hellwig wrote:
> >> >>
> >> >> > Currently the raid5-cache code is heavily relying on GFP_NOFAIL allocations.
> >> >> >
> >> >> > I've looked into replacing these with mempools and biosets, and for the
> >> >> > bio and the meta_page that's pretty trivial as they have short lifetimes
> >> >> > and do make guaranteed progress. I'm massively struggling with the iounit
> >> >> > allocation, though. These can live on for a long time over log I/O, cache
> >> >> > flushing and last but not least RAID I/O, and every attempt at something
> >> >> > mempool-like results in reproducible deadlocks. I wonder if we need to
> >> >> > figure out some more efficient data structure to communicate the completion
> >> >> > status that doesn't rely on these fairly long-living allocations from
> >> >> > the I/O path.
> >> >>
> >> >> Presumably the root cause of these deadlocks is that the raid5d thread
> >> >> has called
> >> >>   handle_stripe -> ops_run_io -> r5l_write_stripe -> r5l_log_stripe
> >> >>   -> r5l_get_meta -> r5l_new_meta
> >> >>
> >> >> and r5l_new_meta is blocked on memory allocation, which won't complete
> >> >> until some raid5 stripes get written out, which requires raid5d to do
> >> >> something more useful than sitting and waiting.
> >> >>
> >> >> I suspect a good direction towards a solution would be to allow the
> >> >> memory allocation to fail, and to cleanly propagate that failure indication
> >> >> up through r5l_log_stripe to r5l_write_stripe, which falls back to adding
> >> >> the stripe_head to ->no_space_stripes.
> >> >>
> >> >> Then we only release stripes from no_space_stripes when a memory
> >> >> allocation might succeed.
> >> >>
> >> >> There are lots of missing details, and possibly we would need a separate
> >> >> list rather than re-using no_space_stripes.
> >> >> But the key idea is that raid5d should never block (except beneath
> >> >> submit_bio on some other device) and when it cannot make progress
> >> >> without blocking, it should queue the stripe_head for later handling.
> >> >>
> >> >> Does that make sense?
> >> >
> >> > It does remove the scary __GFP_NOFAIL, but the approach is essentially
> >> > identical to a 'retry after allocation failure'. Why not just let the mm
> >> > (with __GFP_NOFAIL) do the retry then?
> >> >
> >>
> >> Because deadlocks.
> >>
> >> If raid5d is waiting for the mm to allocate memory, then it cannot
> >> retire write requests which could free up memory.
> >
> > Ok, I missed the deadlock issue. It looks like raid5d can't sleep waiting
> > for memory. That means the mempool for the metadata page/bio doesn't work
> > either, as raid5d might block and not dispatch previous IO. Your proposal
> > looks reasonable.
>
> raid5d mustn't sleep waiting for something that only raid5d can provide.
>
> A bio allocated from a mempool will be returned to that mempool when the
> IO request completes (if the end_io routine calls bio_put, which it
> does).
> So when waiting on a mempool for a bio, you are only waiting for some
> underlying device to perform IO, you are not waiting for anything that
> raid5d might do. So that is safe.
>
> Similarly the meta_page is freed in the end_io function, so it is safe
> to wait for that.
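To make the bio/meta_page side concrete, here is a rough sketch of the
pattern described above. The names r5l_bs, r5l_meta_pool and the
_sketch helpers are made up for illustration and are not the actual
fields; the real conversion in Christoph's patches may of course look
quite different.

#include <linux/bio.h>
#include <linux/mempool.h>

/* Illustrative pools: set up once at log init time. */
static struct bio_set *r5l_bs;		/* bioset_create(BIO_POOL_SIZE, 0) */
static mempool_t *r5l_meta_pool;	/* mempool_create_page_pool(2, 0)  */

static void r5l_log_endio_sketch(struct bio *bio)
{
	struct page *meta_page = bio->bi_private;

	/* Both allocations are returned from the IO completion path,
	 * so the pools are refilled by the log device finishing IO,
	 * not by anything raid5d has to do. */
	mempool_free(meta_page, r5l_meta_pool);
	bio_put(bio);			/* returns the bio to r5l_bs */
}

static struct bio *r5l_new_meta_sketch(void)
{
	/* Safe to sleep here: see the comment in the end_io above. */
	struct bio *bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, r5l_bs);
	struct page *meta_page = mempool_alloc(r5l_meta_pool, GFP_NOIO);

	bio->bi_private = meta_page;
	bio->bi_end_io = r5l_log_endio_sketch;
	return bio;
}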
> Note that the important feature of a mempool is not so much that it
> pre-allocates some memory. The important point is that when that
> pre-allocated memory is freed it can *only* be used by the owner of the
> mempool.
> By contrast, memory allocated with NOFAIL, when freed, might be used by
> any other caller anywhere in the kernel.
> So waiting on a mempool means waiting for certain well-defined events.
> Waiting in NOFAIL could end up waiting for almost anything. So it is
> much harder to reason about deadlocks.
>
> The bio and the meta_page contrast with the r5l_io_unit, which isn't
> freed by the first end_io, but remains until raid5d has done some more
> work. So raid5d cannot wait for a r5l_io_unit to be freed.

I was confused; 2 bios are indeed enough to avoid the deadlock, sorry.
Can you pick up Christoph's patches? I'll work on a patch for
r5l_io_unit.

Thanks,
Shaohua
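For the r5l_io_unit side, a very rough sketch of the fallback Neil
suggests, just to show the shape of it. The GFP flags and the reuse of
no_space_stripes are illustrative choices; the actual patch may well
want a separate list and its own retry hook.

/*
 * Sketch only, loosely following the existing raid5-cache code: let
 * the io_unit allocation fail instead of using __GFP_NOFAIL, and park
 * the stripe so raid5d never blocks here.
 */
static int r5l_write_stripe_sketch(struct r5l_log *log, struct stripe_head *sh)
{
	struct r5l_io_unit *io;

	io = kmem_cache_zalloc(log->io_kc, GFP_NOWAIT | __GFP_NOWARN);
	if (!io) {
		/* raid5d must not sleep waiting for memory that only
		 * completed stripe writes can free; queue the stripe
		 * and retry it when an io_unit is released. */
		spin_lock(&log->no_space_stripes_lock);
		list_add_tail(&sh->log_list, &log->no_space_stripes);
		spin_unlock(&log->no_space_stripes_lock);
		return -ENOMEM;
	}

	/* ...otherwise fill in the io_unit and continue as today... */
	return 0;
}

The other half would be to run the parked stripes again when an io_unit
is freed, similar to how the existing no-space handling releases stripes
once log space has been reclaimed.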