On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
> 
> > Implement the calculation of partial parity for a stripe and PPL write
> > logging functionality. The description of PPL is added to the
> > documentation. More details can be found in the comments in raid5-ppl.c.
> >
> > Attach a page for holding the partial parity data to stripe_head.
> > Allocate it only if mddev has the MD_HAS_PPL flag set.
> >
> > Partial parity is the xor of the not-modified data chunks of a stripe
> > and is calculated as follows:
> >
> > - reconstruct-write case:
> >   xor data from all not-updated disks in a stripe
> >
> > - read-modify-write case:
> >   xor old data and parity from all updated disks in a stripe
> >
> > Implement it using the async_tx API and integrate it into raid_run_ops().
> > It must be called when we still have access to the old data, so do it
> > when STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result
> > is stored in sh->ppl_page.
> >
> > Partial parity is not meaningful for a full stripe write and is not
> > stored in the log or used for recovery, so don't attempt to calculate it
> > when the stripe has STRIPE_FULL_WRITE.
> >
> > Put the PPL metadata structures in md_p.h because userspace tools
> > (mdadm) will also need to read/write PPL.
> >
> > Warn about using PPL with the disk volatile write-back cache enabled for
> > now. The warning can be removed once disk cache flushing before writing
> > PPL is implemented.
> >
> > Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
> 
> Sorry for the delay in getting to this for review...
> 
> > +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> > +					   struct stripe_head *sh)
> > +{
> > +	struct ppl_conf *ppl_conf = log->ppl_conf;
> > +	struct ppl_io_unit *io;
> > +	struct ppl_header *pplhdr;
> > +
> > +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> > +	if (!io)
> > +		return NULL;
> > +
> > +	memset(io, 0, sizeof(*io));
> > +	io->log = log;
> > +	INIT_LIST_HEAD(&io->log_sibling);
> > +	INIT_LIST_HEAD(&io->stripe_list);
> > +	atomic_set(&io->pending_stripes, 0);
> > +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> > +
> > +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> 
> I'm trying to understand how these two mempool_alloc()s relate, and
> particularly why the first one needs to be GFP_ATOMIC, while the second
> one can safely be GFP_NOIO.
> I see that the allocated memory is freed in different places: header_page
> is freed from the bi_end_io function as soon as the write completes,
> while 'io' is freed later. But I'm not sure that is enough to make it
> safe.
> 
> When working with mempools, you need to assume that the pool only
> contains one element, and that every time you call mempool_alloc(), it
> waits for that one element to be available. While that doesn't usually
> happen, it is possible, and if that case isn't handled correctly, the
> system can deadlock.
> 
> If no memory is available when this mempool_alloc() is called, it will
> block. As it is called from the raid5d thread, the whole array will
> block. So this can only complete safely if the write request has already
> been submitted - or if there is some other workqueue which submits
> requests after a timeout or similar.
> I don't see that in the code. These ppl_io_unit structures can queue up
> and are only submitted later by raid5d (I think). So if raid5d waits for
> one to become free, it will wait forever.
> 
> One easy way around this problem (assuming my understanding is correct)
> is to just have a single mempool which allocates both a struct
> ppl_io_unit and a page. You would need to define your own alloc/free
> routines for the pool, but that is easy enough.
> 
> Then you only need a single mempool_alloc(), which can sensibly be
> GFP_ATOMIC.
> If that fails, you queue for later handling as you do now. If it
> succeeds, then you continue to use the memory without any risk of
> deadlocking.

Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC since
commit 5036c39 ("raid5: allow r5l_io_unit allocations to fail"). A single
pool does make sense.

Thanks,
Shaohua
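
A minimal sketch of the single-pool idea described above: the mempool
element bundles the ppl_io_unit with its header page via custom alloc/free
callbacks, so ppl_new_iounit() needs only one GFP_ATOMIC allocation. The
callback names, the io_kc kmem_cache passed as pool_data, and the pool size
are illustrative assumptions, not code from the actual patch.

#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/gfp.h>

/*
 * Sketch only: one mempool element = a ppl_io_unit with its header page
 * already attached.  Names (ppl_io_pool_alloc/ppl_io_pool_free, io_kc,
 * PPL_IO_POOL_SIZE) are assumptions for illustration.
 */
static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
{
	struct kmem_cache *kc = pool_data;
	struct ppl_io_unit *io;

	io = kmem_cache_alloc(kc, gfp_mask);
	if (!io)
		return NULL;

	/* Allocate the PPL header page together with the io_unit. */
	io->header_page = alloc_page(gfp_mask);
	if (!io->header_page) {
		kmem_cache_free(kc, io);
		return NULL;
	}

	return io;
}

static void ppl_io_pool_free(void *element, void *pool_data)
{
	struct kmem_cache *kc = pool_data;
	struct ppl_io_unit *io = element;

	__free_page(io->header_page);
	kmem_cache_free(kc, io);
}

	/* At log init time, e.g. in ppl_init_log(): */
	ppl_conf->io_pool = mempool_create(PPL_IO_POOL_SIZE, ppl_io_pool_alloc,
					   ppl_io_pool_free, ppl_conf->io_kc);

With a pool like this, ppl_new_iounit() would make a single
mempool_alloc(io_pool, GFP_ATOMIC) call; on failure the stripe is queued
for retry as before, and there is no second, potentially blocking
allocation for raid5d to deadlock against.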