On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
> 
> > Implement the calculation of partial parity for a stripe and PPL write
> > logging functionality. The description of PPL is added to the
> > documentation. More details can be found in the comments in raid5-ppl.c.
> >
> > Attach a page for holding the partial parity data to stripe_head.
> > Allocate it only if mddev has the MD_HAS_PPL flag set.
> >
> > Partial parity is the xor of the not-modified data chunks of a stripe
> > and is calculated as follows:
> >
> > - reconstruct-write case:
> >   xor data from all not-updated disks in a stripe
> >
> > - read-modify-write case:
> >   xor old data and parity from all updated disks in a stripe
> >
> > Implement it using the async_tx API and integrate it into raid_run_ops().
> > It must be called when we still have access to the old data, so do it
> > when STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result
> > is stored in sh->ppl_page.
> >
> > Partial parity is not meaningful for a full stripe write and is not
> > stored in the log or used for recovery, so don't attempt to calculate it
> > when the stripe has STRIPE_FULL_WRITE.
> >
> > Put the PPL metadata structures in md_p.h because userspace tools
> > (mdadm) will also need to read/write PPL.
> >
> > Warn about using PPL with the disk volatile write-back cache enabled for
> > now. The warning can be removed once disk cache flushing before writing
> > PPL is implemented.
> >
> > Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
> 
> Sorry for the delay in getting to this for review...
> 
> > +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> > +					   struct stripe_head *sh)
> > +{
> > +	struct ppl_conf *ppl_conf = log->ppl_conf;
> > +	struct ppl_io_unit *io;
> > +	struct ppl_header *pplhdr;
> > +
> > +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> > +	if (!io)
> > +		return NULL;
> > +
> > +	memset(io, 0, sizeof(*io));
> > +	io->log = log;
> > +	INIT_LIST_HEAD(&io->log_sibling);
> > +	INIT_LIST_HEAD(&io->stripe_list);
> > +	atomic_set(&io->pending_stripes, 0);
> > +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> > +
> > +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> 
> I'm trying to understand how these two mempool_alloc()s relate, and
> particularly why the first one needs to be GFP_ATOMIC, while the second
> one can safely be GFP_NOIO.
> I see that the allocated memory is freed in different places: header_page
> is freed from the bi_end_io function as soon as the write completes,
> while 'io' is freed later. But I'm not sure that is enough to make it
> safe.
> 
> When working with mempools, you need to assume that the pool only
> contains one element, and that every time you call mempool_alloc(), it
> waits for that one element to be available. While that doesn't usually
> happen, it is possible, and if that case isn't handled correctly, the
> system can deadlock.
> 
> If no memory is available when this mempool_alloc() is called, it will
> block. As it is called from the raid5d thread, the whole array will
> block. So this can only complete safely if the write request has already
> been submitted - or if there is some other workqueue which submits
> requests after a timeout or similar.
> I don't see that in the code. These ppl_io_unit structures can queue up
> and are only submitted later by raid5d (I think). So if raid5d waits for
> one to become free, it will wait forever.
> 
> One easy way around this problem (assuming my understanding is correct)
> is to just have a single mempool which allocates both a struct
> ppl_io_unit and a page. You would need to define your own alloc/free
> routines for the pool, but that is easy enough.
> 
> Then you only need a single mempool_alloc(), which can sensibly be
> GFP_ATOMIC.
> If that fails, you queue for later handling as you do now. If it
> succeeds, then you continue to use the memory without any risk of
> deadlocking.

Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC since
commit 5036c39 ("raid5: allow r5l_io_unit allocations to fail"). A single
pool does make sense.

Thanks,
Shaohua
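
A minimal sketch of the single-pool idea described above: the mempool
element bundles the ppl_io_unit with its header page via custom alloc/free
callbacks, so ppl_new_iounit() needs only one GFP_ATOMIC allocation. The
callback names, the io_kc kmem_cache passed as pool_data, and the pool size
are illustrative assumptions, not code from the actual patch.

#include <linux/mempool.h>
#include <linux/slab.h>
#include <linux/gfp.h>

/*
 * Sketch only: one mempool element = a ppl_io_unit with its header page
 * already attached.  Names (ppl_io_pool_alloc/ppl_io_pool_free, io_kc,
 * PPL_IO_POOL_SIZE) are assumptions for illustration.
 */
static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
{
	struct kmem_cache *kc = pool_data;
	struct ppl_io_unit *io;

	io = kmem_cache_alloc(kc, gfp_mask);
	if (!io)
		return NULL;

	/* Allocate the PPL header page together with the io_unit. */
	io->header_page = alloc_page(gfp_mask);
	if (!io->header_page) {
		kmem_cache_free(kc, io);
		return NULL;
	}

	return io;
}

static void ppl_io_pool_free(void *element, void *pool_data)
{
	struct kmem_cache *kc = pool_data;
	struct ppl_io_unit *io = element;

	__free_page(io->header_page);
	kmem_cache_free(kc, io);
}

	/* At log init time, e.g. in ppl_init_log(): */
	ppl_conf->io_pool = mempool_create(PPL_IO_POOL_SIZE, ppl_io_pool_alloc,
					   ppl_io_pool_free, ppl_conf->io_kc);

With a pool like this, ppl_new_iounit() would make a single
mempool_alloc(io_pool, GFP_ATOMIC) call; on failure the stripe is queued
for retry as before, and there is no second, potentially blocking
allocation for raid5d to deadlock against.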