Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation

Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx> · Tue, 28 Mar 2017 16:12:37 +0200

On 03/24/2017 05:46 PM, Shaohua Li wrote:
> On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
>> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
>>
>>> Implement the calculation of partial parity for a stripe and PPL write
>>> logging functionality. The description of PPL is added to the
>>> documentation. More details can be found in the comments in raid5-ppl.c.
>>>
>>> Attach a page for holding the partial parity data to stripe_head.
>>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>>
>>> Partial parity is the xor of not modified data chunks of a stripe and is
>>> calculated as follows:
>>>
>>> - reconstruct-write case:
>>>   xor data from all not updated disks in a stripe
>>>
>>> - read-modify-write case:
>>>   xor old data and parity from all updated disks in a stripe
>>>
>>> Implement it using the async_tx API and integrate into raid_run_ops().
>>> It must be called when we still have access to old data, so do it when
>>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>>> stored into sh->ppl_page.
>>>
>>> Partial parity is not meaningful for full stripe write and is not stored
>>> in the log or used for recovery, so don't attempt to calculate it when
>>> stripe has STRIPE_FULL_WRITE.
>>>
>>> Put the PPL metadata structures to md_p.h because userspace tools
>>> (mdadm) will also need to read/write PPL.
>>>
>>> Warn about using PPL with enabled disk volatile write-back cache for
>>> now. It can be removed once disk cache flushing before writing PPL is
>>> implemented.
>>>
>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
>>
>> Sorry for the delay in getting to this for review...
>>
>>> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>>> +					  struct stripe_head *sh)
>>> +{
>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>> +	struct ppl_io_unit *io;
>>> +	struct ppl_header *pplhdr;
>>> +
>>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
>>> +	if (!io)
>>> +		return NULL;
>>> +
>>> +	memset(io, 0, sizeof(*io));
>>> +	io->log = log;
>>> +	INIT_LIST_HEAD(&io->log_sibling);
>>> +	INIT_LIST_HEAD(&io->stripe_list);
>>> +	atomic_set(&io->pending_stripes, 0);
>>> +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
>>> +
>>> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
>>
>> I'm trying to understand how these two mempool_alloc()s relate, and
>> particularly why the first one needs to be GFP_ATOMIC, while the second
>> one can safely be GFP_NOIO.
>> I see that the allocated memory is freed in different places:
>> header_page is called from the bi_endio function as soon as the write
>> completes, while 'io' is freed later.  But I'm not sure that is enough
>> to make it safe.
>>
>> When working with mempools, you need to assume that the pool only
>> contains one element, and that every time you call mempool_alloc(), it
>> waits for that one element to be available.  While that doesn't usually
>> happen, it is possible and if that case isn't handled correctly, the
>> system can deadlock.
>>
>> If no memory is available when this mempool_alloc() is called, it will
>> block.  As it is called from the raid5d thread, the whole array will
>> block.  So this can only complete safely is the write request has
>> already been submitted - or if there is some other workqueue which
>> submit requests after a timeout or similar.
>> I don't see that in the code.  These ppl_io_unit structures can queue up
>> and are only submitted later by raid5d (I think).  So if raid5d waits
>> for one to become free, it will wait forever.
>>
>> One easy way around this problem (assuming my understanding is correct)
>> is to just have a single mempool which allocates both a struct
>> ppl_io_unit and a page.  You would need to define you own alloc/free
>> routines for the pool but that is easy enough.
>>
>> Then you only need a single mempool_alloc(), which can sensibly be
>> GFP_ATOMIC.
>> If that fails, you queue for later handling as you do now.  If it
>> succeeds, then you continue to use the memory without any risk of
>> deadlocking.
> 
> Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC with
> commit: 5036c39(raid5: allow r5l_io_unit allocations to fail). A single pool
> does make sense.

That's right, I based this on raid5-cache code. I like the approach that
Neil proposed, I'll test it some more and probably send a patch soon. Do
you think GFP_ATOMIC is necessary here or would GFP_NOWAIT be enough?

Thanks,
Artur
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html