On 03/24/2017 05:46 PM, Shaohua Li wrote: > On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote: >> On Thu, Mar 09 2017, Artur Paszkiewicz wrote: >> >>> Implement the calculation of partial parity for a stripe and PPL write >>> logging functionality. The description of PPL is added to the >>> documentation. More details can be found in the comments in raid5-ppl.c. >>> >>> Attach a page for holding the partial parity data to stripe_head. >>> Allocate it only if mddev has the MD_HAS_PPL flag set. >>> >>> Partial parity is the xor of not modified data chunks of a stripe and is >>> calculated as follows: >>> >>> - reconstruct-write case: >>> xor data from all not updated disks in a stripe >>> >>> - read-modify-write case: >>> xor old data and parity from all updated disks in a stripe >>> >>> Implement it using the async_tx API and integrate into raid_run_ops(). >>> It must be called when we still have access to old data, so do it when >>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is >>> stored into sh->ppl_page. >>> >>> Partial parity is not meaningful for full stripe write and is not stored >>> in the log or used for recovery, so don't attempt to calculate it when >>> stripe has STRIPE_FULL_WRITE. >>> >>> Put the PPL metadata structures to md_p.h because userspace tools >>> (mdadm) will also need to read/write PPL. >>> >>> Warn about using PPL with enabled disk volatile write-back cache for >>> now. It can be removed once disk cache flushing before writing PPL is >>> implemented. >>> >>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx> >> >> Sorry for the delay in getting to this for review... >> >>> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log, >>> + struct stripe_head *sh) >>> +{ >>> + struct ppl_conf *ppl_conf = log->ppl_conf; >>> + struct ppl_io_unit *io; >>> + struct ppl_header *pplhdr; >>> + >>> + io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC); >>> + if (!io) >>> + return NULL; >>> + >>> + memset(io, 0, sizeof(*io)); >>> + io->log = log; >>> + INIT_LIST_HEAD(&io->log_sibling); >>> + INIT_LIST_HEAD(&io->stripe_list); >>> + atomic_set(&io->pending_stripes, 0); >>> + bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS); >>> + >>> + io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO); >> >> I'm trying to understand how these two mempool_alloc()s relate, and >> particularly why the first one needs to be GFP_ATOMIC, while the second >> one can safely be GFP_NOIO. >> I see that the allocated memory is freed in different places: >> header_page is called from the bi_endio function as soon as the write >> completes, while 'io' is freed later. But I'm not sure that is enough >> to make it safe. >> >> When working with mempools, you need to assume that the pool only >> contains one element, and that every time you call mempool_alloc(), it >> waits for that one element to be available. While that doesn't usually >> happen, it is possible and if that case isn't handled correctly, the >> system can deadlock. >> >> If no memory is available when this mempool_alloc() is called, it will >> block. As it is called from the raid5d thread, the whole array will >> block. So this can only complete safely is the write request has >> already been submitted - or if there is some other workqueue which >> submit requests after a timeout or similar. >> I don't see that in the code. These ppl_io_unit structures can queue up >> and are only submitted later by raid5d (I think). So if raid5d waits >> for one to become free, it will wait forever. >> >> One easy way around this problem (assuming my understanding is correct) >> is to just have a single mempool which allocates both a struct >> ppl_io_unit and a page. You would need to define you own alloc/free >> routines for the pool but that is easy enough. >> >> Then you only need a single mempool_alloc(), which can sensibly be >> GFP_ATOMIC. >> If that fails, you queue for later handling as you do now. If it >> succeeds, then you continue to use the memory without any risk of >> deadlocking. > > Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC with > commit: 5036c39(raid5: allow r5l_io_unit allocations to fail). A single pool > does make sense. That's right, I based this on raid5-cache code. I like the approach that Neil proposed, I'll test it some more and probably send a patch soon. Do you think GFP_ATOMIC is necessary here or would GFP_NOWAIT be enough? Thanks, Artur -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html