On Tue, Mar 28, 2017 at 04:12:37PM +0200, Artur Paszkiewicz wrote:
> On 03/24/2017 05:46 PM, Shaohua Li wrote:
> > On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
> >> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
> >>
> >>> Implement the calculation of partial parity for a stripe and PPL write
> >>> logging functionality. The description of PPL is added to the
> >>> documentation. More details can be found in the comments in raid5-ppl.c.
> >>>
> >>> Attach a page for holding the partial parity data to stripe_head.
> >>> Allocate it only if mddev has the MD_HAS_PPL flag set.
> >>>
> >>> Partial parity is the xor of the not-modified data chunks of a stripe and
> >>> is calculated as follows:
> >>>
> >>> - reconstruct-write case:
> >>>   xor data from all not updated disks in a stripe
> >>>
> >>> - read-modify-write case:
> >>>   xor old data and parity from all updated disks in a stripe
> >>>
> >>> Implement it using the async_tx API and integrate it into raid_run_ops().
> >>> It must be called while we still have access to the old data, so do it
> >>> when STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result
> >>> is stored in sh->ppl_page.
> >>>
> >>> Partial parity is not meaningful for a full stripe write and is not
> >>> stored in the log or used for recovery, so don't attempt to calculate it
> >>> when the stripe has STRIPE_FULL_WRITE.
> >>>
> >>> Put the PPL metadata structures in md_p.h because userspace tools
> >>> (mdadm) will also need to read/write PPL.
> >>>
> >>> Warn about using PPL with the disk volatile write-back cache enabled for
> >>> now. The warning can be removed once disk cache flushing before writing
> >>> the PPL is implemented.
> >>>
> >>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
> >>
> >> Sorry for the delay in getting to this for review...
> >>
> >>> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> >>> +					  struct stripe_head *sh)
> >>> +{
> >>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >>> +	struct ppl_io_unit *io;
> >>> +	struct ppl_header *pplhdr;
> >>> +
> >>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> >>> +	if (!io)
> >>> +		return NULL;
> >>> +
> >>> +	memset(io, 0, sizeof(*io));
> >>> +	io->log = log;
> >>> +	INIT_LIST_HEAD(&io->log_sibling);
> >>> +	INIT_LIST_HEAD(&io->stripe_list);
> >>> +	atomic_set(&io->pending_stripes, 0);
> >>> +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> >>> +
> >>> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> >>
> >> I'm trying to understand how these two mempool_alloc()s relate, and
> >> particularly why the first one needs to be GFP_ATOMIC, while the second
> >> one can safely be GFP_NOIO.
> >> I see that the allocated memory is freed in different places:
> >> header_page is freed from the bi_end_io function as soon as the write
> >> completes, while 'io' is freed later. But I'm not sure that is enough
> >> to make it safe.
> >>
> >> When working with mempools, you need to assume that the pool only
> >> contains one element, and that every time you call mempool_alloc(), it
> >> waits for that one element to be available. While that doesn't usually
> >> happen, it is possible, and if that case isn't handled correctly, the
> >> system can deadlock.
> >>
> >> If no memory is available when this mempool_alloc() is called, it will
> >> block. As it is called from the raid5d thread, the whole array will
> >> block. So this can only complete safely if the write request has
> >> already been submitted - or if there is some other workqueue which
> >> submits requests after a timeout or similar.
> >> I don't see that in the code. These ppl_io_unit structures can queue up
> >> and are only submitted later by raid5d (I think). So if raid5d waits
> >> for one to become free, it will wait forever.
> >>
> >> One easy way around this problem (assuming my understanding is correct)
> >> is to just have a single mempool which allocates both a struct
> >> ppl_io_unit and a page. You would need to define your own alloc/free
> >> routines for the pool, but that is easy enough.
> >>
> >> Then you only need a single mempool_alloc(), which can sensibly be
> >> GFP_ATOMIC.
> >> If that fails, you queue for later handling as you do now. If it
> >> succeeds, then you continue to use the memory without any risk of
> >> deadlocking.
> >
> > Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC since
> > commit 5036c39 ("raid5: allow r5l_io_unit allocations to fail"). A single
> > pool does make sense.
>
> That's right, I based this on the raid5-cache code. I like the approach
> that Neil proposed; I'll test it some more and probably send a patch soon.
> Do you think GFP_ATOMIC is necessary here or would GFP_NOWAIT be enough?

Since we can handle the allocation failure, I don't see a reason to use
GFP_ATOMIC; GFP_NOWAIT is fine with me.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
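
For reference, a minimal sketch of the single combined mempool Neil describes above (one pool element carries both the struct ppl_io_unit and its header page) might look roughly like this. The helper names ppl_io_pool_alloc/ppl_io_pool_free, the kmem_cache passed as pool_data, and the min_nr value are illustrative assumptions, not the actual follow-up patch:

	#include <linux/mempool.h>
	#include <linux/slab.h>
	#include <linux/gfp.h>
	#include <linux/mm.h>

	/*
	 * Trimmed stand-in for the real structure; only the field this
	 * sketch touches is shown.
	 */
	struct ppl_io_unit {
		struct page *header_page;
		/* ... */
	};

	/* One allocation covers everything the io_unit needs. */
	static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
	{
		struct kmem_cache *kc = pool_data;
		struct ppl_io_unit *io;

		io = kmem_cache_alloc(kc, gfp_mask);
		if (!io)
			return NULL;

		io->header_page = alloc_page(gfp_mask);
		if (!io->header_page) {
			kmem_cache_free(kc, io);
			return NULL;
		}

		return io;
	}

	static void ppl_io_pool_free(void *element, void *pool_data)
	{
		struct kmem_cache *kc = pool_data;
		struct ppl_io_unit *io = element;

		__free_page(io->header_page);
		kmem_cache_free(kc, io);
	}

	/*
	 * Pool creation (io_unit_cache is an assumed kmem_cache for
	 * struct ppl_io_unit; 64 is an arbitrary min_nr):
	 *
	 *	io_pool = mempool_create(64, ppl_io_pool_alloc,
	 *				 ppl_io_pool_free, io_unit_cache);
	 */

With such a pool, ppl_new_iounit() would do a single mempool_alloc(io_pool, GFP_NOWAIT) and, on failure, simply queue the stripe for later handling as it does now - there is no second, potentially blocking allocation on the raid5d path.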
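
Going back to the changelog at the top of the thread: the reconstruct-write and read-modify-write formulas describe the same quantity, because parity is the XOR of all data chunks, so XOR-ing the old parity with the old data of the updated chunks cancels them out and leaves the XOR of the untouched chunks. A tiny userspace demo of that identity (purely illustrative; the kernel does this over stripe pages via the async_tx API):

	#include <assert.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned char d[4] = { 0x11, 0x22, 0x44, 0x88 };  /* old data chunks */
		unsigned char parity = d[0] ^ d[1] ^ d[2] ^ d[3]; /* old parity */
		int updated[4] = { 1, 0, 1, 0 };                  /* chunks 0 and 2 rewritten */
		unsigned char pp_rcw = 0, pp_rmw = parity;
		int i;

		for (i = 0; i < 4; i++) {
			if (!updated[i])
				pp_rcw ^= d[i];  /* rcw: xor data of not-updated chunks */
			else
				pp_rmw ^= d[i];  /* rmw: xor old data of updated chunks into old parity */
		}

		assert(pp_rcw == pp_rmw);
		printf("partial parity: 0x%02x\n", (unsigned int)pp_rcw);
		return 0;
	}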