Re: [PATCH v5 5/7] raid5-ppl: load and recover the log

On Thu, Mar 09, 2017 at 10:00:01AM +0100, Artur Paszkiewicz wrote:
> Load the log from each disk when starting the array and recover if the
> array is dirty.
> 
> The initial empty PPL is written by mdadm. When loading the log we
> verify the header checksum and signature. For external metadata arrays
> the signature is verified in userspace, so here we read it from the
> header, verifying only that it matches on all disks, and use it later when
> writing PPL.
> 
> In addition to the header checksum, each header entry also contains a
> checksum of its partial parity data. If the header is valid, recovery is
> performed for each entry until an invalid entry is found. If the array
> is not degraded and recovery using PPL fully succeeds, there is no need
> to resync the array because data and parity will be consistent, so in
> this case resync will be disabled.
> 
> Due to compatibility with IMSM implementations on other systems, we
> can't assume that the recovery data block size is always 4K. Writes
> generated by MD raid5 don't have this issue, but when recovering PPL
> written in other environments it is possible to have entries with
> 512-byte sector granularity. The recovery code takes this into account
> and also the logical sector size of the underlying drives.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
> ---
>  drivers/md/raid5-ppl.c | 497 +++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c     |   5 +-
>  2 files changed, 501 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index 92783586743d..548d1028a3ce 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -103,6 +103,10 @@ struct ppl_conf {
>  	mempool_t *io_pool;
>  	struct bio_set *bs;
>  	mempool_t *meta_pool;
> +
> +	/* used only for recovery */
> +	int recovered_entries;
> +	int mismatch_count;
>  };
>  
>  struct ppl_log {
> @@ -514,6 +518,482 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>  		ppl_io_unit_finished(io);
>  }
>  
> +static void ppl_xor(int size, struct page *page1, struct page *page2,
> +		    struct page *page_result)
> +{

I'd remove the page_result parameter; it should always be page1. That will also
make it clear why we need ASYNC_TX_XOR_DROP_DST.

> +	struct async_submit_ctl submit;
> +	struct dma_async_tx_descriptor *tx;
> +	struct page *xor_srcs[] = { page1, page2 };
> +
> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
> +			  NULL, NULL, NULL, NULL);
> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
> +
> +	async_tx_quiesce(&tx);
> +}
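
I.e. something along these lines (untested sketch, keeping the names and the
async_xor() call from the patch, just dropping the extra parameter):

static void ppl_xor(int size, struct page *page1, struct page *page2)
{
	struct async_submit_ctl submit;
	struct dma_async_tx_descriptor *tx;
	/* page1 is both the destination and a source, hence DROP_DST */
	struct page *xor_srcs[] = { page1, page2 };

	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
			  NULL, NULL, NULL, NULL);
	tx = async_xor(page1, xor_srcs, 0, 2, size, &submit);

	async_tx_quiesce(&tx);
}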

...
> +			ret = ppl_recover_entry(log, e, ppl_sector);
> +			if (ret)
> +				goto out;
> +			ppl_conf->recovered_entries++;
> +		}
> +
> +		ppl_sector += ppl_entry_sectors;
> +	}
> +
> +	/* flush the disk cache after recovery if necessary */
> +	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {

The block layer will handle this, so you don't need to check the flag here.

> +		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
> +
> +		bio->bi_bdev = rdev->bdev;
> +		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> +		ret = submit_bio_wait(bio);
> +		bio_put(bio);

please use blkdev_issue_flush() instead.

> +	}
> +out:
> +	__free_page(page);
> +	return ret;
> +}
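
With both changes the whole cache-flush block above boils down to a single call
(untested; assuming the current three-argument blkdev_issue_flush() signature):

	/* no-op at the block layer if the device has no volatile write cache */
	ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL, NULL);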
> +
...
> +static int ppl_load(struct ppl_conf *ppl_conf)
> +{
> +	int ret = 0;
> +	u32 signature = 0;
> +	bool signature_set = false;
> +	int i;
> +
> +	for (i = 0; i < ppl_conf->count; i++) {
> +		struct ppl_log *log = &ppl_conf->child_logs[i];
> +
> +		/* skip missing drive */
> +		if (!log->rdev)
> +			continue;
> +
> +		ret = ppl_load_distributed(log);
> +		if (ret)
> +			break;

Not sure about the strategy here, but if one disk fails, why don't we continue
the recovery from the other disks? That way we can at least recover more data.
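
Something like this rough, untested sketch is what I have in mind (the local
'err' is just for illustration, so one failed log doesn't stop the loop; whether
that is actually safe depends on how the signature/mismatch handling below treats
a partially loaded log):

	for (i = 0; i < ppl_conf->count; i++) {
		struct ppl_log *log = &ppl_conf->child_logs[i];
		int err;

		/* skip missing drive */
		if (!log->rdev)
			continue;

		err = ppl_load_distributed(log);
		if (err) {
			/* remember the failure, but keep loading the other logs */
			ret = err;
			continue;
		}

		/* signature check as in the patch ... */
	}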

Thanks,
Shaohua