On 03/01/2017 06:59 PM, Shaohua Li wrote:
> On Wed, Mar 01, 2017 at 05:14:18PM +0100, Artur Paszkiewicz wrote:
>> On 02/28/2017 02:12 AM, Shaohua Li wrote:
>>> On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
>>>> Load the log from each disk when starting the array and recover if the
>>>> array is dirty.
>>>>
>>>> The initial empty PPL is written by mdadm. When loading the log we
>>>> verify the header checksum and signature. For external metadata arrays
>>>> the signature is verified in userspace, so here we read it from the
>>>> header, verifying only that it matches on all disks, and use it later
>>>> when writing PPL.
>>>>
>>>> In addition to the header checksum, each header entry also contains a
>>>> checksum of its partial parity data. If the header is valid, recovery is
>>>> performed for each entry until an invalid entry is found. If the array
>>>> is not degraded and recovery using PPL fully succeeds, there is no need
>>>> to resync the array because data and parity will be consistent, so in
>>>> this case resync will be disabled.
>>>>
>>>> For compatibility with IMSM implementations on other systems, we
>>>> can't assume that the recovery data block size is always 4K. Writes
>>>> generated by MD raid5 don't have this issue, but when recovering PPL
>>>> written in other environments it is possible to have entries with
>>>> 512-byte sector granularity. The recovery code takes this into account,
>>>> as well as the logical sector size of the underlying drives.
>>>>
>>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
>>>> ---
>>>>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  drivers/md/raid5.c     |   5 +-
>>>>  2 files changed, 487 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>>>> index a00cabf1adf6..a7693353243a 100644
>>>> --- a/drivers/md/raid5-ppl.c
>>>> +++ b/drivers/md/raid5-ppl.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include <linux/blkdev.h>
>>>>  #include <linux/slab.h>
>>>>  #include <linux/crc32c.h>
>>>> +#include <linux/async_tx.h>
>>>>  #include <linux/raid/md_p.h>
>>>>  #include "md.h"
>>>>  #include "raid5.h"
>>>> @@ -84,6 +85,10 @@ struct ppl_conf {
>>>> 	mempool_t *io_pool;
>>>> 	struct bio_set *bs;
>>>> 	mempool_t *meta_pool;
>>>> +
>>>> +	/* used only for recovery */
>>>> +	int recovered_entries;
>>>> +	int mismatch_count;
>>>>  };
>>>>
>>>>  struct ppl_log {
>>>> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>>>> 		ppl_io_unit_finished(io);
>>>>  }
>>>>
>>>> +static void ppl_xor(int size, struct page *page1, struct page *page2,
>>>> +		    struct page *page_result)
>>>> +{
>>>> +	struct async_submit_ctl submit;
>>>> +	struct dma_async_tx_descriptor *tx;
>>>> +	struct page *xor_srcs[] = { page1, page2 };
>>>> +
>>>> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>>>> +			  NULL, NULL, NULL, NULL);
>>>> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
>>>> +
>>>> +	async_tx_quiesce(&tx);
>>>> +}
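A side note for readers not fluent in the async_tx API: ASYNC_TX_XOR_DROP_DST
excludes the destination page from the xor sources, and async_tx_quiesce()
waits for the transaction to finish, so ppl_xor() behaves like a synchronous
two-source xor. A rough plain-C sketch of the same operation, for illustration
only (the real code works on pages and may be offloaded, not a byte loop):

	/* illustrative sketch only: dst = src1 ^ src2, byte by byte */
	static void xor_sketch(int size, const unsigned char *src1,
			       const unsigned char *src2, unsigned char *dst)
	{
		int i;

		for (i = 0; i < size; i++)
			dst[i] = src1[i] ^ src2[i];
	}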
>>>> +
>>>> +/*
>>>> + * PPL recovery strategy: xor partial parity and data from all modified data
>>>> + * disks within a stripe and write the result as the new stripe parity. If all
>>>> + * stripe data disks are modified (full stripe write), no partial parity is
>>>> + * available, so just xor the data disks.
>>>> + *
>>>> + * Recovery of a PPL entry shall occur only if all modified data disks are
>>>> + * available and read from all of them succeeds.
>>>> + *
>>>> + * A PPL entry applies to a stripe; the partial parity size for an entry is at
>>>> + * most the size of the chunk. Examples of possible cases for a single entry:
>>>> + *
>>>> + * case 0: single data disk write:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>>>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = data_size
>>>> + *
>>>> + * case 1: more than one data disk write:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>>>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = data_size / modified_data_disks
>>>> + *
>>>> + * case 2: write to all data disks (also full stripe write):
>>>> + *   data0    data1    data2                parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ |           | (no change)        |
>>>> + * | -data- | -data- | -data- | --------> | xor all data       |
>>>> + * | ------ | ------ | ------ | --------> | (no change)        |
>>>> + * | ------ | ------ | ------ |           | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = 0
>>>> + *
>>>> + * The following cases are possible only in other implementations. The recovery
>>>> + * code can handle them, but they are not generated at runtime because they can
>>>> + * be reduced to cases 0, 1 and 2:
>>>> + *
>>>> + * case 3:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
>>>> + * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
>>>> + * | -data- | -data- | -data- | | -- | -> | xor all data       |
>>>> + * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * pp_size = chunk_size
>>>> + *
>>>> + * case 4:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
>>>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>>>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>>>> + * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * pp_size = chunk_size
>>>> + */
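One detail the diagrams take for granted: the partial parity stored in an
entry is the xor of the data chunks that were *not* modified by the write (for
read-modify-write it is computed as old parity xor old data of the modified
chunks, which comes to the same value). That is why xoring pp with the new
data of all modified disks yields parity consistent with the whole stripe. A
toy one-byte demonstration of case 1 (made-up values, not from the patch):

	#include <assert.h>

	int main(void)
	{
		unsigned char data0 = 0xa5, data1 = 0x3c;	/* new data */
		unsigned char data2 = 0x0f;			/* not modified */
		unsigned char pp = data2;	/* xor of unmodified chunks */
		unsigned char parity = data0 ^ data1 ^ pp;

		/* matches parity computed over the full stripe */
		assert(parity == (data0 ^ data1 ^ data2));
		return 0;
	}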
>>>> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>>>> +			     sector_t ppl_sector)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct mddev *mddev = ppl_conf->mddev;
>>>> +	struct r5conf *conf = mddev->private;
>>>> +	int block_size = ppl_conf->block_size;
>>>> +	struct page *pages;
>>>> +	struct page *page1;
>>>> +	struct page *page2;
>>>> +	sector_t r_sector_first;
>>>> +	sector_t r_sector_last;
>>>> +	int strip_sectors;
>>>> +	int data_disks;
>>>> +	int i;
>>>> +	int ret = 0;
>>>> +	char b[BDEVNAME_SIZE];
>>>> +	unsigned int pp_size = le32_to_cpu(e->pp_size);
>>>> +	unsigned int data_size = le32_to_cpu(e->data_size);
>>>> +
>>>> +	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
>>>> +
>>>> +	if ((pp_size >> 9) < conf->chunk_sectors) {
>>>> +		if (pp_size > 0) {
>>>> +			data_disks = data_size / pp_size;
>>>> +			strip_sectors = pp_size >> 9;
>>>> +		} else {
>>>> +			data_disks = conf->raid_disks - conf->max_degraded;
>>>> +			strip_sectors = (data_size >> 9) / data_disks;
>>>> +		}
>>>> +		r_sector_last = r_sector_first +
>>>> +				(data_disks - 1) * conf->chunk_sectors +
>>>> +				strip_sectors;
>>>> +	} else {
>>>> +		data_disks = conf->raid_disks - conf->max_degraded;
>>>> +		strip_sectors = conf->chunk_sectors;
>>>> +		r_sector_last = r_sector_first + (data_size >> 9);
>>>> +	}
>>>> +
>>>> +	pages = alloc_pages(GFP_KERNEL, 1);
>>>
>>> order != 0 allocations should be avoided if possible. It doesn't sound
>>> necessary here.
>>
>> OK, I will allocate and free them separately.
>>
>>>> +	if (!pages)
>>>> +		return -ENOMEM;
>>>> +	page1 = pages;
>>>> +	page2 = pages + 1;
>>>> +
>>>> +	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
>>>> +		 (unsigned long long)r_sector_first,
>>>> +		 (unsigned long long)r_sector_last);
>>>> +
>>>> +	/* if start and end are 4k aligned, use a 4k block */
>>>> +	if (block_size == 512 &&
>>>> +	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
>>>> +	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
>>>> +		block_size = STRIPE_SIZE;
>>>> +
>>>> +	/* iterate through blocks in strip */
>>>> +	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
>>>> +		bool update_parity = false;
>>>> +		sector_t parity_sector;
>>>> +		struct md_rdev *parity_rdev;
>>>> +		struct stripe_head sh;
>>>> +		int disk;
>>>> +		int indent = 0;
>>>> +
>>>> +		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
>>>> +		indent += 2;
>>>> +
>>>> +		memset(page_address(page1), 0, PAGE_SIZE);
>>>> +
>>>> +		/* iterate through data member disks */
>>>> +		for (disk = 0; disk < data_disks; disk++) {
>>>> +			int dd_idx;
>>>> +			struct md_rdev *rdev;
>>>> +			sector_t sector;
>>>> +			sector_t r_sector = r_sector_first + i +
>>>> +					    (disk * conf->chunk_sectors);
>>>> +
>>>> +			pr_debug("%s:%*s data member disk %d start\n",
>>>> +				 __func__, indent, "", disk);
>>>> +			indent += 2;
>>>> +
>>>> +			if (r_sector >= r_sector_last) {
>>>> +				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
>>>> +					 __func__, indent, "",
>>>> +					 (unsigned long long)r_sector);
>>>> +				indent -= 2;
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			update_parity = true;
>>>> +
>>>> +			/* map raid sector to member disk */
>>>> +			sector = raid5_compute_sector(conf, r_sector, 0,
>>>> +						      &dd_idx, NULL);
>>>> +			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
>>>> +				 __func__, indent, "",
>>>> +				 (unsigned long long)r_sector, dd_idx,
>>>> +				 (unsigned long long)sector);
>>>> +
>>>> +			rdev = conf->disks[dd_idx].rdev;
>>>> +			if (!rdev) {
>>>> +				pr_debug("%s:%*s data member disk %d missing\n",
>>>> +					 __func__, indent, "", dd_idx);
>>>> +				update_parity = false;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
>>>> +				 __func__, indent, "", bdevname(rdev->bdev, b),
>>>> +				 (unsigned long long)sector);
>>>> +			if (!sync_page_io(rdev, sector, block_size, page2,
>>>> +					  REQ_OP_READ, 0, false)) {
>>>> +				md_error(mddev, rdev);
>>>> +				pr_debug("%s:%*s read failed!\n", __func__,
>>>> +					 indent, "");
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			ppl_xor(block_size, page1, page2, page1);
>>>> +
>>>> +			indent -= 2;
>>>> +		}
>>>> +
>>>> +		if (!update_parity)
>>>> +			continue;
>>>> +
>>>> +		if (pp_size > 0) {
>>>> +			pr_debug("%s:%*s reading pp disk sector %llu\n",
>>>> +				 __func__, indent, "",
>>>> +				 (unsigned long long)(ppl_sector + i));
>>>> +			if (!sync_page_io(log->rdev,
>>>> +					  ppl_sector - log->rdev->data_offset + i,
>>>> +					  block_size, page2, REQ_OP_READ, 0,
>>>> +					  false)) {
>>>> +				pr_debug("%s:%*s read failed!\n", __func__,
>>>> +					 indent, "");
>>>> +				md_error(mddev, log->rdev);
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			ppl_xor(block_size, page1, page2, page1);
>>>> +		}
>>>> +
>>>> +		/* map raid sector to parity disk */
>>>> +		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
>>>> +						     0, &disk, &sh);
>>>> +		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
>>>> +		parity_rdev = conf->disks[sh.pd_idx].rdev;
>>>> +
>>>> +		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
>>>> +		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
>>>> +			 __func__, indent, "",
>>>> +			 (unsigned long long)parity_sector,
>>>> +			 bdevname(parity_rdev->bdev, b));
>>>> +		if (!sync_page_io(parity_rdev, parity_sector, block_size,
>>>> +				  page1, REQ_OP_WRITE, 0, false)) {
>>>> +			pr_debug("%s:%*s parity write error!\n", __func__,
>>>> +				 indent, "");
>>>> +			md_error(mddev, parity_rdev);
>>>> +			ret = -EIO;
>>>> +			goto out;
>>>> +		}
>>>> +	}
>>>> +out:
>>>> +	__free_pages(pages, 1);
>>>> +	return ret;
>>>> +}
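For anyone following along, the pp_size/data_size arithmetic at the top of
ppl_recover_entry() is probably the least obvious part. Here is the same
geometry logic pulled out into a standalone sketch (the struct and helper
names are made up for illustration, they are not in the patch):

	/* sketch only: how an entry's sizes map to recovery geometry */
	struct entry_geometry {
		int data_disks;		/* data disks the entry spans */
		int strip_sectors;	/* sectors to recover per disk */
	};

	static void entry_geometry(unsigned int pp_size, unsigned int data_size,
				   int chunk_sectors, int total_data_disks,
				   struct entry_geometry *g)
	{
		if ((pp_size >> 9) < chunk_sectors) {
			if (pp_size > 0) {
				/* cases 0 and 1: pp covers one strip per disk */
				g->data_disks = data_size / pp_size;
				g->strip_sectors = pp_size >> 9;
			} else {
				/* case 2: full stripe write, no pp stored */
				g->data_disks = total_data_disks;
				g->strip_sectors = (data_size >> 9) / g->data_disks;
			}
		} else {
			/* cases 3 and 4: pp spans a whole chunk */
			g->data_disks = total_data_disks;
			g->strip_sectors = chunk_sectors;
		}
	}

For example, a case 1 entry with data_size = 16KiB and pp_size = 8KiB yields
data_disks = 2 and strip_sectors = 16, i.e. two disks with 8KiB to recover
on each.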
>>>> +
>>>> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	struct mddev *mddev = rdev->mddev;
>>>> +	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
>>>> +	struct page *page;
>>>> +	int i;
>>>> +	int ret = 0;
>>>> +
>>>> +	page = alloc_page(GFP_KERNEL);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	/* iterate through all PPL entries saved */
>>>> +	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
>>>> +		struct ppl_header_entry *e = &pplhdr->entries[i];
>>>> +		u32 pp_size = le32_to_cpu(e->pp_size);
>>>> +		sector_t sector = ppl_sector;
>>>> +		int ppl_entry_sectors = pp_size >> 9;
>>>> +		u32 crc, crc_stored;
>>>> +
>>>> +		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
>>>> +			 __func__, rdev->raid_disk, i,
>>>> +			 (unsigned long long)ppl_sector, pp_size);
>>>> +
>>>> +		crc = ~0;
>>>> +		crc_stored = le32_to_cpu(e->checksum);
>>>> +
>>>> +		/* read partial parity for this entry and calculate its checksum */
>>>> +		while (pp_size) {
>>>> +			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
>>>> +
>>>> +			if (!sync_page_io(rdev, sector - rdev->data_offset,
>>>> +					  s, page, REQ_OP_READ, 0, false)) {
>>>> +				md_error(mddev, rdev);
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			crc = crc32c_le(crc, page_address(page), s);
>>>> +
>>>> +			pp_size -= s;
>>>> +			sector += s >> 9;
>>>> +		}
>>>> +
>>>> +		crc = ~crc;
>>>> +
>>>> +		if (crc != crc_stored) {
>>>> +			/*
>>>> +			 * Don't recover this entry if the checksum does not
>>>> +			 * match, but keep going and try to recover other
>>>> +			 * entries.
>>>> +			 */
>>>> +			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
>>>> +				 __func__, crc_stored, crc);
>>>> +			ppl_conf->mismatch_count++;
>>>> +		} else {
>>>> +			ret = ppl_recover_entry(log, e, ppl_sector);
>>>> +			if (ret)
>>>> +				goto out;
>>>> +			ppl_conf->recovered_entries++;
>>>> +		}
>>>> +
>>>> +		ppl_sector += ppl_entry_sectors;
>>>> +	}
>>>> +out:
>>>> +	__free_page(page);
>>>> +	return ret;
>>>> +}
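One property the entry loop above relies on: crc32c can be accumulated chunk
by chunk, so checksumming the partial parity one page at a time gives the same
result as a one-shot calculation over the whole entry. Schematically (a sketch
using the same ~crc32c_le(~0, ...) convention as the header checksum, not code
from the patch):

	/* sketch only: chunked crc32c equals the one-shot value */
	static u32 crc_over_two_chunks(const unsigned char *buf,
				       size_t len1, size_t len2)
	{
		u32 crc = ~0;

		crc = crc32c_le(crc, buf, len1);	/* first chunk */
		crc = crc32c_le(crc, buf + len1, len2);	/* second chunk */

		return ~crc;	/* == ~crc32c_le(~0, buf, len1 + len2) */
	}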
>>>> +
>>>> +static int ppl_write_empty_header(struct ppl_log *log)
>>>> +{
>>>> +	struct page *page;
>>>> +	struct ppl_header *pplhdr;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	int ret = 0;
>>>> +
>>>> +	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
>>>> +		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
>>>> +
>>>> +	page = alloc_page(GFP_NOIO | __GFP_ZERO);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	pplhdr = page_address(page);
>>>> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
>>>> +	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
>>>> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
>>>> +
>>>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>>>> +			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
>>>
>>> Write with FUA? Otherwise the empty header might not actually reach the
>>> disks.
>>
>> OK.
>>
>>>> +		md_error(rdev->mddev, rdev);
>>>> +		ret = -EIO;
>>>> +	}
>>>> +
>>>> +	__free_page(page);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int ppl_load_distributed(struct ppl_log *log)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	struct mddev *mddev = rdev->mddev;
>>>> +	struct page *page;
>>>> +	struct ppl_header *pplhdr;
>>>> +	u32 crc, crc_stored;
>>>> +	u32 signature;
>>>> +	int ret = 0;
>>>> +
>>>> +	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
>>>> +
>>>> +	/* read PPL header */
>>>> +	page = alloc_page(GFP_KERNEL);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>>>> +			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
>>>> +		md_error(mddev, rdev);
>>>> +		ret = -EIO;
>>>> +		goto out;
>>>> +	}
>>>> +	pplhdr = page_address(page);
>>>> +
>>>> +	/* check header validity */
>>>> +	crc_stored = le32_to_cpu(pplhdr->checksum);
>>>> +	pplhdr->checksum = 0;
>>>> +	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
>>>> +
>>>> +	if (crc_stored != crc) {
>>>> +		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
>>>> +			 __func__, crc_stored, crc);
>>>> +		ppl_conf->mismatch_count++;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	signature = le32_to_cpu(pplhdr->signature);
>>>> +
>>>> +	if (mddev->external) {
>>>> +		/*
>>>> +		 * For external metadata the header signature is set and
>>>> +		 * validated in userspace.
>>>> +		 */
>>>> +		ppl_conf->signature = signature;
>>>> +	} else if (ppl_conf->signature != signature) {
>>>> +		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
>>>> +			 __func__, signature, ppl_conf->signature);
>>>> +		ppl_conf->mismatch_count++;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	/* attempt to recover from log if we are starting a dirty array */
>>>> +	if (!mddev->pers && mddev->recovery_cp != MaxSector)
>>>> +		ret = ppl_recover(log, pplhdr);
>>>
>>> when could mddev->pers be NULL?
>>
>> When the array is inactive - we are starting the array with PPL, not
>> enabling PPL at runtime.
>
> Didn't follow this. Can you explain more?

mddev->pers is NULL if this is executed from md_run()/raid5_run().
That's when the array is just about to be started and we want to perform
the PPL recovery. If this is called from raid5_change_consistency_policy(),
mddev->pers is not NULL and recovery should not be attempted, because the
array is active.
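In other words, the shape of the condition is roughly this (an annotated
schematic of the line quoted above, not new code):

	if (!mddev->pers &&			/* raid5_run(): array not started yet */
	    mddev->recovery_cp != MaxSector)	/* and it was shut down dirty */
		ret = ppl_recover(log, pplhdr);

	/*
	 * When called from raid5_change_consistency_policy(), mddev->pers is
	 * already set, so the header is validated but the log is never
	 * replayed over an active array.
	 */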
Thanks,
Artur