On 03/01/2017 06:59 PM, Shaohua Li wrote:
> On Wed, Mar 01, 2017 at 05:14:18PM +0100, Artur Paszkiewicz wrote:
>> On 02/28/2017 02:12 AM, Shaohua Li wrote:
>>> On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
>>>> Load the log from each disk when starting the array and recover if the
>>>> array is dirty.
>>>>
>>>> The initial empty PPL is written by mdadm. When loading the log we
>>>> verify the header checksum and signature. For external metadata arrays
>>>> the signature is verified in userspace, so here we read it from the
>>>> header, verifying only that it matches on all disks, and use it later
>>>> when writing PPL.
>>>>
>>>> In addition to the header checksum, each header entry also contains a
>>>> checksum of its partial parity data. If the header is valid, recovery is
>>>> performed for each entry until an invalid entry is found. If the array
>>>> is not degraded and recovery using PPL fully succeeds, there is no need
>>>> to resync the array because data and parity will be consistent, so in
>>>> this case resync will be disabled.
>>>>
>>>> For compatibility with IMSM implementations on other systems, we
>>>> can't assume that the recovery data block size is always 4K. Writes
>>>> generated by MD raid5 don't have this issue, but when recovering PPL
>>>> written in other environments it is possible to have entries with
>>>> 512-byte sector granularity. The recovery code takes this into account,
>>>> as well as the logical sector size of the underlying drives.
>>>>
>>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
>>>> ---
>>>>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  drivers/md/raid5.c     |   5 +-
>>>>  2 files changed, 487 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>>>> index a00cabf1adf6..a7693353243a 100644
>>>> --- a/drivers/md/raid5-ppl.c
>>>> +++ b/drivers/md/raid5-ppl.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include <linux/blkdev.h>
>>>>  #include <linux/slab.h>
>>>>  #include <linux/crc32c.h>
>>>> +#include <linux/async_tx.h>
>>>>  #include <linux/raid/md_p.h>
>>>>  #include "md.h"
>>>>  #include "raid5.h"
>>>> @@ -84,6 +85,10 @@ struct ppl_conf {
>>>> 	mempool_t *io_pool;
>>>> 	struct bio_set *bs;
>>>> 	mempool_t *meta_pool;
>>>> +
>>>> +	/* used only for recovery */
>>>> +	int recovered_entries;
>>>> +	int mismatch_count;
>>>>  };
>>>>
>>>>  struct ppl_log {
>>>> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>>>> 		ppl_io_unit_finished(io);
>>>>  }
>>>>
>>>> +static void ppl_xor(int size, struct page *page1, struct page *page2,
>>>> +		    struct page *page_result)
>>>> +{
>>>> +	struct async_submit_ctl submit;
>>>> +	struct dma_async_tx_descriptor *tx;
>>>> +	struct page *xor_srcs[] = { page1, page2 };
>>>> +
>>>> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>>>> +			  NULL, NULL, NULL, NULL);
>>>> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
>>>> +
>>>> +	async_tx_quiesce(&tx);
>>>> +}
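A side note for readers not fluent in the async_tx API: ASYNC_TX_XOR_DROP_DST
excludes the destination page from the xor sources, and async_tx_quiesce()
waits for the transaction to finish, so ppl_xor() behaves like a synchronous
two-source xor. A rough plain-C sketch of the same operation, for illustration
only (the real code works on pages and may be offloaded, not a byte loop):

	/* illustrative sketch only: dst = src1 ^ src2, byte by byte */
	static void xor_sketch(int size, const unsigned char *src1,
			       const unsigned char *src2, unsigned char *dst)
	{
		int i;

		for (i = 0; i < size; i++)
			dst[i] = src1[i] ^ src2[i];
	}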
>>>> +
>>>> +/*
>>>> + * PPL recovery strategy: xor partial parity and data from all modified data
>>>> + * disks within a stripe and write the result as the new stripe parity. If all
>>>> + * stripe data disks are modified (full stripe write), no partial parity is
>>>> + * available, so just xor the data disks.
>>>> + *
>>>> + * Recovery of a PPL entry shall occur only if all modified data disks are
>>>> + * available and read from all of them succeeds.
>>>> + *
>>>> + * A PPL entry applies to a stripe; the partial parity size for an entry is at
>>>> + * most the size of the chunk. Examples of possible cases for a single entry:
>>>> + *
>>>> + * case 0: single data disk write:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>>>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = data_size
>>>> + *
>>>> + * case 1: more than one data disk write:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>>>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>>>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = data_size / modified_data_disks
>>>> + *
>>>> + * case 2: write to all data disks (also full stripe write):
>>>> + *   data0    data1    data2                parity
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * | ------ | ------ | ------ |           | (no change)        |
>>>> + * | -data- | -data- | -data- | --------> | xor all data       |
>>>> + * | ------ | ------ | ------ | --------> | (no change)        |
>>>> + * | ------ | ------ | ------ |           | (no change)        |
>>>> + * +--------+--------+--------+           +--------------------+
>>>> + * pp_size = 0
>>>> + *
>>>> + * The following cases are possible only in other implementations. The recovery
>>>> + * code can handle them, but they are not generated at runtime because they can
>>>> + * be reduced to cases 0, 1 and 2:
>>>> + *
>>>> + * case 3:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
>>>> + * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
>>>> + * | -data- | -data- | -data- | | -- | -> | xor all data       |
>>>> + * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * pp_size = chunk_size
>>>> + *
>>>> + * case 4:
>>>> + *   data0    data1    data2     ppl        parity
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
>>>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>>>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>>>> + * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
>>>> + * +--------+--------+--------+ +----+    +--------------------+
>>>> + * pp_size = chunk_size
>>>> + */
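One detail the diagrams take for granted: the partial parity stored in an
entry is the xor of the data chunks that were *not* modified by the write (for
read-modify-write it is computed as old parity xor old data of the modified
chunks, which comes to the same value). That is why xoring pp with the new
data of all modified disks yields parity consistent with the whole stripe. A
toy one-byte demonstration of case 1 (made-up values, not from the patch):

	#include <assert.h>

	int main(void)
	{
		unsigned char data0 = 0xa5, data1 = 0x3c;	/* new data */
		unsigned char data2 = 0x0f;			/* not modified */
		unsigned char pp = data2;	/* xor of unmodified chunks */
		unsigned char parity = data0 ^ data1 ^ pp;

		/* matches parity computed over the full stripe */
		assert(parity == (data0 ^ data1 ^ data2));
		return 0;
	}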
>>>> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>>>> +			     sector_t ppl_sector)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct mddev *mddev = ppl_conf->mddev;
>>>> +	struct r5conf *conf = mddev->private;
>>>> +	int block_size = ppl_conf->block_size;
>>>> +	struct page *pages;
>>>> +	struct page *page1;
>>>> +	struct page *page2;
>>>> +	sector_t r_sector_first;
>>>> +	sector_t r_sector_last;
>>>> +	int strip_sectors;
>>>> +	int data_disks;
>>>> +	int i;
>>>> +	int ret = 0;
>>>> +	char b[BDEVNAME_SIZE];
>>>> +	unsigned int pp_size = le32_to_cpu(e->pp_size);
>>>> +	unsigned int data_size = le32_to_cpu(e->data_size);
>>>> +
>>>> +	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
>>>> +
>>>> +	if ((pp_size >> 9) < conf->chunk_sectors) {
>>>> +		if (pp_size > 0) {
>>>> +			data_disks = data_size / pp_size;
>>>> +			strip_sectors = pp_size >> 9;
>>>> +		} else {
>>>> +			data_disks = conf->raid_disks - conf->max_degraded;
>>>> +			strip_sectors = (data_size >> 9) / data_disks;
>>>> +		}
>>>> +		r_sector_last = r_sector_first +
>>>> +				(data_disks - 1) * conf->chunk_sectors +
>>>> +				strip_sectors;
>>>> +	} else {
>>>> +		data_disks = conf->raid_disks - conf->max_degraded;
>>>> +		strip_sectors = conf->chunk_sectors;
>>>> +		r_sector_last = r_sector_first + (data_size >> 9);
>>>> +	}
>>>> +
>>>> +	pages = alloc_pages(GFP_KERNEL, 1);
>>>
>>> order != 0 allocations should be avoided if possible. It doesn't sound
>>> necessary here.
>>
>> OK, I will allocate and free them separately.
>>
>>>> +	if (!pages)
>>>> +		return -ENOMEM;
>>>> +	page1 = pages;
>>>> +	page2 = pages + 1;
>>>> +
>>>> +	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
>>>> +		 (unsigned long long)r_sector_first,
>>>> +		 (unsigned long long)r_sector_last);
>>>> +
>>>> +	/* if start and end are 4k aligned, use a 4k block */
>>>> +	if (block_size == 512 &&
>>>> +	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
>>>> +	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
>>>> +		block_size = STRIPE_SIZE;
>>>> +
>>>> +	/* iterate through blocks in strip */
>>>> +	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
>>>> +		bool update_parity = false;
>>>> +		sector_t parity_sector;
>>>> +		struct md_rdev *parity_rdev;
>>>> +		struct stripe_head sh;
>>>> +		int disk;
>>>> +		int indent = 0;
>>>> +
>>>> +		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
>>>> +		indent += 2;
>>>> +
>>>> +		memset(page_address(page1), 0, PAGE_SIZE);
>>>> +
>>>> +		/* iterate through data member disks */
>>>> +		for (disk = 0; disk < data_disks; disk++) {
>>>> +			int dd_idx;
>>>> +			struct md_rdev *rdev;
>>>> +			sector_t sector;
>>>> +			sector_t r_sector = r_sector_first + i +
>>>> +					    (disk * conf->chunk_sectors);
>>>> +
>>>> +			pr_debug("%s:%*s data member disk %d start\n",
>>>> +				 __func__, indent, "", disk);
>>>> +			indent += 2;
>>>> +
>>>> +			if (r_sector >= r_sector_last) {
>>>> +				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
>>>> +					 __func__, indent, "",
>>>> +					 (unsigned long long)r_sector);
>>>> +				indent -= 2;
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			update_parity = true;
>>>> +
>>>> +			/* map raid sector to member disk */
>>>> +			sector = raid5_compute_sector(conf, r_sector, 0,
>>>> +						      &dd_idx, NULL);
>>>> +			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
>>>> +				 __func__, indent, "",
>>>> +				 (unsigned long long)r_sector, dd_idx,
>>>> +				 (unsigned long long)sector);
>>>> +
>>>> +			rdev = conf->disks[dd_idx].rdev;
>>>> +			if (!rdev) {
>>>> +				pr_debug("%s:%*s data member disk %d missing\n",
>>>> +					 __func__, indent, "", dd_idx);
>>>> +				update_parity = false;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
>>>> +				 __func__, indent, "", bdevname(rdev->bdev, b),
>>>> +				 (unsigned long long)sector);
>>>> +			if (!sync_page_io(rdev, sector, block_size, page2,
>>>> +					  REQ_OP_READ, 0, false)) {
>>>> +				md_error(mddev, rdev);
>>>> +				pr_debug("%s:%*s read failed!\n", __func__,
>>>> +					 indent, "");
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			ppl_xor(block_size, page1, page2, page1);
>>>> +
>>>> +			indent -= 2;
>>>> +		}
>>>> +
>>>> +		if (!update_parity)
>>>> +			continue;
>>>> +
>>>> +		if (pp_size > 0) {
>>>> +			pr_debug("%s:%*s reading pp disk sector %llu\n",
>>>> +				 __func__, indent, "",
>>>> +				 (unsigned long long)(ppl_sector + i));
>>>> +			if (!sync_page_io(log->rdev,
>>>> +					  ppl_sector - log->rdev->data_offset + i,
>>>> +					  block_size, page2, REQ_OP_READ, 0,
>>>> +					  false)) {
>>>> +				pr_debug("%s:%*s read failed!\n", __func__,
>>>> +					 indent, "");
>>>> +				md_error(mddev, log->rdev);
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			ppl_xor(block_size, page1, page2, page1);
>>>> +		}
>>>> +
>>>> +		/* map raid sector to parity disk */
>>>> +		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
>>>> +						     0, &disk, &sh);
>>>> +		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
>>>> +		parity_rdev = conf->disks[sh.pd_idx].rdev;
>>>> +
>>>> +		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
>>>> +		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
>>>> +			 __func__, indent, "",
>>>> +			 (unsigned long long)parity_sector,
>>>> +			 bdevname(parity_rdev->bdev, b));
>>>> +		if (!sync_page_io(parity_rdev, parity_sector, block_size,
>>>> +				  page1, REQ_OP_WRITE, 0, false)) {
>>>> +			pr_debug("%s:%*s parity write error!\n", __func__,
>>>> +				 indent, "");
>>>> +			md_error(mddev, parity_rdev);
>>>> +			ret = -EIO;
>>>> +			goto out;
>>>> +		}
>>>> +	}
>>>> +out:
>>>> +	__free_pages(pages, 1);
>>>> +	return ret;
>>>> +}
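For anyone following along, the pp_size/data_size arithmetic at the top of
ppl_recover_entry() is probably the least obvious part. Here is the same
geometry logic pulled out into a standalone sketch (the struct and helper
names are made up for illustration, they are not in the patch):

	/* sketch only: how an entry's sizes map to recovery geometry */
	struct entry_geometry {
		int data_disks;		/* data disks the entry spans */
		int strip_sectors;	/* sectors to recover per disk */
	};

	static void entry_geometry(unsigned int pp_size, unsigned int data_size,
				   int chunk_sectors, int total_data_disks,
				   struct entry_geometry *g)
	{
		if ((pp_size >> 9) < chunk_sectors) {
			if (pp_size > 0) {
				/* cases 0 and 1: pp covers one strip per disk */
				g->data_disks = data_size / pp_size;
				g->strip_sectors = pp_size >> 9;
			} else {
				/* case 2: full stripe write, no pp stored */
				g->data_disks = total_data_disks;
				g->strip_sectors = (data_size >> 9) / g->data_disks;
			}
		} else {
			/* cases 3 and 4: pp spans a whole chunk */
			g->data_disks = total_data_disks;
			g->strip_sectors = chunk_sectors;
		}
	}

For example, a case 1 entry with data_size = 16KiB and pp_size = 8KiB yields
data_disks = 2 and strip_sectors = 16, i.e. two disks with 8KiB to recover
on each.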
>>>> +
>>>> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	struct mddev *mddev = rdev->mddev;
>>>> +	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
>>>> +	struct page *page;
>>>> +	int i;
>>>> +	int ret = 0;
>>>> +
>>>> +	page = alloc_page(GFP_KERNEL);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	/* iterate through all PPL entries saved */
>>>> +	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
>>>> +		struct ppl_header_entry *e = &pplhdr->entries[i];
>>>> +		u32 pp_size = le32_to_cpu(e->pp_size);
>>>> +		sector_t sector = ppl_sector;
>>>> +		int ppl_entry_sectors = pp_size >> 9;
>>>> +		u32 crc, crc_stored;
>>>> +
>>>> +		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
>>>> +			 __func__, rdev->raid_disk, i,
>>>> +			 (unsigned long long)ppl_sector, pp_size);
>>>> +
>>>> +		crc = ~0;
>>>> +		crc_stored = le32_to_cpu(e->checksum);
>>>> +
>>>> +		/* read partial parity for this entry and calculate its checksum */
>>>> +		while (pp_size) {
>>>> +			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
>>>> +
>>>> +			if (!sync_page_io(rdev, sector - rdev->data_offset,
>>>> +					  s, page, REQ_OP_READ, 0, false)) {
>>>> +				md_error(mddev, rdev);
>>>> +				ret = -EIO;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			crc = crc32c_le(crc, page_address(page), s);
>>>> +
>>>> +			pp_size -= s;
>>>> +			sector += s >> 9;
>>>> +		}
>>>> +
>>>> +		crc = ~crc;
>>>> +
>>>> +		if (crc != crc_stored) {
>>>> +			/*
>>>> +			 * Don't recover this entry if the checksum does not
>>>> +			 * match, but keep going and try to recover other
>>>> +			 * entries.
>>>> +			 */
>>>> +			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
>>>> +				 __func__, crc_stored, crc);
>>>> +			ppl_conf->mismatch_count++;
>>>> +		} else {
>>>> +			ret = ppl_recover_entry(log, e, ppl_sector);
>>>> +			if (ret)
>>>> +				goto out;
>>>> +			ppl_conf->recovered_entries++;
>>>> +		}
>>>> +
>>>> +		ppl_sector += ppl_entry_sectors;
>>>> +	}
>>>> +out:
>>>> +	__free_page(page);
>>>> +	return ret;
>>>> +}
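One property the entry loop above relies on: crc32c can be accumulated chunk
by chunk, so checksumming the partial parity one page at a time gives the same
result as a one-shot calculation over the whole entry. Schematically (a sketch
using the same ~crc32c_le(~0, ...) convention as the header checksum, not code
from the patch):

	/* sketch only: chunked crc32c equals the one-shot value */
	static u32 crc_over_two_chunks(const unsigned char *buf,
				       size_t len1, size_t len2)
	{
		u32 crc = ~0;

		crc = crc32c_le(crc, buf, len1);	/* first chunk */
		crc = crc32c_le(crc, buf + len1, len2);	/* second chunk */

		return ~crc;	/* == ~crc32c_le(~0, buf, len1 + len2) */
	}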
>>>> +
>>>> +static int ppl_write_empty_header(struct ppl_log *log)
>>>> +{
>>>> +	struct page *page;
>>>> +	struct ppl_header *pplhdr;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	int ret = 0;
>>>> +
>>>> +	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
>>>> +		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
>>>> +
>>>> +	page = alloc_page(GFP_NOIO | __GFP_ZERO);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	pplhdr = page_address(page);
>>>> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
>>>> +	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
>>>> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
>>>> +
>>>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>>>> +			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
>>>
>>> Write with FUA? Otherwise the empty header might not actually reach the
>>> disks.
>>
>> OK.
>>
>>>> +		md_error(rdev->mddev, rdev);
>>>> +		ret = -EIO;
>>>> +	}
>>>> +
>>>> +	__free_page(page);
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int ppl_load_distributed(struct ppl_log *log)
>>>> +{
>>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>>> +	struct md_rdev *rdev = log->rdev;
>>>> +	struct mddev *mddev = rdev->mddev;
>>>> +	struct page *page;
>>>> +	struct ppl_header *pplhdr;
>>>> +	u32 crc, crc_stored;
>>>> +	u32 signature;
>>>> +	int ret = 0;
>>>> +
>>>> +	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
>>>> +
>>>> +	/* read PPL header */
>>>> +	page = alloc_page(GFP_KERNEL);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>>>> +			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
>>>> +		md_error(mddev, rdev);
>>>> +		ret = -EIO;
>>>> +		goto out;
>>>> +	}
>>>> +	pplhdr = page_address(page);
>>>> +
>>>> +	/* check header validity */
>>>> +	crc_stored = le32_to_cpu(pplhdr->checksum);
>>>> +	pplhdr->checksum = 0;
>>>> +	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
>>>> +
>>>> +	if (crc_stored != crc) {
>>>> +		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
>>>> +			 __func__, crc_stored, crc);
>>>> +		ppl_conf->mismatch_count++;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	signature = le32_to_cpu(pplhdr->signature);
>>>> +
>>>> +	if (mddev->external) {
>>>> +		/*
>>>> +		 * For external metadata the header signature is set and
>>>> +		 * validated in userspace.
>>>> +		 */
>>>> +		ppl_conf->signature = signature;
>>>> +	} else if (ppl_conf->signature != signature) {
>>>> +		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
>>>> +			 __func__, signature, ppl_conf->signature);
>>>> +		ppl_conf->mismatch_count++;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	/* attempt to recover from log if we are starting a dirty array */
>>>> +	if (!mddev->pers && mddev->recovery_cp != MaxSector)
>>>> +		ret = ppl_recover(log, pplhdr);
>>>
>>> when could mddev->pers be NULL?
>>
>> When the array is inactive - we are starting the array with PPL, not
>> enabling PPL at runtime.
>
> Didn't follow this. Can you explain more?

mddev->pers is NULL if this is executed from md_run()/raid5_run().
That's when the array is just about to be started and we want to perform
the PPL recovery. If this is called from raid5_change_consistency_policy(),
mddev->pers is not NULL and recovery should not be attempted, because the
array is active.
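In other words, the shape of the condition is roughly this (an annotated
schematic of the line quoted above, not new code):

	if (!mddev->pers &&			/* raid5_run(): array not started yet */
	    mddev->recovery_cp != MaxSector)	/* and it was shut down dirty */
		ret = ppl_recover(log, pplhdr);

	/*
	 * When called from raid5_change_consistency_policy(), mddev->pers is
	 * already set, so the header is validated but the log is never
	 * replayed over an active array.
	 */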
Thanks,
Artur