On 02/28/2017 02:12 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
>> Load the log from each disk when starting the array and recover if the
>> array is dirty.
>>
>> The initial empty PPL is written by mdadm. When loading the log we
>> verify the header checksum and signature. For external metadata arrays
>> the signature is verified in userspace, so here we read it from the
>> header, verifying only that it matches on all disks, and use it later
>> when writing PPL.
>>
>> In addition to the header checksum, each header entry also contains a
>> checksum of its partial parity data. If the header is valid, recovery is
>> performed for each entry until an invalid entry is found. If the array
>> is not degraded and recovery using PPL fully succeeds, there is no need
>> to resync the array because data and parity will be consistent, so in
>> this case resync will be disabled.
>>
>> For compatibility with IMSM implementations on other systems, we can't
>> assume that the recovery data block size is always 4K. Writes generated
>> by MD raid5 don't have this issue, but when recovering PPL written in
>> other environments it is possible to have entries with 512-byte sector
>> granularity. The recovery code takes this into account, as well as the
>> logical sector size of the underlying drives.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
>> ---
>>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c     |   5 +-
>>  2 files changed, 487 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> index a00cabf1adf6..a7693353243a 100644
>> --- a/drivers/md/raid5-ppl.c
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -16,6 +16,7 @@
>>  #include <linux/blkdev.h>
>>  #include <linux/slab.h>
>>  #include <linux/crc32c.h>
>> +#include <linux/async_tx.h>
>>  #include <linux/raid/md_p.h>
>>  #include "md.h"
>>  #include "raid5.h"
>> @@ -84,6 +85,10 @@ struct ppl_conf {
>>          mempool_t *io_pool;
>>          struct bio_set *bs;
>>          mempool_t *meta_pool;
>> +
>> +        /* used only for recovery */
>> +        int recovered_entries;
>> +        int mismatch_count;
>>  };
>>
>>  struct ppl_log {
>> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>>                  ppl_io_unit_finished(io);
>>  }
>>
>> +static void ppl_xor(int size, struct page *page1, struct page *page2,
>> +                    struct page *page_result)
>> +{
>> +        struct async_submit_ctl submit;
>> +        struct dma_async_tx_descriptor *tx;
>> +        struct page *xor_srcs[] = { page1, page2 };
>> +
>> +        init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>> +                          NULL, NULL, NULL, NULL);
>> +        tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
>> +
>> +        async_tx_quiesce(&tx);
>> +}
>> +
>> +/*
>> + * PPL recovery strategy: xor partial parity and data from all modified data
>> + * disks within a stripe and write the result as the new stripe parity. If all
>> + * stripe data disks are modified (full stripe write), no partial parity is
>> + * available, so just xor the data disks.
>> + *
>> + * Recovery of a PPL entry shall occur only if all modified data disks are
>> + * available and read from all of them succeeds.
>> + *
>> + * A PPL entry applies to a stripe; partial parity size for an entry is at most
>> + * the size of the chunk. Examples of possible cases for a single entry:
>> + *
>> + * case 0: single data disk write:
>> + *   data0    data1    data2     ppl        parity
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * | ------ | ------ | ------ | |    |    | (no change)        |
>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>> + * | ------ | ------ | ------ | |    |    | (no change)        |
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * pp_size = data_size
>> + *
>> + * case 1: more than one data disk write:
>> + *   data0    data1    data2     ppl        parity
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * | ------ | ------ | ------ | |    |    | (no change)        |
>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>> + * | ------ | ------ | ------ | |    |    | (no change)        |
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * pp_size = data_size / modified_data_disks
>> + *
>> + * case 2: write to all data disks (also full stripe write):
>> + *   data0    data1    data2                parity
>> + * +--------+--------+--------+            +--------------------+
>> + * | ------ | ------ | ------ |            | (no change)        |
>> + * | -data- | -data- | -data- | ---------> | xor all data       |
>> + * | ------ | ------ | ------ | ---------> | (no change)        |
>> + * | ------ | ------ | ------ |            | (no change)        |
>> + * +--------+--------+--------+            +--------------------+
>> + * pp_size = 0
>> + *
>> + * The following cases are possible only in other implementations. The recovery
>> + * code can handle them, but they are not generated at runtime because they can
>> + * be reduced to cases 0, 1 and 2:
>> + *
>> + * case 3:
>> + *   data0    data1    data2     ppl          parity
>> + * +--------+--------+--------+ +----+      +--------------------+
>> + * | ------ | -data- | -data- | | pp |      | data1 ^ data2 ^ pp |
>> + * | ------ | -data- | -data- | | pp |  ->  | data1 ^ data2 ^ pp |
>> + * | -data- | -data- | -data- | | -- |  ->  | xor all data       |
>> + * | -data- | -data- | ------ | | pp |      | data0 ^ data1 ^ pp |
>> + * +--------+--------+--------+ +----+      +--------------------+
>> + * pp_size = chunk_size
>> + *
>> + * case 4:
>> + *   data0    data1    data2     ppl          parity
>> + * +--------+--------+--------+ +----+      +--------------------+
>> + * | ------ | -data- | ------ | | pp |      | data1 ^ pp         |
>> + * | ------ | ------ | ------ | | -- |  ->  | (no change)        |
>> + * | ------ | ------ | ------ | | -- |  ->  | (no change)        |
>> + * | -data- | ------ | ------ | | pp |      | data0 ^ pp         |
>> + * +--------+--------+--------+ +----+      +--------------------+
>> + * pp_size = chunk_size
>> + */
>> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>> +                             sector_t ppl_sector)
>> +{
>> +        struct ppl_conf *ppl_conf = log->ppl_conf;
>> +        struct mddev *mddev = ppl_conf->mddev;
>> +        struct r5conf *conf = mddev->private;
>> +        int block_size = ppl_conf->block_size;
>> +        struct page *pages;
>> +        struct page *page1;
>> +        struct page *page2;
>> +        sector_t r_sector_first;
>> +        sector_t r_sector_last;
>> +        int strip_sectors;
>> +        int data_disks;
>> +        int i;
>> +        int ret = 0;
>> +        char b[BDEVNAME_SIZE];
>> +        unsigned int pp_size = le32_to_cpu(e->pp_size);
>> +        unsigned int data_size = le32_to_cpu(e->data_size);
>> +
>> +        r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
>> +
>> +        if ((pp_size >> 9) < conf->chunk_sectors) {
>> +                if (pp_size > 0) {
>> +                        data_disks = data_size / pp_size;
>> +                        strip_sectors = pp_size >> 9;
>> +                } else {
>> +                        data_disks = conf->raid_disks - conf->max_degraded;
>> +                        strip_sectors = (data_size >> 9) / data_disks;
>> +                }
>> +                r_sector_last = r_sector_first +
>> +                                (data_disks - 1) * conf->chunk_sectors +
>> +                                strip_sectors;
>> +        } else {
>> +                data_disks = conf->raid_disks - conf->max_degraded;
>> +                strip_sectors = conf->chunk_sectors;
>> +                r_sector_last = r_sector_first + (data_size >> 9);
>> +        }
>> +
>> +        pages = alloc_pages(GFP_KERNEL, 1);
> 
> order != 0 allocation should be avoided if possible. It doesn't sound
> necessary here.

OK, I will allocate and free them separately.
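Roughly like this, I think (just a sketch of the intended change, not the
actual follow-up patch):

        /* at the top of ppl_recover_entry(), instead of the order-1 alloc_pages(): */
        page1 = alloc_page(GFP_KERNEL);
        page2 = alloc_page(GFP_KERNEL);
        if (!page1 || !page2) {
                ret = -ENOMEM;
                goto out;
        }

        /* ... and at the end, instead of __free_pages(pages, 1): */
out:
        if (page1)
                __free_page(page1);
        if (page2)
                __free_page(page2);
        return ret;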
>> +        if (!pages)
>> +                return -ENOMEM;
>> +        page1 = pages;
>> +        page2 = pages + 1;
>> +
>> +        pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
>> +                 (unsigned long long)r_sector_first,
>> +                 (unsigned long long)r_sector_last);
>> +
>> +        /* if start and end are 4k aligned, use a 4k block */
>> +        if (block_size == 512 &&
>> +            (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
>> +            (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
>> +                block_size = STRIPE_SIZE;
>> +
>> +        /* iterate through blocks in strip */
>> +        for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
>> +                bool update_parity = false;
>> +                sector_t parity_sector;
>> +                struct md_rdev *parity_rdev;
>> +                struct stripe_head sh;
>> +                int disk;
>> +                int indent = 0;
>> +
>> +                pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
>> +                indent += 2;
>> +
>> +                memset(page_address(page1), 0, PAGE_SIZE);
>> +
>> +                /* iterate through data member disks */
>> +                for (disk = 0; disk < data_disks; disk++) {
>> +                        int dd_idx;
>> +                        struct md_rdev *rdev;
>> +                        sector_t sector;
>> +                        sector_t r_sector = r_sector_first + i +
>> +                                            (disk * conf->chunk_sectors);
>> +
>> +                        pr_debug("%s:%*s data member disk %d start\n",
>> +                                 __func__, indent, "", disk);
>> +                        indent += 2;
>> +
>> +                        if (r_sector >= r_sector_last) {
>> +                                pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
>> +                                         __func__, indent, "",
>> +                                         (unsigned long long)r_sector);
>> +                                indent -= 2;
>> +                                continue;
>> +                        }
>> +
>> +                        update_parity = true;
>> +
>> +                        /* map raid sector to member disk */
>> +                        sector = raid5_compute_sector(conf, r_sector, 0,
>> +                                                      &dd_idx, NULL);
>> +                        pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
>> +                                 __func__, indent, "",
>> +                                 (unsigned long long)r_sector, dd_idx,
>> +                                 (unsigned long long)sector);
>> +
>> +                        rdev = conf->disks[dd_idx].rdev;
>> +                        if (!rdev) {
>> +                                pr_debug("%s:%*s data member disk %d missing\n",
>> +                                         __func__, indent, "", dd_idx);
>> +                                update_parity = false;
>> +                                break;
>> +                        }
>> +
>> +                        pr_debug("%s:%*s reading data member disk %s sector %llu\n",
>> +                                 __func__, indent, "", bdevname(rdev->bdev, b),
>> +                                 (unsigned long long)sector);
>> +                        if (!sync_page_io(rdev, sector, block_size, page2,
>> +                                          REQ_OP_READ, 0, false)) {
>> +                                md_error(mddev, rdev);
>> +                                pr_debug("%s:%*s read failed!\n", __func__,
>> +                                         indent, "");
>> +                                ret = -EIO;
>> +                                goto out;
>> +                        }
>> +
>> +                        ppl_xor(block_size, page1, page2, page1);
>> +
>> +                        indent -= 2;
>> +                }
>> +
>> +                if (!update_parity)
>> +                        continue;
>> +
>> +                if (pp_size > 0) {
>> +                        pr_debug("%s:%*s reading pp disk sector %llu\n",
>> +                                 __func__, indent, "",
>> +                                 (unsigned long long)(ppl_sector + i));
>> +                        if (!sync_page_io(log->rdev,
>> +                                          ppl_sector - log->rdev->data_offset + i,
>> +                                          block_size, page2, REQ_OP_READ, 0,
>> +                                          false)) {
>> +                                pr_debug("%s:%*s read failed!\n", __func__,
>> +                                         indent, "");
>> +                                md_error(mddev, log->rdev);
>> +                                ret = -EIO;
>> +                                goto out;
>> +                        }
>> +
>> +                        ppl_xor(block_size, page1, page2, page1);
>> +                }
>> +
>> +                /* map raid sector to parity disk */
>> +                parity_sector = raid5_compute_sector(conf, r_sector_first + i,
>> +                                                     0, &disk, &sh);
>> +                BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
>> +                parity_rdev = conf->disks[sh.pd_idx].rdev;
>> +
>> +                BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
>> +                pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
>> +                         __func__, indent, "",
>> +                         (unsigned long long)parity_sector,
>> +                         bdevname(parity_rdev->bdev, b));
>> +                if (!sync_page_io(parity_rdev, parity_sector, block_size,
>> +                                  page1, REQ_OP_WRITE, 0, false)) {
>> +                        pr_debug("%s:%*s parity write error!\n", __func__,
>> +                                 indent, "");
>> +                        md_error(mddev, parity_rdev);
>> +                        ret = -EIO;
>> +                        goto out;
>> +                }
>> +        }
>> +out:
>> +        __free_pages(pages, 1);
>> +        return ret;
>> +}
>> +
>> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
>> +{
>> +        struct ppl_conf *ppl_conf = log->ppl_conf;
>> +        struct md_rdev *rdev = log->rdev;
>> +        struct mddev *mddev = rdev->mddev;
>> +        sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
>> +        struct page *page;
>> +        int i;
>> +        int ret = 0;
>> +
>> +        page = alloc_page(GFP_KERNEL);
>> +        if (!page)
>> +                return -ENOMEM;
>> +
>> +        /* iterate through all PPL entries saved */
>> +        for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
>> +                struct ppl_header_entry *e = &pplhdr->entries[i];
>> +                u32 pp_size = le32_to_cpu(e->pp_size);
>> +                sector_t sector = ppl_sector;
>> +                int ppl_entry_sectors = pp_size >> 9;
>> +                u32 crc, crc_stored;
>> +
>> +                pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
>> +                         __func__, rdev->raid_disk, i,
>> +                         (unsigned long long)ppl_sector, pp_size);
>> +
>> +                crc = ~0;
>> +                crc_stored = le32_to_cpu(e->checksum);
>> +
>> +                /* read partial parity for this entry and calculate its checksum */
>> +                while (pp_size) {
>> +                        int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
>> +
>> +                        if (!sync_page_io(rdev, sector - rdev->data_offset,
>> +                                          s, page, REQ_OP_READ, 0, false)) {
>> +                                md_error(mddev, rdev);
>> +                                ret = -EIO;
>> +                                goto out;
>> +                        }
>> +
>> +                        crc = crc32c_le(crc, page_address(page), s);
>> +
>> +                        pp_size -= s;
>> +                        sector += s >> 9;
>> +                }
>> +
>> +                crc = ~crc;
>> +
>> +                if (crc != crc_stored) {
>> +                        /*
>> +                         * Don't recover this entry if the checksum does not
>> +                         * match, but keep going and try to recover other
>> +                         * entries.
>> +                         */
>> +                        pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
>> +                                 __func__, crc_stored, crc);
>> +                        ppl_conf->mismatch_count++;
>> +                } else {
>> +                        ret = ppl_recover_entry(log, e, ppl_sector);
>> +                        if (ret)
>> +                                goto out;
>> +                        ppl_conf->recovered_entries++;
>> +                }
>> +
>> +                ppl_sector += ppl_entry_sectors;
>> +        }
>> +out:
>> +        __free_page(page);
>> +        return ret;
>> +}
>> +
>> +static int ppl_write_empty_header(struct ppl_log *log)
>> +{
>> +        struct page *page;
>> +        struct ppl_header *pplhdr;
>> +        struct md_rdev *rdev = log->rdev;
>> +        int ret = 0;
>> +
>> +        pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
>> +                 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
>> +
>> +        page = alloc_page(GFP_NOIO | __GFP_ZERO);
>> +        if (!page)
>> +                return -ENOMEM;
>> +
>> +        pplhdr = page_address(page);
>> +        memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
>> +        pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
>> +        pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
>> +
>> +        if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>> +                          PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
> 
> write with FUA? Otherwise, the empty header might not make it down to the
> disks.

OK.
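Probably something along these lines (just a sketch; assuming passing
REQ_FUA through sync_page_io()'s op_flags argument is what you mean):

        if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
                          PPL_HEADER_SIZE, page, REQ_OP_WRITE, REQ_FUA,
                          false)) {
                md_error(rdev->mddev, rdev);
                ret = -EIO;
        }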
>> +                md_error(rdev->mddev, rdev);
>> +                ret = -EIO;
>> +        }
>> +
>> +        __free_page(page);
>> +        return ret;
>> +}
>> +
>> +static int ppl_load_distributed(struct ppl_log *log)
>> +{
>> +        struct ppl_conf *ppl_conf = log->ppl_conf;
>> +        struct md_rdev *rdev = log->rdev;
>> +        struct mddev *mddev = rdev->mddev;
>> +        struct page *page;
>> +        struct ppl_header *pplhdr;
>> +        u32 crc, crc_stored;
>> +        u32 signature;
>> +        int ret = 0;
>> +
>> +        pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
>> +
>> +        /* read PPL header */
>> +        page = alloc_page(GFP_KERNEL);
>> +        if (!page)
>> +                return -ENOMEM;
>> +
>> +        if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>> +                          PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
>> +                md_error(mddev, rdev);
>> +                ret = -EIO;
>> +                goto out;
>> +        }
>> +        pplhdr = page_address(page);
>> +
>> +        /* check header validity */
>> +        crc_stored = le32_to_cpu(pplhdr->checksum);
>> +        pplhdr->checksum = 0;
>> +        crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
>> +
>> +        if (crc_stored != crc) {
>> +                pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
>> +                         __func__, crc_stored, crc);
>> +                ppl_conf->mismatch_count++;
>> +                goto out;
>> +        }
>> +
>> +        signature = le32_to_cpu(pplhdr->signature);
>> +
>> +        if (mddev->external) {
>> +                /*
>> +                 * For external metadata the header signature is set and
>> +                 * validated in userspace.
>> +                 */
>> +                ppl_conf->signature = signature;
>> +        } else if (ppl_conf->signature != signature) {
>> +                pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
>> +                         __func__, signature, ppl_conf->signature);
>> +                ppl_conf->mismatch_count++;
>> +                goto out;
>> +        }
>> +
>> +        /* attempt to recover from log if we are starting a dirty array */
>> +        if (!mddev->pers && mddev->recovery_cp != MaxSector)
>> +                ret = ppl_recover(log, pplhdr);
> 
> when could mddev->pers be null?

When the array is inactive, i.e. we are starting the array with PPL, not
enabling PPL at runtime.

> we need to flush the disk cache here. I know you assume the disks don't have
> the write cache enabled yet, but it's simple to do a flush here.

You are right, I will add it.
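For example, something like this after the recovery attempt in
ppl_load_distributed() (sketch only; assuming blkdev_issue_flush() is
acceptable in this context):

        /* make sure the parity writes done during recovery hit stable storage */
        if (!ret)
                ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL, NULL);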
Thanks,
Artur
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html