On Wed, Mar 01, 2017 at 05:14:18PM +0100, Artur Paszkiewicz wrote:
> On 02/28/2017 02:12 AM, Shaohua Li wrote:
> > On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
> >> Load the log from each disk when starting the array and recover if the
> >> array is dirty.
> >>
> >> The initial empty PPL is written by mdadm. When loading the log we
> >> verify the header checksum and signature. For external metadata arrays
> >> the signature is verified in userspace, so here we read it from the
> >> header, only checking that it matches on all disks, and use it later
> >> when writing the PPL.
> >>
> >> In addition to the header checksum, each header entry also contains a
> >> checksum of its partial parity data. If the header is valid, recovery is
> >> performed for each entry until an invalid entry is found. If the array
> >> is not degraded and recovery using PPL fully succeeds, there is no need
> >> to resync the array because data and parity will be consistent, so in
> >> this case resync will be disabled.
> >>
> >> For compatibility with IMSM implementations on other systems, we can't
> >> assume that the recovery data block size is always 4K. Writes generated
> >> by MD raid5 don't have this issue, but when recovering PPL written in
> >> other environments it is possible to have entries with 512-byte sector
> >> granularity. The recovery code takes this into account, as well as the
> >> logical sector size of the underlying drives.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
> >> ---
> >>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  drivers/md/raid5.c     |   5 +-
> >>  2 files changed, 487 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >> index a00cabf1adf6..a7693353243a 100644
> >> --- a/drivers/md/raid5-ppl.c
> >> +++ b/drivers/md/raid5-ppl.c
> >> @@ -16,6 +16,7 @@
> >>  #include <linux/blkdev.h>
> >>  #include <linux/slab.h>
> >>  #include <linux/crc32c.h>
> >> +#include <linux/async_tx.h>
> >>  #include <linux/raid/md_p.h>
> >>  #include "md.h"
> >>  #include "raid5.h"
> >> @@ -84,6 +85,10 @@ struct ppl_conf {
> >>  	mempool_t *io_pool;
> >>  	struct bio_set *bs;
> >>  	mempool_t *meta_pool;
> >> +
> >> +	/* used only for recovery */
> >> +	int recovered_entries;
> >> +	int mismatch_count;
> >>  };
> >>
> >>  struct ppl_log {
> >> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
> >>  		ppl_io_unit_finished(io);
> >>  }
> >>
> >> +static void ppl_xor(int size, struct page *page1, struct page *page2,
> >> +		    struct page *page_result)
> >> +{
> >> +	struct async_submit_ctl submit;
> >> +	struct dma_async_tx_descriptor *tx;
> >> +	struct page *xor_srcs[] = { page1, page2 };
> >> +
> >> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
> >> +			  NULL, NULL, NULL, NULL);
> >> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
> >> +
> >> +	async_tx_quiesce(&tx);
> >> +}
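
Aside for readers following the thread: because ASYNC_TX_XOR_DROP_DST is
set, the destination page is not itself an XOR source, so the helper above
is functionally just page_result = page1 ^ page2, possibly offloaded to a
DMA engine. A minimal synchronous sketch of the same operation, ignoring
the offload path (illustrative only, not part of the patch):

	/* synchronous equivalent of ppl_xor(), no async_tx offload */
	static void ppl_xor_sync(int size, struct page *page1,
				 struct page *page2, struct page *page_result)
	{
		u32 *s1 = page_address(page1);
		u32 *s2 = page_address(page2);
		u32 *d = page_address(page_result);
		int i;

		for (i = 0; i < size / sizeof(u32); i++)
			d[i] = s1[i] ^ s2[i];
	}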
> >> +
> >> +/*
> >> + * PPL recovery strategy: xor partial parity and data from all modified data
> >> + * disks within a stripe and write the result as the new stripe parity. If
> >> + * all stripe data disks are modified (full stripe write), no partial parity
> >> + * is available, so just xor the data disks.
> >> + *
> >> + * Recovery of a PPL entry shall occur only if all modified data disks are
> >> + * available and read from all of them succeeds.
> >> + *
> >> + * A PPL entry applies to a stripe; partial parity size for an entry is at
> >> + * most the size of the chunk. Examples of possible cases for a single entry:
> >> + *
> >> + * case 0: single data disk write:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+           +--------------------+
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
> >> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * +--------+--------+--------+           +--------------------+
> >> + * pp_size = data_size
> >> + *
> >> + * case 1: more than one data disk write:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+           +--------------------+
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
> >> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * +--------+--------+--------+           +--------------------+
> >> + * pp_size = data_size / modified_data_disks
> >> + *
> >> + * case 2: write to all data disks (also full stripe write):
> >> + *   data0    data1    data2                parity
> >> + * +--------+--------+--------+            +--------------------+
> >> + * | ------ | ------ | ------ |            | (no change)        |
> >> + * | -data- | -data- | -data- | ---------> | xor all data       |
> >> + * | ------ | ------ | ------ | ---------> | (no change)        |
> >> + * | ------ | ------ | ------ |            | (no change)        |
> >> + * +--------+--------+--------+            +--------------------+
> >> + * pp_size = 0
> >> + *
> >> + * The following cases are possible only in other implementations. The
> >> + * recovery code can handle them, but they are not generated at runtime
> >> + * because they can be reduced to cases 0, 1 and 2:
> >> + *
> >> + * case 3:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
> >> + * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
> >> + * | -data- | -data- | -data- | | -- | -> | xor all data       |
> >> + * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * pp_size = chunk_size
> >> + *
> >> + * case 4:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
> >> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
> >> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
> >> + * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * pp_size = chunk_size
> >> + */
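
To make the entry geometry concrete, here is a worked example for case 1,
with made-up numbers (these are not from the patch):

	/*
	 * chunk_sectors = 256 (128 KiB chunks), and an entry with
	 * pp_size = 8 KiB, data_size = 16 KiB. Then, per the code below:
	 *
	 *   data_disks    = data_size / pp_size = 2 modified disks
	 *   strip_sectors = pp_size >> 9        = 16
	 *   r_sector_last = r_sector_first + (2 - 1) * 256 + 16
	 *                 = r_sector_first + 272
	 */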
> >> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
> >> +			     sector_t ppl_sector)
> >> +{
> >> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >> +	struct mddev *mddev = ppl_conf->mddev;
> >> +	struct r5conf *conf = mddev->private;
> >> +	int block_size = ppl_conf->block_size;
> >> +	struct page *pages;
> >> +	struct page *page1;
> >> +	struct page *page2;
> >> +	sector_t r_sector_first;
> >> +	sector_t r_sector_last;
> >> +	int strip_sectors;
> >> +	int data_disks;
> >> +	int i;
> >> +	int ret = 0;
> >> +	char b[BDEVNAME_SIZE];
> >> +	unsigned int pp_size = le32_to_cpu(e->pp_size);
> >> +	unsigned int data_size = le32_to_cpu(e->data_size);
> >> +
> >> +	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
> >> +
> >> +	if ((pp_size >> 9) < conf->chunk_sectors) {
> >> +		if (pp_size > 0) {
> >> +			data_disks = data_size / pp_size;
> >> +			strip_sectors = pp_size >> 9;
> >> +		} else {
> >> +			data_disks = conf->raid_disks - conf->max_degraded;
> >> +			strip_sectors = (data_size >> 9) / data_disks;
> >> +		}
> >> +		r_sector_last = r_sector_first +
> >> +				(data_disks - 1) * conf->chunk_sectors +
> >> +				strip_sectors;
> >> +	} else {
> >> +		data_disks = conf->raid_disks - conf->max_degraded;
> >> +		strip_sectors = conf->chunk_sectors;
> >> +		r_sector_last = r_sector_first + (data_size >> 9);
> >> +	}
> >> +
> >> +	pages = alloc_pages(GFP_KERNEL, 1);
> >
> > An order != 0 allocation should be avoided if possible, and it doesn't
> > look necessary here.
> 
> OK, I will allocate and free them separately.
> 
> >> +	if (!pages)
> >> +		return -ENOMEM;
> >> +	page1 = pages;
> >> +	page2 = pages + 1;
> >> +
> >> +	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
> >> +		 (unsigned long long)r_sector_first,
> >> +		 (unsigned long long)r_sector_last);
> >> +
> >> +	/* if start and end are 4k aligned, use a 4k block */
> >> +	if (block_size == 512 &&
> >> +	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
> >> +	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
> >> +		block_size = STRIPE_SIZE;
> >> +
> >> +	/* iterate through blocks in strip */
> >> +	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
> >> +		bool update_parity = false;
> >> +		sector_t parity_sector;
> >> +		struct md_rdev *parity_rdev;
> >> +		struct stripe_head sh;
> >> +		int disk;
> >> +		int indent = 0;
> >> +
> >> +		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
> >> +		indent += 2;
> >> +
> >> +		memset(page_address(page1), 0, PAGE_SIZE);
> >> +
> >> +		/* iterate through data member disks */
> >> +		for (disk = 0; disk < data_disks; disk++) {
> >> +			int dd_idx;
> >> +			struct md_rdev *rdev;
> >> +			sector_t sector;
> >> +			sector_t r_sector = r_sector_first + i +
> >> +					    (disk * conf->chunk_sectors);
> >> +
> >> +			pr_debug("%s:%*s data member disk %d start\n",
> >> +				 __func__, indent, "", disk);
> >> +			indent += 2;
> >> +
> >> +			if (r_sector >= r_sector_last) {
> >> +				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
> >> +					 __func__, indent, "",
> >> +					 (unsigned long long)r_sector);
> >> +				indent -= 2;
> >> +				continue;
> >> +			}
> >> +
> >> +			update_parity = true;
> >> +
> >> +			/* map raid sector to member disk */
> >> +			sector = raid5_compute_sector(conf, r_sector, 0,
> >> +						      &dd_idx, NULL);
> >> +			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
> >> +				 __func__, indent, "",
> >> +				 (unsigned long long)r_sector, dd_idx,
> >> +				 (unsigned long long)sector);
> >> +
> >> +			rdev = conf->disks[dd_idx].rdev;
> >> +			if (!rdev) {
> >> +				pr_debug("%s:%*s data member disk %d missing\n",
> >> +					 __func__, indent, "", dd_idx);
> >> +				update_parity = false;
> >> +				break;
> >> +			}
> >> +
> >> +			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
> >> +				 __func__, indent, "", bdevname(rdev->bdev, b),
> >> +				 (unsigned long long)sector);
> >> +			if (!sync_page_io(rdev, sector, block_size, page2,
> >> +					  REQ_OP_READ, 0, false)) {
> >> +				md_error(mddev, rdev);
> >> +				pr_debug("%s:%*s read failed!\n", __func__,
> >> +					 indent, "");
> >> +				ret = -EIO;
> >> +				goto out;
> >> +			}
> >> +
> >> +			ppl_xor(block_size, page1, page2, page1);
> >> +
> >> +			indent -= 2;
> >> +		}
> >> +
> >> +		if (!update_parity)
> >> +			continue;
> >> +
> >> +		if (pp_size > 0) {
> >> +			pr_debug("%s:%*s reading pp disk sector %llu\n",
> >> +				 __func__, indent, "",
> >> +				 (unsigned long long)(ppl_sector + i));
> >> +			if (!sync_page_io(log->rdev,
> >> +					  ppl_sector - log->rdev->data_offset + i,
> >> +					  block_size, page2, REQ_OP_READ, 0,
> >> +					  false)) {
> >> +				pr_debug("%s:%*s read failed!\n", __func__,
> >> +					 indent, "");
> >> +				md_error(mddev, log->rdev);
> >> +				ret = -EIO;
> >> +				goto out;
> >> +			}
> >> +
> >> +			ppl_xor(block_size, page1, page2, page1);
> >> +		}
> >> +
> >> +		/* map raid sector to parity disk */
> >> +		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
> >> +						     0, &disk, &sh);
> >> +		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
> >> +		parity_rdev = conf->disks[sh.pd_idx].rdev;
> >> +
> >> +		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
> >> +		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
> >> +			 __func__, indent, "",
> >> +			 (unsigned long long)parity_sector,
> >> +			 bdevname(parity_rdev->bdev, b));
> >> +		if (!sync_page_io(parity_rdev, parity_sector, block_size,
> >> +				  page1, REQ_OP_WRITE, 0, false)) {
> >> +			pr_debug("%s:%*s parity write error!\n", __func__,
> >> +				 indent, "");
> >> +			md_error(mddev, parity_rdev);
> >> +			ret = -EIO;
> >> +			goto out;
> >> +		}
> >> +	}
> >> +out:
> >> +	__free_pages(pages, 1);
> >> +	return ret;
> >> +}
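
A quick worked illustration of the 4k block promotion in the function
above, again with made-up numbers (STRIPE_SECTORS is 8, i.e. 4096 >> 9):

	/*
	 * block_size = 512, r_sector_first = 2048, r_sector_last = 2064:
	 * (2048 & 7) == 0 and (2064 & 7) == 0, so block_size becomes 4096
	 * and the strip loop advances 8 sectors per iteration instead of 1.
	 * With r_sector_first = 2052, (2052 & 7) != 0, so recovery stays
	 * at 512-byte granularity.
	 */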
> >> +
> >> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
> >> +{
> >> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >> +	struct md_rdev *rdev = log->rdev;
> >> +	struct mddev *mddev = rdev->mddev;
> >> +	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
> >> +	struct page *page;
> >> +	int i;
> >> +	int ret = 0;
> >> +
> >> +	page = alloc_page(GFP_KERNEL);
> >> +	if (!page)
> >> +		return -ENOMEM;
> >> +
> >> +	/* iterate through all PPL entries saved */
> >> +	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
> >> +		struct ppl_header_entry *e = &pplhdr->entries[i];
> >> +		u32 pp_size = le32_to_cpu(e->pp_size);
> >> +		sector_t sector = ppl_sector;
> >> +		int ppl_entry_sectors = pp_size >> 9;
> >> +		u32 crc, crc_stored;
> >> +
> >> +		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
> >> +			 __func__, rdev->raid_disk, i,
> >> +			 (unsigned long long)ppl_sector, pp_size);
> >> +
> >> +		crc = ~0;
> >> +		crc_stored = le32_to_cpu(e->checksum);
> >> +
> >> +		/* read partial parity for this entry and calculate its checksum */
> >> +		while (pp_size) {
> >> +			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
> >> +
> >> +			if (!sync_page_io(rdev, sector - rdev->data_offset,
> >> +					  s, page, REQ_OP_READ, 0, false)) {
> >> +				md_error(mddev, rdev);
> >> +				ret = -EIO;
> >> +				goto out;
> >> +			}
> >> +
> >> +			crc = crc32c_le(crc, page_address(page), s);
> >> +
> >> +			pp_size -= s;
> >> +			sector += s >> 9;
> >> +		}
> >> +
> >> +		crc = ~crc;
> >> +
> >> +		if (crc != crc_stored) {
> >> +			/*
> >> +			 * Don't recover this entry if the checksum does not
> >> +			 * match, but keep going and try to recover other
> >> +			 * entries.
> >> +			 */
> >> +			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
> >> +				 __func__, crc_stored, crc);
> >> +			ppl_conf->mismatch_count++;
> >> +		} else {
> >> +			ret = ppl_recover_entry(log, e, ppl_sector);
> >> +			if (ret)
> >> +				goto out;
> >> +			ppl_conf->recovered_entries++;
> >> +		}
> >> +
> >> +		ppl_sector += ppl_entry_sectors;
> >> +	}
> >> +out:
> >> +	__free_page(page);
> >> +	return ret;
> >> +}
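
One note on the checksum convention, since both this loop and the header
check in ppl_load_distributed() below rely on it: the CRC is seeded with
~0, folded chunk by chunk, and inverted at the end, which gives the same
result as a single ~crc32c_le(~0, buf, len) pass over the whole buffer.
Roughly, as an illustration only (buf/len are placeholders, not names
from the patch):

	u32 crc = ~0;

	while (len) {
		size_t s = min_t(size_t, len, PAGE_SIZE);

		crc = crc32c_le(crc, buf, s);
		buf += s;
		len -= s;
	}
	crc = ~crc;	/* compare against the stored le32 checksum */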
> >> +
> >> +static int ppl_write_empty_header(struct ppl_log *log)
> >> +{
> >> +	struct page *page;
> >> +	struct ppl_header *pplhdr;
> >> +	struct md_rdev *rdev = log->rdev;
> >> +	int ret = 0;
> >> +
> >> +	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
> >> +		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
> >> +
> >> +	page = alloc_page(GFP_NOIO | __GFP_ZERO);
> >> +	if (!page)
> >> +		return -ENOMEM;
> >> +
> >> +	pplhdr = page_address(page);
> >> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> >> +	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
> >> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
> >> +
> >> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
> >> +			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
> >
> > Write with FUA? Otherwise, the empty header might not actually reach
> > the disks.
> 
> OK.
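
To be concrete, I think passing REQ_FUA as the op_flags argument of
sync_page_io() should be enough, something like this (untested):

	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, REQ_FUA,
			  false)) {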
> 
> >> +		md_error(rdev->mddev, rdev);
> >> +		ret = -EIO;
> >> +	}
> >> +
> >> +	__free_page(page);
> >> +	return ret;
> >> +}
> >> +
> >> +static int ppl_load_distributed(struct ppl_log *log)
> >> +{
> >> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >> +	struct md_rdev *rdev = log->rdev;
> >> +	struct mddev *mddev = rdev->mddev;
> >> +	struct page *page;
> >> +	struct ppl_header *pplhdr;
> >> +	u32 crc, crc_stored;
> >> +	u32 signature;
> >> +	int ret = 0;
> >> +
> >> +	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
> >> +
> >> +	/* read PPL header */
> >> +	page = alloc_page(GFP_KERNEL);
> >> +	if (!page)
> >> +		return -ENOMEM;
> >> +
> >> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
> >> +			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
> >> +		md_error(mddev, rdev);
> >> +		ret = -EIO;
> >> +		goto out;
> >> +	}
> >> +	pplhdr = page_address(page);
> >> +
> >> +	/* check header validity */
> >> +	crc_stored = le32_to_cpu(pplhdr->checksum);
> >> +	pplhdr->checksum = 0;
> >> +	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
> >> +
> >> +	if (crc_stored != crc) {
> >> +		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
> >> +			 __func__, crc_stored, crc);
> >> +		ppl_conf->mismatch_count++;
> >> +		goto out;
> >> +	}
> >> +
> >> +	signature = le32_to_cpu(pplhdr->signature);
> >> +
> >> +	if (mddev->external) {
> >> +		/*
> >> +		 * For external metadata the header signature is set and
> >> +		 * validated in userspace.
> >> +		 */
> >> +		ppl_conf->signature = signature;
> >> +	} else if (ppl_conf->signature != signature) {
> >> +		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
> >> +			 __func__, signature, ppl_conf->signature);
> >> +		ppl_conf->mismatch_count++;
> >> +		goto out;
> >> +	}
> >> +
> >> +	/* attempt to recover from log if we are starting a dirty array */
> >> +	if (!mddev->pers && mddev->recovery_cp != MaxSector)
> >> +		ret = ppl_recover(log, pplhdr);
> >
> > When could mddev->pers be NULL?
> 
> When the array is inactive - we are starting the array with PPL, not
> enabling PPL at runtime.

I didn't follow this. Can you explain more?

Thanks,
Shaohua