Re: [PATCH v4 3/7] raid5-ppl: Partial Parity Log write logging implementation

On 02/28/2017 01:04 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:43:57PM +0100, Artur Paszkiewicz wrote:
>> This implements the PPL write logging functionality. The description
>> of PPL is added to the documentation. More details can be found in the
>> comments in raid5-ppl.c.
>>
>> Put the PPL metadata structures to md_p.h because userspace tools
>> (mdadm) will also need to read/write PPL.
>>
>> Warn about using PPL with enabled disk volatile write-back cache for
>> now. It can be removed once disk cache flushing before writing PPL is
>> implemented.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@xxxxxxxxx>
>> ---
>>  Documentation/md/raid5-ppl.txt |  44 +++
>>  drivers/md/Makefile            |   2 +-
>>  drivers/md/raid5-ppl.c         | 617 +++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c             |  49 +++-
>>  drivers/md/raid5.h             |   8 +
>>  include/uapi/linux/raid/md_p.h |  26 ++
>>  6 files changed, 738 insertions(+), 8 deletions(-)
>>  create mode 100644 Documentation/md/raid5-ppl.txt
>>  create mode 100644 drivers/md/raid5-ppl.c
>>
>> diff --git a/Documentation/md/raid5-ppl.txt b/Documentation/md/raid5-ppl.txt
>> new file mode 100644
>> index 000000000000..127072b09363
>> --- /dev/null
>> +++ b/Documentation/md/raid5-ppl.txt
>> @@ -0,0 +1,44 @@
>> +Partial Parity Log
>> +
>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>> +may become inconsistent with data on other member disks. If the array is also
>> +in degraded state, there is no way to recalculate parity, because one of the
>> +disks is missing. This can lead to silent data corruption when rebuilding the
>> +array or using it as degraded - data calculated from parity for array blocks
>> +that have not been touched by a write request during the unclean shutdown can
>> +be incorrect. Such a condition is known as the RAID5 Write Hole. Because of
>> +this, md by default does not allow starting a dirty degraded array.
>> +
>> +Partial parity for a write operation is the XOR of stripe data chunks not
>> +modified by this write. It is just enough data to recover from the
>> +write hole. XORing partial parity with the modified chunks produces parity for
>> +the stripe, consistent with its state before the write operation, regardless of
>> +which chunk writes have completed. If one of the unmodified data disks of
>> +this stripe is missing, this updated parity can be used to recover its
>> +contents. PPL recovery is also performed when starting an array after an
>> +unclean shutdown and all disks are available, eliminating the need to resync
>> +the array. Because of this, using write-intent bitmap and PPL together is not
>> +supported.
>> +
>> +When handling a write request, PPL writes partial parity before new data and
>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>> +array member drives in the metadata area, on the parity drive of a particular
>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>> +reduced by up to 30%-40%, but it scales with the number of drives in the
>> +array and there is no dedicated journaling drive to become a bottleneck or
>> +a single point of failure.
>> +
>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>> +not a true journal. It does not protect from losing in-flight data, only from
>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>> +performed for this stripe (parity is not updated). So it is possible to have
>> +arbitrary data in the written part of a stripe if that disk is lost. In such a
>> +case the behavior is the same as in plain raid5.
>> +
>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>> +metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
>> +
>> +Currently, volatile write-back cache should be disabled on all member drives
>> +when using PPL. Otherwise PPL cannot guarantee consistency in case of power
>> +failure.
>> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
>> index 3cbda1af87a0..4d48714ccc6b 100644
>> --- a/drivers/md/Makefile
>> +++ b/drivers/md/Makefile
>> @@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
>>  dm-era-y	+= dm-era-target.o
>>  dm-verity-y	+= dm-verity-target.o
>>  md-mod-y	+= md.o bitmap.o
>> -raid456-y	+= raid5.o raid5-cache.o
>> +raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
>>  
>>  # Note: link order is important.  All raid personalities
>>  # and must come before md.o, as they each initialise 
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> new file mode 100644
>> index 000000000000..a00cabf1adf6
>> --- /dev/null
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -0,0 +1,617 @@
>> +/*
>> + * Partial Parity Log for closing the RAID5 write hole
>> + * Copyright (c) 2017, Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/slab.h>
>> +#include <linux/crc32c.h>
>> +#include <linux/raid/md_p.h>
>> +#include "md.h"
>> +#include "raid5.h"
>> +
>> +/*
>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>> + * partial parity data. The header contains an array of entries
>> + * (struct ppl_header_entry) which describe the logged write requests.
>> + * Partial parity for the entries comes after the header, written in the same
>> + * sequence as the entries:
>> + *
>> + * Header
>> + *   entry0
>> + *   ...
>> + *   entryN
>> + * PP data
>> + *   PP for entry0
>> + *   ...
>> + *   PP for entryN
>> + *
>> + * Every entry holds a checksum of its partial parity, the header also has a
>> + * checksum of the header itself. Entries for full stripes writes contain no
>> + * partial parity, they only mark the stripes for which parity should be
>> + * recalculated after an unclean shutdown.
> 
> Please describe the disk format in detail. I had trouble understanding
> ppl_log_stripe() and ppl_submit_iounit(), things like the ppl merge and
> full stripe handling.

OK, I will improve it, but more information about this is also in patch 5.
 
>> + *
>> + * A write request is always logged to the PPL instance stored on the parity
>> + * disk of the corresponding stripe. For each member disk there is one ppl_log
>> + * used to handle logging for this disk, independently from others. They are
>> + * grouped in child_logs array in struct ppl_conf, which is assigned to
>> + * r5conf->ppl and used in raid5 core.
>> + *
>> + * ppl_io_unit represents a full PPL write, header_page contains the ppl_header.
>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>> + * be appended to the last entry if the chunks to write are the same, otherwise
>> + * a new entry is added. Checksums of entries are calculated incrementally as
>> + * stripes containing partial parity are being added to entries.
>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>> + * containing the header page and partial parity pages (sh->ppl_page) for all
>> + * stripes of the io_unit. When the PPL write completes, the stripes associated
>> + * with the io_unit are released and raid5d starts writing their data and
>> + * parity. When all stripes are written, the io_unit is freed and the next can
>> + * be submitted.
>> + *
>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>> + * can't be submitted until the previous has completed (PPL and stripe
>> + * data+parity is written). The log->io_list tracks all io_units of a log
>> + * (for a single member disk). New io_units are added to the end of the list
>> + * and the first io_unit is submitted, if it is not submitted already.
>> + * The current io_unit accepting new stripes is always at the end of the list.
>> + */
>> +
>> +struct ppl_conf {
>> +	struct mddev *mddev;
>> +
>> +	/* array of child logs, one for each raid disk */
>> +	struct ppl_log *child_logs;
>> +	int count;
>> +
>> +	int block_size;		/* the logical block size used for data_sector
>> +				 * in ppl_header_entry */
>> +	u32 signature;		/* raid array identifier */
>> +	atomic64_t seq;		/* current log write sequence number */
>> +
>> +	struct kmem_cache *io_kc;
>> +	mempool_t *io_pool;
>> +	struct bio_set *bs;
>> +	mempool_t *meta_pool;
>> +};
>> +
>> +struct ppl_log {
>> +	struct ppl_conf *ppl_conf;	/* shared between all log instances */
>> +
>> +	struct md_rdev *rdev;		/* array member disk associated with
>> +					 * this log instance */
>> +	struct mutex io_mutex;
>> +	struct ppl_io_unit *current_io;	/* current io_unit accepting new data
>> +					 * always at the end of io_list */
>> +	spinlock_t io_list_lock;
>> +	struct list_head io_list;	/* all io_units of this log */
>> +	struct list_head no_mem_stripes;/* stripes to retry if failed to
>> +					 * allocate io_unit */
>> +};
>> +
>> +struct ppl_io_unit {
>> +	struct ppl_log *log;
>> +
>> +	struct page *header_page;	/* for ppl_header */
>> +
>> +	unsigned int entries_count;	/* number of entries in ppl_header */
>> +	unsigned int pp_size;		/* current total size of partial parity */
>> +
>> +	u64 seq;			/* sequence number of this log write */
>> +	struct list_head log_sibling;	/* log->io_list */
>> +
>> +	struct list_head stripe_list;	/* stripes added to the io_unit */
>> +	atomic_t pending_stripes;	/* how many stripes not written to raid */
>> +
>> +	bool submitted;			/* true if write to log started */
>> +};
>> +
>> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>> +					  struct stripe_head *sh)
>> +{
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct ppl_io_unit *io;
>> +	struct ppl_header *pplhdr;
>> +
>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
>> +	if (!io)
>> +		return NULL;
>> +
>> +	memset(io, 0, sizeof(*io));
>> +	io->log = log;
>> +	INIT_LIST_HEAD(&io->log_sibling);
>> +	INIT_LIST_HEAD(&io->stripe_list);
>> +	atomic_set(&io->pending_stripes, 0);
>> +
>> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
>> +	pplhdr = page_address(io->header_page);
>> +	clear_page(pplhdr);
>> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
>> +	pplhdr->signature = cpu_to_le32(ppl_conf->signature);
>> +
>> +	io->seq = atomic64_add_return(1, &ppl_conf->seq);
>> +	pplhdr->generation = cpu_to_le64(io->seq);
>> +
>> +	return io;
>> +}
>> +
>> +static int ppl_log_stripe(struct ppl_log *log, struct stripe_head *sh)
>> +{
>> +	struct ppl_io_unit *io = log->current_io;
>> +	struct ppl_header_entry *e = NULL;
>> +	struct ppl_header *pplhdr;
>> +	int i;
>> +	sector_t data_sector = 0;
>> +	int data_disks = 0;
>> +	unsigned int entry_space = (log->rdev->ppl.size << 9) - PPL_HEADER_SIZE;
>> +	struct r5conf *conf = sh->raid_conf;
>> +
>> +	pr_debug("%s: stripe: %llu\n", __func__, (unsigned long long)sh->sector);
>> +
>> +	/* check if current io_unit is full */
>> +	if (io && (io->pp_size == entry_space ||
>> +		   io->entries_count == PPL_HDR_MAX_ENTRIES)) {
>> +		pr_debug("%s: add io_unit blocked by seq: %llu\n",
>> +			 __func__, io->seq);
>> +		io = NULL;
>> +	}
>> +
>> +	/* add a new unit if there is none or the current is full */
>> +	if (!io) {
>> +		io = ppl_new_iounit(log, sh);
>> +		if (!io)
>> +			return -ENOMEM;
>> +		spin_lock_irq(&log->io_list_lock);
>> +		list_add_tail(&io->log_sibling, &log->io_list);
>> +		spin_unlock_irq(&log->io_list_lock);
>> +
>> +		log->current_io = io;
>> +	}
>> +
>> +	for (i = 0; i < sh->disks; i++) {
>> +		struct r5dev *dev = &sh->dev[i];
>> +
>> +		if (i != sh->pd_idx && test_bit(R5_Wantwrite, &dev->flags)) {
>> +			if (!data_disks || dev->sector < data_sector)
>> +				data_sector = dev->sector;
>> +			data_disks++;
>> +		}
>> +	}
>> +	BUG_ON(!data_disks);
>> +
>> +	pr_debug("%s: seq: %llu data_sector: %llu data_disks: %d\n", __func__,
>> +		 io->seq, (unsigned long long)data_sector, data_disks);
>> +
>> +	pplhdr = page_address(io->header_page);
>> +
>> +	if (io->entries_count > 0) {
>> +		struct ppl_header_entry *last =
>> +				&pplhdr->entries[io->entries_count - 1];
>> +		u64 data_sector_last = le64_to_cpu(last->data_sector);
>> +		u32 data_size_last = le32_to_cpu(last->data_size);
>> +		u32 pp_size_last = le32_to_cpu(last->pp_size);
>> +
>> +		/*
>> +		 * Check if we can merge with the last entry. Must be on
>> +		 * the same stripe and disks. Use bit shift and logarithm
>> +		 * to avoid 64-bit division.
>> +		 */
>> +		if ((data_sector >> ilog2(conf->chunk_sectors) ==
>> +		     data_sector_last >> ilog2(conf->chunk_sectors)) &&
>> +		    ((pp_size_last == 0 &&
>> +		      test_bit(STRIPE_FULL_WRITE, &sh->state)) ||
>> +		     ((data_sector_last + (pp_size_last >> 9) == data_sector) &&
>> +		      (data_size_last == pp_size_last * data_disks))))
> 
> please explain this part

This checks whether the last and the currently logged stripe_head are on
the same stripe and the disks to write are the same. The conditions are
probably more complicated than they should be; I'll try to simplify them.
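To make the condition concrete, the intended logic could be factored out
like this (just an illustrative sketch, not the patch code - the helper
name and signature are made up):

static bool ppl_entry_can_merge(struct ppl_header_entry *last,
				sector_t data_sector, int data_disks,
				bool full_stripe_write, int chunk_sectors)
{
	u64 data_sector_last = le64_to_cpu(last->data_sector);
	u32 data_size_last = le32_to_cpu(last->data_size);
	u32 pp_size_last = le32_to_cpu(last->pp_size);

	/* must lie in the same chunk-aligned stripe */
	if (data_sector >> ilog2(chunk_sectors) !=
	    data_sector_last >> ilog2(chunk_sectors))
		return false;

	/* both full stripe writes: no partial parity to append */
	if (pp_size_last == 0 && full_stripe_write)
		return true;

	/* new data directly follows and covers the same data disks */
	return data_sector_last + (pp_size_last >> 9) == data_sector &&
	       data_size_last == pp_size_last * data_disks;
}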
 
>> +			e = last;
>> +	}
>> +
>> +	if (!e) {
>> +		e = &pplhdr->entries[io->entries_count++];
>> +		e->data_sector = cpu_to_le64(data_sector);
> 
> Don't really understand what the data_sector is for. Should this be stripe->sector?

This is the first raid sector covered by the ppl entry. It can be equal
to stripe->sector if the write starts at the first disk of the stripe.
More generally, it is the dev->sector of the first device with
R5_Wantwrite set in the first stripe of the ppl entry.
 
>> +		e->parity_disk = cpu_to_le32(sh->pd_idx);
>> +		e->checksum = cpu_to_le32(~0);
>> +	}
>> +
>> +	le32_add_cpu(&e->data_size, data_disks << PAGE_SHIFT);
>> +
>> +	/* don't write any PP if full stripe write */
>> +	if (!test_bit(STRIPE_FULL_WRITE, &sh->state)) {
>> +		le32_add_cpu(&e->pp_size, PAGE_SIZE);
>> +		io->pp_size += PAGE_SIZE;
>> +		e->checksum = cpu_to_le32(crc32c_le(le32_to_cpu(e->checksum),
>> +						    page_address(sh->ppl_page),
>> +						    PAGE_SIZE));
>> +	}
>> +
>> +	list_add_tail(&sh->log_list, &io->stripe_list);
>> +	atomic_inc(&io->pending_stripes);
>> +	sh->ppl_log_io = io;
>> +
>> +	return 0;
>> +}
>> +
>> +int ppl_write_stripe(struct ppl_conf *ppl_conf, struct stripe_head *sh)
>> +{
>> +	struct ppl_log *log;
>> +	struct ppl_io_unit *io = sh->ppl_log_io;
>> +
>> +	if (io || test_bit(STRIPE_SYNCING, &sh->state) ||
>> +	    !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
>> +	    !test_bit(R5_Insync, &sh->dev[sh->pd_idx].flags)) {
>> +		clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
>> +		return -EAGAIN;
>> +	}
>> +
>> +	log = &ppl_conf->child_logs[sh->pd_idx];
>> +
>> +	mutex_lock(&log->io_mutex);
>> +
>> +	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
> 
> is the log->rdev check required?

Yes, log->rdev can be null if the member drive for this stripe's parity
is missing. Patch 6 ("support disk hot add/remove with PPL") makes it
more apparent.
 
>> +		mutex_unlock(&log->io_mutex);
>> +		return -EAGAIN;
>> +	}
>> +
>> +	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
>> +	clear_bit(STRIPE_DELAYED, &sh->state);
>> +	atomic_inc(&sh->count);
>> +
>> +	if (ppl_log_stripe(log, sh)) {
>> +		spin_lock_irq(&log->io_list_lock);
>> +		list_add_tail(&sh->log_list, &log->no_mem_stripes);
>> +		spin_unlock_irq(&log->io_list_lock);
>> +	}
>> +
>> +	mutex_unlock(&log->io_mutex);
>> +
>> +	return 0;
>> +}
>> +
>> +static void ppl_log_endio(struct bio *bio)
>> +{
>> +	struct ppl_io_unit *io = bio->bi_private;
>> +	struct ppl_log *log = io->log;
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct stripe_head *sh, *next;
>> +
>> +	pr_debug("%s: seq: %llu\n", __func__, io->seq);
>> +
>> +	if (bio->bi_error)
>> +		md_error(ppl_conf->mddev, log->rdev);
>> +
>> +	bio_put(bio);
>> +	mempool_free(io->header_page, ppl_conf->meta_pool);
>> +
>> +	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>> +		list_del_init(&sh->log_list);
>> +
>> +		set_bit(STRIPE_HANDLE, &sh->state);
>> +		raid5_release_stripe(sh);
>> +	}
>> +}
>> +
>> +static void ppl_submit_iounit(struct ppl_io_unit *io)
>> +{
>> +	struct ppl_log *log = io->log;
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct r5conf *conf = ppl_conf->mddev->private;
>> +	struct ppl_header *pplhdr = page_address(io->header_page);
>> +	struct bio *bio;
>> +	struct stripe_head *sh;
>> +	int i;
>> +	struct bio_list bios = BIO_EMPTY_LIST;
>> +	char b[BDEVNAME_SIZE];
>> +
>> +	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, ppl_conf->bs);
>> +	bio->bi_private = io;
>> +	bio->bi_end_io = ppl_log_endio;
>> +	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
>> +	bio->bi_bdev = log->rdev->bdev;
>> +	bio->bi_iter.bi_sector = log->rdev->ppl.sector;
>> +	bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
>> +	bio_list_add(&bios, bio);
>> +
>> +	sh = list_first_entry(&io->stripe_list, struct stripe_head, log_list);
>> +
>> +	for (i = 0; i < io->entries_count; i++) {
>> +		struct ppl_header_entry *e = &pplhdr->entries[i];
>> +		u32 pp_size = le32_to_cpu(e->pp_size);
>> +		u32 data_size = le32_to_cpu(e->data_size);
>> +		u64 data_sector = le64_to_cpu(e->data_sector);
>> +		int stripes_count;
>> +
>> +		if (pp_size > 0)
>> +			stripes_count = pp_size >> PAGE_SHIFT;
>> +		else
>> +			stripes_count = (data_size /
>> +					 (conf->raid_disks -
>> +					  conf->max_degraded)) >> PAGE_SHIFT;
> 
> please explain this part

The number of stripes in a ppl entry is equal to the size of its partial
parity divided by 4k, except when the stripes are STRIPE_FULL_WRITE -
then there is no partial parity and the number of stripes is instead the
entry's data size / number of array data disks / 4k.
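
A worked example (hypothetical numbers): 6-disk RAID5, so
raid_disks - max_degraded = 5 data disks, with 4k pages:

	entry with partial parity:  pp_size = 12k
				    -> 12k >> PAGE_SHIFT = 3 stripes
	full stripe write entry:    pp_size = 0, data_size = 60k
				    -> (60k / 5) >> PAGE_SHIFT = 3 stripes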

>> +
>> +		while (stripes_count--) {
>> +			/*
>> +			 * if entry without partial parity just skip its stripes
>> +			 * without adding pages to bio
>> +			 */
>> +			if (pp_size > 0 &&
>> +			    !bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
>> +				struct bio *prev = bio;
>> +
>> +				bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
>> +						       ppl_conf->bs);
> 
> The code keeps allocating bios but doesn't dispatch any yet. The bioset can't
> guarantee that the allocations succeed. To have that guarantee, we must make
> sure previous bios can finish.

Is it ok to submit the bio before allocating another from the same
bioset, without waiting for it to complete? The description for
bio_alloc_bioset() suggests this, if I understand it correctly: "Callers
that need to allocate more than 1 bio must always submit the previously
allocated bio for IO before attempting to allocate a new one." 
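
If that is allowed, the loop could submit each bio as soon as it fills
up, before allocating its successor from the same bioset - roughly like
this (an untested sketch of the idea):

	if (!bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
		struct bio *prev = bio;

		bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
				       ppl_conf->bs);
		bio->bi_opf = prev->bi_opf;
		bio->bi_bdev = prev->bi_bdev;
		bio->bi_iter.bi_sector = bio_end_sector(prev);
		bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);

		bio_chain(bio, prev);
		/* dispatch prev before allocating from the bioset again */
		submit_bio(prev);
	}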
 
>> +				bio->bi_opf = prev->bi_opf;
>> +				bio->bi_bdev = prev->bi_bdev;
>> +				bio->bi_iter.bi_sector = bio_end_sector(prev);
>> +				bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
>> +				bio_chain(bio, prev);
>> +				bio_list_add(&bios, bio);
>> +			}
>> +			sh = list_next_entry(sh, log_list);
>> +		}
>> +
>> +		pr_debug("%s: seq: %llu entry: %d data_sector: %llu pp_size: %u data_size: %u\n",
>> +			 __func__, io->seq, i, data_sector, pp_size, data_size);
>> +
>> +		e->data_sector = cpu_to_le64(data_sector >>
>> +					     ilog2(ppl_conf->block_size >> 9));
>> +		e->checksum = cpu_to_le32(~le32_to_cpu(e->checksum));
>> +	}
>> +
>> +	pplhdr->entries_count = cpu_to_le32(io->entries_count);
>> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));
> 
> so the checksum is ~crc32c, right? Please document that in the ppl_header_entry definition.

Ok, I will add it to the descriptions for both header and entry checksums.
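
For reference, verification on the read side would then be the mirror
operation, roughly (a sketch; it relies on the checksum field being zero
while the CRC is computed, which holds here because the header page is
cleared before any field is filled in):

	u32 stored = le32_to_cpu(pplhdr->checksum);

	pplhdr->checksum = 0;
	if (stored != ~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE))
		return -EINVAL; /* corrupted or torn header write */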

>> +
>> +	while ((bio = bio_list_pop(&bios))) {
>> +		pr_debug("%s: seq: %llu submit_bio() size: %u sector: %llu dev: %s\n",
>> +			 __func__, io->seq, bio->bi_iter.bi_size,
>> +			 (unsigned long long)bio->bi_iter.bi_sector,
>> +			 bdevname(bio->bi_bdev, b));
>> +		submit_bio(bio);
>> +	}
>> +}
>> +
>> +static void ppl_submit_current_io(struct ppl_log *log)
>> +{
>> +	struct ppl_io_unit *io;
>> +
>> +	spin_lock_irq(&log->io_list_lock);
>> +
>> +	io = list_first_entry_or_null(&log->io_list, struct ppl_io_unit,
>> +				      log_sibling);
>> +	if (io && io->submitted)
>> +		io = NULL;
>> +
>> +	spin_unlock_irq(&log->io_list_lock);
>> +
>> +	if (io) {
>> +		io->submitted = true;
>> +
>> +		if (io == log->current_io)
>> +			log->current_io = NULL;
>> +
>> +		ppl_submit_iounit(io);
>> +	}
>> +}
>> +
>> +void ppl_write_stripe_run(struct ppl_conf *ppl_conf)
>> +{
>> +	struct ppl_log *log;
>> +	int i;
>> +
>> +	for (i = 0; i < ppl_conf->count; i++) {
>> +		log = &ppl_conf->child_logs[i];
>> +
>> +		mutex_lock(&log->io_mutex);
>> +		ppl_submit_current_io(log);
>> +		mutex_unlock(&log->io_mutex);
>> +	}
>> +}
>> +
>> +static void ppl_io_unit_finished(struct ppl_io_unit *io)
>> +{
>> +	struct ppl_log *log = io->log;
>> +	unsigned long flags;
>> +
>> +	pr_debug("%s: seq: %llu\n", __func__, io->seq);
>> +
>> +	spin_lock_irqsave(&log->io_list_lock, flags);
>> +
>> +	list_del(&io->log_sibling);
>> +	mempool_free(io, log->ppl_conf->io_pool);
>> +
>> +	if (!list_empty(&log->no_mem_stripes)) {
>> +		struct stripe_head *sh = list_first_entry(&log->no_mem_stripes,
>> +							  struct stripe_head,
>> +							  log_list);
>> +		list_del_init(&sh->log_list);
>> +		set_bit(STRIPE_HANDLE, &sh->state);
>> +		raid5_release_stripe(sh);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&log->io_list_lock, flags);
>> +}
>> +
>> +void ppl_stripe_write_finished(struct stripe_head *sh)
>> +{
>> +	struct ppl_io_unit *io;
>> +
>> +	io = sh->ppl_log_io;
>> +	sh->ppl_log_io = NULL;
>> +
>> +	if (io && atomic_dec_and_test(&io->pending_stripes))
>> +		ppl_io_unit_finished(io);
>> +}
>> +
>> +void ppl_exit_log(struct ppl_conf *ppl_conf)
>> +{
>> +	kfree(ppl_conf->child_logs);
>> +
>> +	mempool_destroy(ppl_conf->meta_pool);
>> +	if (ppl_conf->bs)
>> +		bioset_free(ppl_conf->bs);
>> +	mempool_destroy(ppl_conf->io_pool);
>> +	kmem_cache_destroy(ppl_conf->io_kc);
>> +
>> +	kfree(ppl_conf);
>> +}
>> +
>> +static int ppl_validate_rdev(struct md_rdev *rdev)
>> +{
>> +	char b[BDEVNAME_SIZE];
>> +	int ppl_data_sectors;
>> +	int ppl_size_new;
>> +
>> +	/*
>> +	 * The configured PPL size must be enough to store
>> +	 * the header and (at the very least) partial parity
>> +	 * for one stripe. Round it down to ensure the data
>> +	 * space is cleanly divisible by stripe size.
>> +	 */
>> +	ppl_data_sectors = rdev->ppl.size - (PPL_HEADER_SIZE >> 9);
>> +
>> +	if (ppl_data_sectors > 0)
>> +		ppl_data_sectors = rounddown(ppl_data_sectors, STRIPE_SECTORS);
>> +
>> +	if (ppl_data_sectors <= 0) {
>> +		pr_warn("md/raid:%s: PPL space too small on %s\n",
>> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
>> +		return -ENOSPC;
>> +	}
>> +
>> +	ppl_size_new = ppl_data_sectors + (PPL_HEADER_SIZE >> 9);
>> +
>> +	if ((rdev->ppl.sector < rdev->data_offset &&
>> +	     rdev->ppl.sector + ppl_size_new > rdev->data_offset) ||
>> +	    (rdev->ppl.sector >= rdev->data_offset &&
>> +	     rdev->data_offset + rdev->sectors > rdev->ppl.sector)) {
>> +		pr_warn("md/raid:%s: PPL space overlaps with data on %s\n",
>> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!rdev->mddev->external &&
>> +	    ((rdev->ppl.offset > 0 && rdev->ppl.offset < (rdev->sb_size >> 9)) ||
>> +	     (rdev->ppl.offset <= 0 && rdev->ppl.offset + ppl_size_new > 0))) {
>> +		pr_warn("md/raid:%s: PPL space overlaps with superblock on %s\n",
>> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
>> +		return -EINVAL;
>> +	}
>> +
>> +	rdev->ppl.size = ppl_size_new;
>> +
>> +	return 0;
>> +}
>> +
>> +int ppl_init_log(struct r5conf *conf)
>> +{
>> +	struct ppl_conf *ppl_conf;
>> +	struct mddev *mddev = conf->mddev;
>> +	int ret = 0;
>> +	int i;
>> +	bool need_cache_flush;
>> +
>> +	if (PAGE_SIZE != 4096)
>> +		return -EINVAL;
>> +
>> +	if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
>> +		pr_warn("md/raid:%s PPL is not compatible with bitmap\n",
>> +			mdname(mddev));
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
>> +		pr_warn("md/raid:%s PPL is not compatible with journal\n",
>> +			mdname(mddev));
>> +		return -EINVAL;
>> +	}
> 
> shouldn't these two checks be put into md.c, where we load the rdev?

Then they would not be performed for external metadata and when enabling
ppl at runtime. I think we need this in a central place where ppl is
started.

>> +
>> +	ppl_conf = kzalloc(sizeof(struct ppl_conf), GFP_KERNEL);
>> +	if (!ppl_conf)
>> +		return -ENOMEM;
>> +
>> +	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
>> +	if (!ppl_conf->io_kc) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
>> +	if (!ppl_conf->io_pool) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
>> +	if (!ppl_conf->bs) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
>> +	if (!ppl_conf->meta_pool) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->mddev = mddev;
>> +	ppl_conf->count = conf->raid_disks;
>> +	ppl_conf->child_logs = kcalloc(ppl_conf->count, sizeof(struct ppl_log),
>> +				       GFP_KERNEL);
>> +	if (!ppl_conf->child_logs) {
>> +		ret = -ENOMEM;
>> +		goto err;
>> +	}
>> +
>> +	atomic64_set(&ppl_conf->seq, 0);
>> +
>> +	if (!mddev->external) {
>> +		ppl_conf->signature = ~crc32c_le(~0, mddev->uuid, sizeof(mddev->uuid));
>> +		ppl_conf->block_size = 512;
>> +	} else {
>> +		ppl_conf->block_size = queue_logical_block_size(mddev->queue);
>> +	}
> 
> Can we always set block_size to 512? Don't really understand this part.

For native md arrays we can use 512-byte sectors, like in the rest of the
md metadata, but for imsm the values in ppl_header_entry.data_sector must
be in logical sectors, for compatibility with UEFI and Windows.
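
A hypothetical example of what that means for the conversion in
ppl_submit_iounit(): with block_size = 4096 (imsm on a 4k-sector drive)
the entry is stored as

	e->data_sector = data_sector >> ilog2(4096 >> 9)
		       = data_sector >> 3

so an internal 512-byte sector number of 2048 becomes 256 logical
blocks; with block_size = 512 the shift is 0 and the value is stored
unchanged.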

>> +
>> +	for (i = 0; i < ppl_conf->count; i++) {
>> +		struct ppl_log *log = &ppl_conf->child_logs[i];
>> +		struct md_rdev *rdev = conf->disks[i].rdev;
>> +
>> +		mutex_init(&log->io_mutex);
>> +		spin_lock_init(&log->io_list_lock);
>> +		INIT_LIST_HEAD(&log->io_list);
>> +		INIT_LIST_HEAD(&log->no_mem_stripes);
>> +
>> +		log->ppl_conf = ppl_conf;
>> +		log->rdev = rdev;
>> +
>> +		if (rdev) {
>> +			struct request_queue *q;
>> +
>> +			ret = ppl_validate_rdev(rdev);
>> +			if (ret)
>> +				goto err;
>> +
>> +			q = bdev_get_queue(rdev->bdev);
>> +			if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
>> +				need_cache_flush = true;
>> +		}
>> +	}
>> +
>> +	if (need_cache_flush)
>> +		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
>> +			mdname(mddev));
>> +
>> +	conf->ppl = ppl_conf;
>> +
>> +	return 0;
>> +err:
>> +	ppl_exit_log(ppl_conf);
>> +	return ret;
>> +}
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 02e02fe5b04e..21440b594878 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -739,7 +739,7 @@ static bool stripe_can_batch(struct stripe_head *sh)
>>  {
>>  	struct r5conf *conf = sh->raid_conf;
>>  
>> -	if (conf->log)
>> +	if (conf->log || conf->ppl)
>>  		return false;
> 
> Maybe add a new inline function into raid5-log.h

OK, I will move it.
 
>>  	return test_bit(STRIPE_BATCH_READY, &sh->state) &&
>>  		!test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
>> @@ -936,6 +936,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>>  		}
>>  	}
>>  
>> +	if (conf->ppl && ppl_write_stripe(conf->ppl, sh) == 0)
>> +		return;
> 
> please put the one and the above raid5-cache part into a separate function into raid5-log.h
> 
>> +
>>  	for (i = disks; i--; ) {
>>  		int op, op_flags = 0;
>>  		int replace_only = 0;
>> @@ -3308,6 +3311,16 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
>>  			     &dd_idx, sh);
>>  }
>>  
>> +static void log_stripe_write_finished(struct stripe_head *sh)
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +
>> +	if (conf->log)
>> +		r5l_stripe_write_finished(sh);
>> +	else if (conf->ppl)
>> +		ppl_stripe_write_finished(sh);
>> +}
> 
> please put this into raid5-log.h
> 
>> +
>>  static void
>>  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  				struct stripe_head_state *s, int disks,
>> @@ -3347,7 +3360,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  		if (bi)
>>  			bitmap_end = 1;
>>  
>> -		r5l_stripe_write_finished(sh);
>> +		log_stripe_write_finished(sh);
>>  
>>  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
>>  			wake_up(&conf->wait_for_overlap);
>> @@ -3766,7 +3779,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>>  				discard_pending = 1;
>>  		}
>>  
>> -	r5l_stripe_write_finished(sh);
>> +	log_stripe_write_finished(sh);
>>  
>>  	if (!discard_pending &&
>>  	    test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) {
>> @@ -4756,7 +4769,7 @@ static void handle_stripe(struct stripe_head *sh)
>>  
>>  	if (s.just_cached)
>>  		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
>> -	r5l_stripe_write_finished(sh);
>> +	log_stripe_write_finished(sh);
>>  
>>  	/* Now we might consider reading some blocks, either to check/generate
>>  	 * parity, or to satisfy requests
>> @@ -6120,6 +6133,14 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
>>  	return handled;
>>  }
>>  
>> +static void log_write_stripe_run(struct r5conf *conf)
>> +{
>> +	if (conf->log)
>> +		r5l_write_stripe_run(conf->log);
>> +	else if (conf->ppl)
>> +		ppl_write_stripe_run(conf->ppl);
>> +}
> 
> ditto
> 
>> +
>>  static int handle_active_stripes(struct r5conf *conf, int group,
>>  				 struct r5worker *worker,
>>  				 struct list_head *temp_inactive_list)
>> @@ -6157,7 +6178,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
>>  
>>  	for (i = 0; i < batch_size; i++)
>>  		handle_stripe(batch[i]);
>> -	r5l_write_stripe_run(conf->log);
>> +	log_write_stripe_run(conf);
>>  
>>  	cond_resched();
>>  
>> @@ -6735,6 +6756,8 @@ static void free_conf(struct r5conf *conf)
>>  
>>  	if (conf->log)
>>  		r5l_exit_log(conf->log);
>> +	if (conf->ppl)
>> +		ppl_exit_log(conf->ppl);
> 
> Ditto
> 
>>  	if (conf->shrinker.nr_deferred)
>>  		unregister_shrinker(&conf->shrinker);
>>  
>> @@ -7196,6 +7219,13 @@ static int raid5_run(struct mddev *mddev)
>>  		BUG_ON(mddev->delta_disks != 0);
>>  	}
>>  
>> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
>> +	    test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
>> +			mdname(mddev));
>> +		clear_bit(MD_HAS_PPL, &mddev->flags);
>> +	}
>> +
>>  	if (mddev->private == NULL)
>>  		conf = setup_conf(mddev);
>>  	else
>> @@ -7422,6 +7452,11 @@ static int raid5_run(struct mddev *mddev)
>>  			 mdname(mddev), bdevname(journal_dev->bdev, b));
>>  		if (r5l_init_log(conf, journal_dev))
>>  			goto abort;
>> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		pr_debug("md/raid:%s: enabling distributed Partial Parity Log\n",
>> +			 mdname(mddev));
>> +		if (ppl_init_log(conf))
>> +			goto abort;
>>  	}
> 
> Ditto, if possible
> 
>>  
>>  	return 0;
>> @@ -7690,7 +7725,7 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
>>  	sector_t newsize;
>>  	struct r5conf *conf = mddev->private;
>>  
>> -	if (conf->log)
>> +	if (conf->log || conf->ppl)
>>  		return -EINVAL;
>>  	sectors &= ~((sector_t)conf->chunk_sectors - 1);
>>  	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
>> @@ -7743,7 +7778,7 @@ static int check_reshape(struct mddev *mddev)
>>  {
>>  	struct r5conf *conf = mddev->private;
>>  
>> -	if (conf->log)
>> +	if (conf->log || conf->ppl)
> 
> there are other places where we check conf->log, like stripe_can_batch(). I
> think the ppl code can't handle batching right now.

Yes, I added this also in stripe_can_batch(). I checked all other places
where conf->log is used and I think that everything is covered in this
patchset, but I will check again.

>>  		return -EINVAL;
>>  	if (mddev->delta_disks == 0 &&
>>  	    mddev->new_layout == mddev->layout &&
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index 3cc4cb28f7e6..f915a7a0e752 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -230,6 +230,7 @@ struct stripe_head {
>>  	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
>>  
>>  	struct page		*ppl_page; /* partial parity of this stripe */
>> +	struct ppl_io_unit	*ppl_log_io;
> 
> Maybe we should put the ppl fields and raid5 fields to a union

Good idea, will do.

>>  	/**
>>  	 * struct stripe_operations
>>  	 * @target - STRIPE_OP_COMPUTE_BLK target
>> @@ -689,6 +690,7 @@ struct r5conf {
>>  	int			group_cnt;
>>  	int			worker_cnt_per_group;
>>  	struct r5l_log		*log;
>> +	struct ppl_conf		*ppl;
> 
> Maybe use void * log_private, so works for both raid5-cache and ppl

Do you mean replace "struct ppl_conf *ppl" with "void *log_private"?
 
>>  
>>  	struct bio_list		pending_bios;
>>  	spinlock_t		pending_bios_lock;
>> @@ -798,4 +800,10 @@ extern void r5c_check_cached_full_stripe(struct r5conf *conf);
>>  extern struct md_sysfs_entry r5c_journal_mode;
>>  extern void r5c_update_on_rdev_error(struct mddev *mddev);
>>  extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
>> +
>> +extern int ppl_init_log(struct r5conf *conf);
>> +extern void ppl_exit_log(struct ppl_conf *log);
>> +extern int ppl_write_stripe(struct ppl_conf *log, struct stripe_head *sh);
>> +extern void ppl_write_stripe_run(struct ppl_conf *log);
>> +extern void ppl_stripe_write_finished(struct stripe_head *sh);
> 
> These part and the raid5 part should be put into raid5-log.h
> 
>>  #endif
>> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
>> index fe2112810c43..2c28711cc5f1 100644
>> --- a/include/uapi/linux/raid/md_p.h
>> +++ b/include/uapi/linux/raid/md_p.h
>> @@ -398,4 +398,30 @@ struct r5l_meta_block {
>>  
>>  #define R5LOG_VERSION 0x1
>>  #define R5LOG_MAGIC 0x6433c509
>> +
>> +struct ppl_header_entry {
>> +	__le64 data_sector;	/* Raid sector of the new data */
>> +	__le32 pp_size;		/* Length of partial parity */
>> +	__le32 data_size;	/* Length of data */
>> +	__le32 parity_disk;	/* Member disk containing parity */
>> +	__le32 checksum;	/* Checksum of this entry */
> 
> So we changed the disk format. Will this new format be put into the IMSM standard?
> Do we have a version to verify the old/new format for IMSM?

Yes, this will be included in IMSM. The platforms with support for this
are not yet released, so minor changes to the format were still
possible. This can be considered the first version of the PPL metadata
format. In the future, if any changes are needed, a version field could
for example be inserted in place of ppl_header.padding, which is
currently always set to 0.
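
For example, something like this (purely hypothetical at this point,
reusing the current field layout):

struct ppl_header {
	__u8	reserved[PPL_HDR_RESERVED]; /* reserved space, fill with 0xff */
	__le32	signature;	/* signature (raid array identifier) */
	__le32	version;	/* was "padding", written as 0 so far */
	__le64	generation;	/* generation number of the header */
	__le32	entries_count;	/* number of entries in entry array */
	__le32	checksum;	/* checksum of the header (~crc32c) */
	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
} __attribute__ ((__packed__));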

Thanks,
Artur