Re: [PATCH v3 3/8] raid5: A caching layer for RAID5/6

On Wed, 3 Jun 2015 15:48:38 -0700 Shaohua Li <shli@xxxxxx> wrote:

Hi,
 sorry for the delay in getting to this.

3000 lines of code is rather hard to review - especially with so few
comments :-)
There seem to be a number of fairly well-defined modules in the code,
such as those managing the various data structures.  Maybe if these
arrived one per patch it would be easier to review them individually.


> The main goal of the caching layer is to aggregate IO writes, hopefully
> turning them into full-stripe writes, and to fix the write hole issue.
> This might speed up reads too, but it's not optimized for reads, e.g. we
> don't proactively cache data for reads. The aggregation makes a lot of
> sense for workloads which sequentially write to several files, with or
> without fsync. Such workloads are popular in today's datacenters.
> 
> Write IO data will be written to a cache disk (SSD) first, and later
> the data will be flushed to the raid disks.
> 
> The cache disk will be organized as a simple ring buffer log. For IO
> data, a tuple (raid_sector, io length, checksum, data) will be appended
> to the log; for raid 5/6 parity, a tuple (stripe_sector, parity length,
> checksum, parity data) will be appended to the log. We don't have an
> on-disk index for the data appended to the log. So either we can rebuild
> an in-memory index at startup by scanning the whole log, or we can flush
> all data from the cache disk to the raid disks at shutdown so the cache
> disk has no valid data. The current code chooses the second option, but
> this can be easily changed.
> 
> We have simple metadata for the above tuples. It's essentially a
> tuple of (sequence, metadata length).

How are these tuples stored in the log?  One tuple per block?  A block
with lots of tuples followed by all the described data?  Or does the
block of tuples follow the data?


>                                    Crash recovery or startup will scan
> the log to read the metadata and rebuild the in-memory index. If the
> metadata is broken at the head of the log, even if the metadata after it
> is ok, the scanning will not work well. So we take some steps to
> mitigate the issue:
> -A flush request is sent to cache disk every second to relieve the issue

I don't feel at all comfortable about the every-second flush.
I can certainly understand a flush after 50% of the log space is
written, or even 25%.  But time-based flushes should be coming from the
filesystem.



> -FLUSH/FUA requests are carefully handled. A FLUSH/FUA will only be
> dispatched after all previous requests finish.

This certainly makes sense.

> 
> The in-memory index is a simple list of io_range (sequence, metadata
> sector, data sector, length). The list is ordered by sequence. The first
> io_range entry's metadata sector is the tail of the log. There is also a
> struct to track io ranges within a stripe. All stripes will be organized
> as a radix tree.

It would be useful here to briefly say how these data structures are
used.
I assumed the order-by-sequence list is to flush data to RAID when the
log is getting full?

The radix-tree tracks where in the log each stripe is - is that correct?
So it is easy to find the stripe at the tail of the list to flush it
out.
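
In other words I'm picturing lookups roughly like this - a hypothetical
helper, though it only uses the stripe_tree, tree_lock, raid_index and
ref fields that the patch already defines:

/* Hypothetical sketch of how I read the data structures: the radix
 * tree maps a stripe's raid_index to its r5c_stripe, and each stripe
 * keeps its cached ranges on io_ranges in log order. */
static struct r5c_stripe *r5c_find_stripe_guess(struct r5c_cache *cache,
						u64 raid_index)
{
	struct r5c_stripe *stripe;

	spin_lock(&cache->tree_lock);
	stripe = radix_tree_lookup(&cache->stripe_tree, raid_index);
	if (stripe)
		atomic_inc(&stripe->ref);
	spin_unlock(&cache->tree_lock);
	return stripe;
}

If that is right, a comment to that effect next to the structure
definitions would save the next reader a lot of guessing.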

> 
> All IO data will also be copied to a memory pool for caching until the
> data is flushed to the raid disks. This is just to simplify the
> implementation; it's not mandated. In the future, the flush could copy
> data from the cache disk to the raid disks, so the memory pool could be
> discarded. Once a stripe is flushed to the raid disks, the memory of the
> stripe can be reused.

Again, why a separate memory pool rather than just leaving the data in
the stripe cache until it is safe in the RAID?


> 
> We have two limited resources here: the memory pool and cache disk
> space. If either resource is tight, we will do reclaim. In either case,
> we will flush some data from the cache disk to the raid disks. However,
> we implement different strategies. For memory reclaim, we prefer
> reclaiming full stripes. For cache disk space reclaim, we prefer
> reclaiming the io_range entries at the head of the index list.

Wouldn't full stripes be scheduled to the RAID immediately they become
full (or immediately after the parity is safe in the log)?
So for memory reclaim it would make sense to prefer stripes that have
been idle for a long time - so an LRU list.
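
Something along these lines, say - the 'lru' field in r5c_stripe looks
like it is meant for this, though the cache-wide list used here is my
assumption, not something the patch defines:

/* Sketch only: keep recently-written stripes at the tail of an LRU and
 * reclaim from the head.  'lru_list' is an assumed field on the cache,
 * not one that exists in the patch as posted. */
static void r5c_stripe_touch_guess(struct r5c_cache *cache,
				   struct r5c_stripe *stripe)
{
	spin_lock(&cache->tree_lock);
	list_move_tail(&stripe->lru, &cache->lru_list);
	spin_unlock(&cache->tree_lock);
}

Memory reclaim then walks that list from the head and only touches
recently-used stripes when it really has to.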

Certainly when the log approaches full you need to start flushing
things at the start of the log.

> 
> If the cache disk has an IO error, since all data is in the memory
> pool, we will flush all data to the raid disks and fall back to
> no-cache-disk mode. An IO error on the cache disk doesn't corrupt any
> data. After some time, we will try to use the cache disk again if the
> disk is ok. The retry will stop after several rounds of failure.

The normal data flow involves lots of writing to the cache device and
very little if any reading.  So we will probably need some sort of
"scrub" process to read and verify the log occasionally, just to be
sure that reads still work.  That could be largely done in user-space.

> 
> We always do reclaim in stripe units. Reclaim could create holes in the
> log, e.g. some io_range in the middle is reclaimed, but an io_range at
> the head remains. So the index list entries don't always have continuous
> sequence numbers. But this doesn't matter; the first io_range is always
> the log tail. The superblock has a field pointing to the position of the
> log tail. The holes can waste a lot of disk space though. In the future,
> we can implement garbage collection to mitigate the issue, e.g. copy
> data from the index tail to the head.

I doubt there would be value in garbage collection.  Just flush the old
stripes to the RAID.  Maybe I'm wrong, but I certainly agree that it
isn't a priority.

> 
> In the process of reclaim flushing data to the raid disks, stripe
> parity will be appended to the cache log. Parity is always appended
> after its corresponding data. After the parity is in the cache disk, a
> flush_start block is appended to the log, which indicates the stripe is
> being flushed to the raid disks. Writing data to the raid disks only
> happens after all data and parity are already in the cache disk. This
> fixes the write hole issue. After a stripe is flushed to the raid disks,
> a flush_end block is appended to the log, which indicates the stripe has
> settled down in the raid disks.

I'm not sure that a separate "flush start" block is needed. Once all
the parity blocks have been written we can assume that the flush has
started.
A "flush-end" block does make sense, though I would think of it as an
'invalidate' block in that it invalidates specific previous data blocks.
Same concept though.


> 
> Recovery relies on the flush_start and flush_end blocks. If recovery
> finds the data and parity of a stripe, the flush start/end blocks will
> be used to determine which stage of the flush the stripe is in. If the
> stripe is listed in a flush_end block, the stripe is in the raid disks,
> and all data and parity of the stripe can be discarded. If the stripe
> isn't listed in a flush_start block, the stripe hasn't started flushing
> to raid yet, and its parity can be ignored. Otherwise, the stripe is in
> the middle of flushing to the raid disks. Since we have both the data
> and the parity, recovery just rewrites them to the raid disks.
> 
> IO write code path:
> 1. copy bio data to stripe memory pages
> 2. append metadata and data to cache log
> 3. IO write endio
> 
> reclaim code path:
> 1. select stripe to reclaim
> 2. write all stripe data to raid disk
> 3. in raid5 ops_run_io, append metadata and parity data to cache log.
>     ops_run_io doesn't write data/parity to raid disks at this time
> 4. flush cache disk and write flush_start block
> 5. ops_run_io continues. data/parity will be written to raid disks
> 6. flush all raid disks cache
> 7. write flush_end block
> 8. delete in-memory index of the stripe, and advance superblock log checkpoint

I find this a bit confusing.  Step 2 writes to the raid disk, but step
3 says it doesn't write to the raid disk yet.  It also doesn't tell me
when parity is calculated.

I imagine:

1. select stripe to reclaim
2. read missing data or parity block so new parity can be calculated.
3. calculate parity and write to log - this records that the flush has
   started.
4. flush cache device - probably multiple stripes will get up to this
   step, and then a single flush will be performed for all of them.
5. schedule writes to the RAID disks, both data and parity.
6. flush RAID disks - again, wait for lots of stripes to reach this
   point, then perform a single flush
7. write flush_end / invalidate block for all of the flushed stripes
8. delete in-memory index and advance superblock log checkpoint.
   flush the invalidate blocks before writing the superblock.

If you get '4' for one set of stripes to line up with '8' for the
previous set of stripes, then you can combine the two 'flush'
operations to the cache device.
So the cache device sees:
  FLUSH, superblock update, new parity writes, flush_end of older writes, FLUSH
with data writes sprinkled through wherever needed.

> 
> Recovery:
> A crash in the IO write code path doesn't need recovery. If the data
> and checksum don't match, the data will be ignored, so reads will return
> old data. In the reclaim code path, a crash before step 4 doesn't need
> recovery as data/parity haven't touched the raid disks yet. The parity
> can be ignored too. A crash after step 7 doesn't need recovery either,
> as the stripe is fully flushed to the raid disks. A crash between steps
> 4 and 7 needs recovery. The data and parity in the log will be written
> to the raid disks.

So recovery involves reading everything in the log, adding data and
parity to the stripe cache as it is found, invalidating any entries
when an 'invalidate' block is found.
Then any stripe that has all required parity is written to the RAID,
and any other stripe is discarded.
Then the log is emptied and we start from scratch.

> 
> The performance of the raid will largely be determined by the reclaim
> speed at run time. Several stages of the reclaim process involve IO
> waits or disk cache flushes, which significantly impact raid disk
> utilization and performance. The flush_start and flush_end blocks make
> running multiple reclaims possible. Each reclaim only records the
> stripes it is reclaiming in its flush start/end blocks. Recovery can use
> that information to correctly determine a stripe's flush stage. An
> example with 2 reclaimers:
> 
> xxx1 belongs to stripe1 and the same for stripe2
> 
> reclaim 1:  |data1|...|parity1|...|flush_start1|...|flush_end1|
> reclaim 2:      |data2|..|parity2|...............|flush_start2|...|flush_end2|
> 
> Each reclaim only records its own stripes in its flush blocks. If, for
> example, recovery finds flush_start1, it knows stripe1 is flushing to
> the raid disks. Recovery will ignore stripe2, since stripe2 isn't in
> flush_start1.
> 
> Multiple reclaims will efficiently solve the performance issue. The
> current code hasn't added multiple reclaims yet, but they will certainly
> be added soon.

Maybe this is what you have in mind, but I would use a state-machine
approach for the reclaim, i.e.:
    schedule a collection of stripes for parity calculation;
    as they complete, schedule the writes to cache;
    when there are no pending writes to cache, schedule a flush;
    etc.
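
In code terms, something like this - a very rough sketch, with every
name below invented rather than taken from the patch:

/* Very rough state-machine sketch; all names invented for
 * illustration.  A batch of stripes moves through the stages together
 * so that each stage needs only one flush. */
enum reclaim_stage {
	RECLAIM_CALC_PARITY,	/* parity being computed */
	RECLAIM_LOG_PARITY,	/* parity writes to the log in flight */
	RECLAIM_LOG_FLUSH,	/* one cache-device flush for the batch */
	RECLAIM_RAID_WRITE,	/* data+parity writes to the RAID in flight */
	RECLAIM_RAID_FLUSH,	/* one RAID-device flush for the batch */
	RECLAIM_INVALIDATE,	/* flush_end/invalidate block written */
};

struct reclaim_batch {
	enum reclaim_stage stage;
	struct list_head stripes;	/* stripes moving through together */
	atomic_t pending;		/* IOs outstanding in this stage */
};

/* Called from each IO's endio; the last completion in a stage kicks
 * the batch on to the next stage. */
static void reclaim_batch_step(struct reclaim_batch *batch)
{
	if (!atomic_dec_and_test(&batch->pending))
		return;
	batch->stage++;
	/* ...submit whatever IO or flush the new stage requires... */
}

That keeps the reclaim thread from sleeping in the middle of every
stripe and makes it natural to batch the flushes across stripes.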


> 
> V3:
> -fix a use after free bug. fix bio to cache disk crosses disk boundary
> -adjust memory watermark
> V2:
> -fix bugs and code clean up
> -log meta data write isn't FUA any more
> -discard only runs when the discard area is big enough
> 
> Signed-off-by: Shaohua Li <shli@xxxxxx>
> ---
>  drivers/md/Makefile            |    2 +-
>  drivers/md/raid5-cache.c       | 3246 ++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c             |   69 +-
>  drivers/md/raid5.h             |   15 +-
>  include/uapi/linux/raid/md_p.h |   72 +
>  5 files changed, 3392 insertions(+), 12 deletions(-)
>  create mode 100644 drivers/md/raid5-cache.c
> 
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index dba4db5..aeb4330 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -16,7 +16,7 @@ dm-cache-mq-y   += dm-cache-policy-mq.o
>  dm-cache-cleaner-y += dm-cache-policy-cleaner.o
>  dm-era-y	+= dm-era-target.o
>  md-mod-y	+= md.o bitmap.o
> -raid456-y	+= raid5.o
> +raid456-y	+= raid5.o raid5-cache.o
>  
>  # Note: link order is important.  All raid personalities
>  # and must come before md.o, as they each initialise 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> new file mode 100644
> index 0000000..c21d2f2
> --- /dev/null
> +++ b/drivers/md/raid5-cache.c
> @@ -0,0 +1,3246 @@
> +/*
> + * A caching layer for raid 4/5/6
> + *
> + * Copyright (C) 2015 Shaohua Li <shli@xxxxxx>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
> + */
> +#include <linux/kernel.h>
> +#include <linux/wait.h>
> +#include <linux/blkdev.h>
> +#include <linux/slab.h>
> +#include <linux/raid/md_p.h>
> +#include <linux/crc32.h>
> +#include <linux/list_sort.h>
> +#include "md.h"
> +#include "raid5.h"
> +
> +/* the log disk cache will be flushed forcely every second */
> +#define LOG_FLUSH_TIME HZ
> +#define LOG_SUPER_WRITE_TIME (5 * HZ)
> +
> +#define MAX_MEM (256 * 1024 * 1024)
> +#define RECLAIM_BATCH 16
> +#define CHECKPOINT_TIMEOUT (5 * HZ)
> +
> +#define MAX_RETRY 10
> +#define IOERR_RETRY_TIME (5 * 60 * HZ)
> +#define MEMERR_RETRY_TIME (60 * HZ)
> +
> +typedef u64 r5blk_t; /* log blocks, could be 512B - 4k */

Why isn't this just "sector_t" ??


> +
> +struct r5l_log {
> +	struct r5c_cache *cache;

It isn't immediately clear why the "log" is separate from the "cache".
If there is a good reason, then a comment with that reason here would
be very helpful.

> +	struct block_device *bdev;
> +	struct md_rdev *rdev;

Is there a reason for storing both of these instead of just 'rdev' and
using 'rdev->bdev' when that is needed?


> +
> +	struct page *super_page;
> +	u8 uuid[16];
> +	u32 uuid_checksum_data;
> +	u32 uuid_checksum_meta;
> +
> +	unsigned int block_size; /* bytes */
> +	unsigned int block_sector_shift;
> +	unsigned int page_block_shift;
> +	unsigned int stripe_data_size; /* sector */
> +	unsigned int chunk_size; /* sector */
> +	unsigned int stripe_size; /* sector */
> +	unsigned int parity_disks;

Some of these are obvious, some are not.
Some more specific comments would help.

> +
> +	r5blk_t total_blocks;
> +	r5blk_t first_block;
> +	r5blk_t last_block;
> +
> +	r5blk_t low_watermark; /* For disk space */
> +	r5blk_t high_watermark;
> +
> +	r5blk_t last_checkpoint;
> +	r5blk_t last_freepoint;
> +	r5blk_t super_point;
> +	u64 last_cp_seq;
> +	u64 last_freeseq;
> +	unsigned long last_cp_time;
> +	struct work_struct discard_work;
> +
> +	int do_discard;
> +
> +	u64 seq; /* get after read log */
> +	r5blk_t log_start; /* get after read log */
> +
> +	u8 data_checksum_type;
> +	u8 meta_checksum_type;
> +
> +	unsigned int reserved_blocks;
> +	wait_queue_head_t space_waitq;
> +
> +	struct mutex io_mutex;
> +	struct r5l_io_unit *current_io;
> +
> +	spinlock_t io_list_lock;
> +	struct list_head running_ios; /* order is important */
> +
> +	struct list_head task_list;
> +	struct list_head parity_task_list;
> +	spinlock_t task_lock;
> +
> +	struct kmem_cache *io_kc;
> +	mempool_t *io_pool;
> +	struct bio_set *bio_set;
> +
> +	unsigned long next_flush_time;
> +	struct work_struct flush_work;
> +};
> +
> +struct r5l_task;
> +/* end function must free task */
> +typedef void (r5l_task_end_fn)(struct r5l_task *task, int error);
> +struct r5l_task {

A comment here telling me what these tasks do would help me understand
the data structure.


> +	struct list_head list;
> +	int type;
> +	struct bio *bio;
> +	union {
> +		struct bio *orig_bio;
> +		struct {
> +			sector_t stripe_sector;
> +			struct page *page_p;
> +			struct page *page_q;
> +		};
> +	};
> +	/* tasks in a single r5l_io_unit will have the same seq and meta_start */
> +	sector_t meta_start;
> +	sector_t data_start;
> +	u64 seq;
> +	r5l_task_end_fn *fn;
> +	void *private;
> +	unsigned int reserved_blocks;
> +	u32 checksum[];
> +};
> +
> +/*
> + * Data and metadata in log is written with normal IO write. A power failure
> + * can cause data loss. To relieve the issue, log disk cache will be flushed
> + * forcely every second.
> + *
> + * Note IO is finished out of order. If metadata corrupts in the middle,
> + * recovery can't work well even metadata/data at the tail is good. IO in log
> + * tail could finish earlier than IO ahead, so we must be very careful to
> + * handle FLUSH/FUA bio.
> + *
> + * FLUSH bio: the bio will be dispatched after all previous IO finish. FLUSH
> + * syntax doesn't require pending IO finish, but pending IO might be a metadata
> + * in the middle of log, we must force this order.
> + *
> + * FUA bio: same like FLUSH bio, previous meta IO must be finished before
> + * dispatching this bio. To simplify implementation, we wait all previous IO.
> + * And we must add a FLUSH to make sure previous IO hit disk media. metadata
> + * and the bio itself must be written with FUA.
> + * */
> +struct r5l_io_unit {
> +	struct r5l_log *log;
> +	struct list_head log_sibling;
> +
> +	struct page *meta_page;
> +	sector_t meta_sector;
> +	int meta_offset;
> +	u64 seq;
> +	struct bio *meta_bio;
> +
> +	struct list_head tasks;
> +	struct bio *current_bio;
> +	atomic_t refcnt;
> +
> +	unsigned int has_flush:1; /* include flush request */
> +	unsigned int has_fua:1; /* include fua request */
> +	unsigned int has_null_flush:1; /* include empty flush request */
> +	/*
> +	 * io isn't sent yet, flush/fua request can only be submitted till it's
> +	 * the first IO in running_ios list
> +	 * */
> +	unsigned int io_deferred:1;
> +	int error;
> +};
> +
> +struct r5c_io_range {
> +	struct list_head log_sibling;
> +	struct list_head stripe_sibling;
> +
> +	u64 seq;
> +
> +	sector_t meta_start; /* cache position */
> +	sector_t data_start; /* cache position */
> +	sector_t raid_start;
> +	unsigned int data_sectors;
> +
> +	struct r5c_stripe *stripe;
> +	union {
> +		struct bio *bio;
> +		u32 *checksum; /* only for recovery */
> +	};
> +};
> +
> +struct r5c_stripe {
> +	u64 raid_index;
> +	struct r5c_cache *cache;
> +	atomic_t ref;
> +	int state;
> +	int recovery_state; /* just for recovery */
> +
> +	struct list_head io_ranges; /* order list */
> +	union {
> +		struct list_head stripes;
> +		struct list_head parity_list; /* just for recovery */
> +	};
> +
> +	struct list_head lru;
> +
> +	int existing_pages;
> +	atomic_t dirty_stripes;
> +	atomic_t pending_bios;
> +	struct page **parity_pages; /* just for recovery */
> +	struct page *data_pages[];
> +};
> +
> +enum {
> +	STRIPE_RUNNING = 0,
> +	STRIPE_FROZEN = 1, /* Doesn't accept new IO */
> +	STRIPE_PARITY_DONE = 2,
> +	STRIPE_INRAID = 3,
> +	STRIPE_DEAD = 4,
> +
> +	RECOVERY_NO_FLUSH = 0,
> +	RECOVERY_FLUSH_START = 1, /* stripe in a start flush block */
> +	RECOVERY_FLUSH_END = 2,
> +};

I wouldn't be surprised if some compilers rejected that.  Two separate
enums would be better.
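
e.g. something like (the enum names are just a suggestion):

enum r5c_stripe_state {
	STRIPE_RUNNING = 0,
	STRIPE_FROZEN = 1, /* Doesn't accept new IO */
	STRIPE_PARITY_DONE = 2,
	STRIPE_INRAID = 3,
	STRIPE_DEAD = 4,
};

enum r5c_recovery_state {
	RECOVERY_NO_FLUSH = 0,
	RECOVERY_FLUSH_START = 1, /* stripe in a start flush block */
	RECOVERY_FLUSH_END = 2,
};

That would also let 'state' and 'recovery_state' in r5c_stripe carry
the enum types instead of plain int, which documents them for free.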


> +
> +#define STRIPE_LOCK_BITS 8
> +struct r5c_cache {
> +	struct mddev *mddev;
> +	struct md_rdev *rdev;
> +
> +	struct r5l_log log;
> +

Ahh - the log is embedded in the cache.
That probably makes sense.  But you don't need a pointer from the log
to the cache, just use container_of().
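
e.g. something like this, assuming the embedded member keeps its
current name:

static inline struct r5c_cache *r5l_log_to_cache(struct r5l_log *log)
{
	return container_of(log, struct r5c_cache, log);
}

That is one less pointer to keep consistent, and it makes the
ownership obvious.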



> +	spinlock_t tree_lock; /* protect stripe_tree, log_list, full_stripes */
> +	struct radix_tree_root stripe_tree;
> +	struct list_head log_list; /* sorted list of io_range */
> +	struct list_head full_stripes;
> +	int full_stripe_cnt;
> +
> +	struct list_head page_pool;
> +	spinlock_t pool_lock;
> +	u64 free_pages;
> +	u64 total_pages;
> +	u64 max_pages;
> +	u64 low_watermark; /* for memory, pages */
> +	u64 high_watermark;
> +
> +	unsigned int stripe_data_size; /* stripe size excluding parity, sector */
> +	unsigned int chunk_size; /* one disk chunk size including parity, sector */
> +	unsigned int stripe_size; /* stripe size including parity, sector */
> +	unsigned int parity_disks;
> +	unsigned int stripe_data_pages;
> +	unsigned int stripe_parity_pages;
> +
> +	unsigned int reclaim_batch;
> +
> +	unsigned int reserved_space; /* log reserved size, sector */
> +
> +	unsigned long reclaim_reason;
> +	wait_queue_head_t reclaim_wait;
> +	struct md_thread *reclaim_thread;
> +	__le64 *stripe_flush_data;
> +	int quiesce_state;
> +
> +	int in_recovery;
> +
> +	struct work_struct pending_io_work;
> +
> +	spinlock_t stripe_locks[1 << STRIPE_LOCK_BITS];
> +	wait_queue_head_t stripe_waitq[1 << STRIPE_LOCK_BITS];
> +
> +	int error_state;
> +	int error_type;
> +	int retry_cnt;
> +	unsigned long next_retry_time;
> +	struct bio_list retry_bio_list;
> +	wait_queue_head_t error_wait;
> +
> +	struct kmem_cache *io_range_kc;
> +	struct kmem_cache *stripe_kc;
> +	struct bio_set *bio_set;
> +};
> +
> +enum {
> +	RECLAIM_MEM = 0, /* work hard to reclaim memory */
> +	RECLAIM_MEM_BACKGROUND = 1, /* try to reclaim memory */
> +	RECLAIM_MEM_FULL = 2, /* only reclaim full stripe */
> +	RECLAIM_DISK = 8, /* work hard to reclaim disk */
> +	RECLAIM_DISK_BACKGROUND = 9, /* try to reclaim disk */
> +	RECLAIM_FLUSH_ALL = 16, /* flush all data to raid */
> +
> +	QUIESCE_NONE = 0,
> +	QUIESCE_START = 1,
> +	QUIESCE_END = 2,
> +
> +	ERROR_NOERROR = 0,
> +	ERROR_PREPARE = 1, /* Had an error, flushing cache to raid */
> +	ERROR_FINISH = 2, /* Had an error, cache has no data */
> +};

Three enums here :-)


> +
> +#define PAGE_SECTOR_SHIFT (PAGE_SHIFT - 9)

Maybe this should go in md.h - raid1.c, raid10.c and raid5.c all use
something like it.

... and that's as far as I got.
I really need this to be a bit more like md-cluster.c: add modules one
at a time which I can understand and review as units.

Thanks,
NeilBrown


