Re: [PATCH] MD: add doc for raid5-cache

> On Jan 31, 2017, at 11:18 AM, Shaohua Li <shli@xxxxxx> wrote:
> 
> I'm starting document of the raid5-cache feature. Please let me know
> what else we should put into the document. Of course, comments are
> welcome!
> 
> Signed-off-by: Shaohua Li <shli@xxxxxx>
> ---
> Documentation/md/raid5-cache.txt | 99 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 99 insertions(+)
> create mode 100644 Documentation/md/raid5-cache.txt
> 
> diff --git a/Documentation/md/raid5-cache.txt b/Documentation/md/raid5-cache.txt
> new file mode 100644
> index 0000000..17a6279
> --- /dev/null
> +++ b/Documentation/md/raid5-cache.txt
> @@ -0,0 +1,99 @@
> +RAID5 cache
> +
> +A RAID 4/5/6 array can include an extra disk as a data cache. The cache
> +can operate in write-through or write-back mode. mdadm has a new option,
> +'--write-journal', to create an array with a cache (an example command is
> +shown below). By default, when the array starts, the cache is in
> +write-through mode. The user can switch it to write-back mode with:
> +
> +echo "write-back" > /sys/block/md0/md/journal_mode
> +
> +And switch it back to write-through mode with:
> +
> +echo "write-through" > /sys/block/md0/md/journal_mode
> +
> +In both modes, all writes to the array hit the cache disk first. This
> +means the cache disk must be fast and able to sustain the full write load
> +(keep this in mind if you use an SSD as the cache).
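> +
> +As an illustration, an array with a cache could be created like this (the
> +device names here are hypothetical):
> +
> +mdadm --create /dev/md0 --level=5 --raid-devices=4 \
> +      /dev/sda /dev/sdb /dev/sdc /dev/sdd --write-journal /dev/sde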
> +
> +-------------------------------------
> +write-through mode:
> +
> +This mode mainly closes the 'write hole' issue. In a RAID 4/5/6 array, an
> +unclean shutdown can leave some stripes in an inconsistent state, e.g.,
> +data and parity don't match. The reason is that a stripe write involves
> +several raid disks, and it's possible that the writes haven't hit all of
> +them yet when the unclean shutdown occurs. After an unclean shutdown, MD
> +tries to 'resync' the array to put all stripes back into a consistent
> +state. During the resync, any disk failure will cause real data
> +corruption, because the failed disk must then be reconstructed from
> +inconsistent data and parity. This problem is called the 'write hole'. So
> +the 'write hole' window lies between the unclean shutdown and the
> +'resync', and this window isn't big. On the other hand, if one disk fails,
> +other disks could fail soon after, which sometimes happens when the disks
> +are from the same vendor and were manufactured at the same time. This
> +increases the chance of hitting the 'write hole', but overall the chance
> +isn't big, so don't panic even if you are not using a cache disk. A toy
> +illustration follows.
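> +
> +A toy illustration of the write hole, using 1-byte 'blocks' and shell
> +arithmetic (a sketch of the failure mode, not how MD stores anything):
> +
> +D0=0xaa; D1=0x55; P=$(( D0 ^ D1 ))  # consistent stripe: P = D0 xor D1
> +D0=0xbb                             # D0 is rewritten, then an unclean
> +                                    # shutdown hits before P is updated
> +printf '0x%x\n' $(( D0 ^ P ))       # if D1's disk now fails, rebuilding
> +                                    # it from D0 and the stale P yields
> +                                    # 0x44, not the real D1 (0x55)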
> +
> +The write-through cache writes all data to the cache disk first. Only
> +after the data has hit the cache disk is it flushed to the RAID disks.
> +This two-step write guarantees that MD can recover correct data after an
> +unclean shutdown, even with a disk failure. Thus the cache closes the
> +'write hole'.
> +
> +In write-through mode, MD reports IO completion to the upper layer
> +(usually a filesystem) only after the data hits the RAID disks, so a
> +cache disk failure doesn't cause data loss. Of course, a cache disk
> +failure means the array is exposed to the 'write hole' again.
> +
> +--------------------------------------
> +write-back mode:
> +
> +write-back mode fixes the 'write hole' issue too, since all write data is
> +cached on the cache disk. But the main goal of the 'write-back' cache is
> +to speed up writes. If a write covers all raid disks of a stripe, we call
> +it a full-stripe write. For a non-full-stripe write, MD must do a
> +read-modify-write: the extra reads (for the data on the other disks) and
> +the extra write (for the parity) introduce a lot of overhead. Writes
> +which are sequential but not dispatched at the same time suffer from this
> +overhead too. The write-back cache aggregates such data and only flushes
> +it to the raid disks once it amounts to a full-stripe write. This
> +completely avoids the overhead, so it's very helpful for some workloads.
> +A typical example is a workload that does sequential writes followed by
> +fsync.
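> +
> +A back-of-the-envelope illustration with hypothetical numbers (a 5-disk
> +raid5 array with a 512KiB chunk size):
> +
> +echo $(( (5 - 1) * 512 ))   # KiB of data in one full stripe: writes are
> +                            # aggregated until they cover all 2048KiB, so
> +                            # the parity can be computed with no reads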
> +
> +In write-back mode, MD reports IO completion to the upper layer (usually
> +a filesystem) right after the data hits the cache disk. The data is
> +flushed to the raid disks later, after certain conditions are met. So a
> +cache disk failure will cause data loss.
> +
> +--------------------------------------
> +The implementation:
> +
> +The write-through and write-back caches use the same disk format. The
> +cache disk is organized as a simple write log. The log consists of 'meta
> +data' and 'data' pairs. The meta data describes the data; it also
> +includes a checksum and a sequence ID for recovery identification. The
> +data can be IO data or parity data, and it is checksummed too; the
> +checksum is stored in the meta data ahead of the data. The checksum is an
> +optimization, because it lets MD write meta data and data freely without
> +worrying about their ordering. The MD superblock has a field that points
> +to the valid meta data at the log head.
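> +
> +Conceptually, the log looks roughly like this (an illustrative sketch,
> +not the exact on-disk layout):
> +
> +| meta (seq=N, checksums) | data ... | meta (seq=N+1, checksums) | data ...
> +  ^
> +  the MD superblock points at the log head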
> +
> +The log implementation is pretty straightforward. The difficult part is
> +the order in which MD writes data to the cache disk and the raid disks.
> +Specifically, in write-through mode, MD calculates the parity for the IO
> +data, writes both the IO data and the parity to the log, writes the data
> +and parity to the raid disks after they have settled down in the log, and
> +finally completes the IO. Reads just go to the raid disks as usual. The
> +sequence is sketched below.
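> +
> +Roughly (a restatement of the ordering described above):
> +
> +1. calculate parity for the IO data
> +2. write the IO data and the parity to the log
> +3. wait until the data and parity are settled down in the log
> +4. write the data and parity to the raid disks
> +5. report IO completion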
> +
> +In write-back mode, MD writes the IO data to the log and reports IO
> +completion. The data is also fully cached in memory at that time, which
> +means reads must query the memory cache. When certain conditions are met,
> +MD flushes the data to the raid disks: it calculates the parity for the
> +data and writes the parity into the log; after that is finished, it
> +writes both the data and the parity to the raid disks, and can then
> +release the memory cache. The flush is triggered when a stripe becomes a
> +full-stripe write, when free cache disk space is low, or when in-kernel
> +memory cache space is low. The sequence is sketched below.
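> +
> +Roughly (a sketch; steps 1-2 happen at IO time, 3-6 at flush time):
> +
> +1. write the IO data to the log, keep it in the memory cache
> +2. report IO completion
> +3. when a flush condition is met, calculate parity for the data
> +4. write the parity into the log
> +5. write the data and parity to the raid disks
> +6. release the memory cache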
> +
> +After an unclean shutdown, MD performs recovery. MD reads all the meta
> +data and data from the log. The sequence IDs and checksums help detect
> +corrupted meta data and data. If MD finds a stripe with data and valid
> +parities (1 parity for raid4/5 and 2 for raid6), MD writes the data and
> +parities to the raid disks. If the parities are incomplete, they are
> +discarded. If part of the data is corrupted, it is discarded too. MD then
> +loads the valid data and writes it to the raid disks in the normal way. A
> +rough sketch of the recovery loop follows.
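> +
> +In pseudocode (a rough sketch of the recovery logic described above):
> +
> +for each meta/data pair in the log, in sequence ID order:
> +    drop it if the sequence ID or a checksum doesn't match
> +for each stripe assembled from the log:
> +    if its data and all its parities are valid:
> +        write the data and parities to the raid disks
> +    else:
> +        discard incomplete parities and corrupted data, then load the
> +        remaining valid data and write it out the normal way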
> -- 
> 2.9.3
> 

Looks great!

Reviewed-by: Song Liu <songliubraving@xxxxxx>





