Re: [PATCH] MD: add doc for raid5-cache

Jure Erznožnik <jure.erznoznik@xxxxxxxxx> · Thu, 2 Feb 2017 07:54:51 +0100

If I may, I'd also like to see the following in the manual:

1. Instructions on how to set up the cache. So far I have seen how to
change mode, but not how to even get to the part where you can (change
the mode)
2. List of all tweaking parameters with descriptions on what they do

Thanks for the fine work!

LP,
Jure

On Thu, Feb 2, 2017 at 7:33 AM, Ram Ramesh <rramesh2400@xxxxxxxxx> wrote:
> On 01/31/2017 01:18 PM, Shaohua Li wrote:
>>
>> I'm starting document of the raid5-cache feature. Please let me know
>> what else we should put into the document. Of course, comments are
>> welcome!
>>
>> Signed-off-by: Shaohua Li <shli@xxxxxx>
>> ---
>>   Documentation/md/raid5-cache.txt | 99
>> ++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 99 insertions(+)
>>   create mode 100644 Documentation/md/raid5-cache.txt
>>
>> diff --git a/Documentation/md/raid5-cache.txt
>> b/Documentation/md/raid5-cache.txt
>> new file mode 100644
>> index 0000000..17a6279
>> --- /dev/null
>> +++ b/Documentation/md/raid5-cache.txt
>> @@ -0,0 +1,99 @@
>> +RAID5 cache
>> +
>> +Raid 4/5/6 could include an extra disk for data cache. The cache could be
>> +in write-through or write-back mode. mdadm has a new option
>> +'--write-journal' to create array with cache. By default (raid array
>> +starts), the cache is in write-through mode. User can switch it to
>> +write-back mode by:
>> +
>> +echo "write-back" > /sys/block/md0/md/journal_mode
>> +
>> +And switch it back to write-through mode by:
>> +
>> +echo "write-through" > /sys/block/md0/md/journal_mode
>> +
>> +In both modes, all writes to the array will hit cache disk first. This
>> means
>> +the cache disk must be fast and sustainable (if you use a SSD as the
>> cache).
>> +
>> +-------------------------------------
>> +write-through mode:
>> +
>> +This mode mainly fixes 'write hole' issue. For RAID 4/5/6 array, an
>> +unclean shutdown could cause data in some stripes is not in consistent
>> +state, eg, data and parity don't match. The reason is a stripe write
>> +involves several raid disks and it's possible writes don't hit all raid
>> +disks yet before the unclean shutdown. After an unclean shutdown, MD try
>> +to 'resync' the array to put all stripes back into consistent state. In
>> +the resync, any disk failure will cause real data corruption. This
>> problem
>> +is called 'write hole'. So the 'write hole' issue occurs between unclean
>> +shutdown and 'resync'. This window isn't big. On the other hand, if one
>> +disk fails, other disks could fail soon, which happens sometimes if the
>> +disks are from the same vendor and manufactured in the same time. This
>> +will increase the chance of 'write whole', but overall the chance isn't
>> +big, so don't panic even not using cache disk.
>> +
>> +The write-through cache will cache all data in cache disk first. Until
>> the
>> +data hits into the cache disk, the data is flushed into RAID disks. The
>> +two-step write will guarantee MD can recover correct data after unclean
>> +shutdown even with disk failure. Thus the cache can close the 'write
>> +hole'.
>> +
>> +In write-through mode, MD reports IO finish to upper layer (usually
>> +filesystems) till the data hits RAID disks, so cache disk failure doesn't
>> +cause data lost. Of course cache disk failure means the array is exposed
>> +into 'write hole' again.
>> +
>> +--------------------------------------
>> +write-back mode:
>> +
>> +write-back mode fixes the 'write hole' issue too, since all write data is
>> +cached in cache disk. But the main goal of 'write-back' cache is to speed
>> up
>> +write. If a write crosses all raid disks of a stripe, we call it
>> full-stripe
>> +write. For non-full-stripe write, MD must do a read-modify-write. The
>> extra
>> +read (for data in other disks) and write (for parity) introduce a lot of
>> +overhead. Some writes which are sequential but not dispatched in the same
>> time
>> +will suffer from this overhead too. write-back cache will aggregate the
>> data
>> +and flush the data to raid disks till the data becomes a full stripe
>> write.
>> +This will completely avoid the overhead, so it's very helpful for some
>> +workloads. A typical workload which does sequential write and follows
>> fsync is
>> +an example.
>> +
>> +In write-back mode, MD reports IO finish to upper layer (usually
>> filesystems)
>> +right after the data hit cache disk. The data is flushed to raid disks
>> later
>> +after specific conditions met. So cache disk failure will cause data
>> lost.
>> +
>> +--------------------------------------
>> +The implementation:
>> +
>> +The write-through and write-back cache use the same disk format. The
>> cache disk
>> +is organized as a simple write log. The log consists of 'meta data' and
>> 'data'
>> +pairs. The meta data describes the data. It also includes checksum and
>> sequence
>> +ID for recovery identification. Data could be IO data and parity data.
>> Data is
>> +checksumed too. The checksum is stored in the meta data ahead of the
>> data. The
>> +checksum is an optimization because MD can write meta and data freely
>> without
>> +worry about the order. MD superblock has a field pointed to the valid
>> meta data
>> +of log head.
>> +
>> +The log implementation is pretty straightforward. The difficult part is
>> the
>> +order MD write data to cache disk and raid disks. Specifically, in
>> +write-through mode, MD calculates parity for IO data, writes both IO data
>> and
>> +parity to the log, write the data and parity to raid disks after the data
>> and
>> +parity is settled down in log and finally the IO is finished. Read just
>> reads
>> +from raid disks as usual.
>> +
>> +In write-back mode, MD writes IO data to the log and reports IO finish.
>> The
>> +data is also fully cached in memory at that time, which means read must
>> query
>> +memory cache. If some conditions are met, MD will flush the data to raid
>> disks.
>> +MD will calculate parity for the data and write parity into the log.
>> After this
>> +is finished, MD will write both data and parity into raid disks, then MD
>> can
>> +release the memory cache. The flush conditions could be stripe becomes a
>> full
>> +stripe write, free cache disk space is low or in-kernel memory cache
>> space is
>> +low.
>> +
>> +After an unclean shutdown, MD does recovery. MD reads all meta data and
>> data
>> +from the log. The sequence ID and checksum will help us detect corrupted
>> meta
>> +data and data. If MD finds a stripe with data and valid parities (1
>> parity for
>> +raid4/5 and 2 for raid6), MD will write the data and parities to raid
>> disks. If
>> +parities are incompleted, they are discarded. If part of data is
>> corrupted,
>> +they are discarded too. MD then loads valid data and writes them to raid
>> disks
>> +in normal way.
>
>
> Which version of mdadm/kernel supports this feature? Is it already released
> or in the process?
>
> Ramesh
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html