I'm starting a document for the raid5-cache feature. Please let me know what
else we should put into the document. Of course, comments are welcome!

Signed-off-by: Shaohua Li <shli@xxxxxx>
---
 Documentation/md/raid5-cache.txt | 99 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)
 create mode 100644 Documentation/md/raid5-cache.txt

diff --git a/Documentation/md/raid5-cache.txt b/Documentation/md/raid5-cache.txt
new file mode 100644
index 0000000..17a6279
--- /dev/null
+++ b/Documentation/md/raid5-cache.txt
@@ -0,0 +1,99 @@

RAID5 cache

A RAID 4/5/6 array can include an extra disk used as a data cache. The cache
can run in write-through or write-back mode. mdadm has a new option,
'--write-journal', to create an array with a cache. By default (when the
array starts), the cache is in write-through mode. The user can switch it to
write-back mode with:

echo "write-back" > /sys/block/md0/md/journal_mode

and switch it back to write-through mode with:

echo "write-through" > /sys/block/md0/md/journal_mode

In both modes, all writes to the array hit the cache disk first. This means
the cache disk must be fast and able to sustain the write load (for example,
an SSD used as the cache).

-------------------------------------
write-through mode:

This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
unclean shutdown can leave the data in some stripes in an inconsistent
state, e.g. data and parity don't match. The reason is that a stripe write
involves several RAID disks, and it's possible that the writes haven't hit
all of them yet when the unclean shutdown happens. After an unclean
shutdown, MD tries to 'resync' the array to put all stripes back into a
consistent state. During the resync, any disk failure will cause real data
corruption. This problem is called the 'write hole'. So the 'write hole'
issue exists between an unclean shutdown and the 'resync'. This window
isn't big. On the other hand, if one disk fails, other disks could fail
soon, which sometimes happens when the disks are from the same vendor and
were manufactured at the same time. This increases the chance of hitting
the 'write hole', but overall the chance isn't big, so don't panic even if
you are not using a cache disk.

The write-through cache writes all data to the cache disk first. Only after
the data has hit the cache disk is it flushed to the RAID disks. This
two-step write guarantees that MD can recover correct data after an unclean
shutdown, even with a disk failure. Thus the cache closes the 'write hole'.

In write-through mode, MD reports IO completion to the upper layer (usually
a filesystem) only after the data hits the RAID disks, so a cache disk
failure doesn't cause data loss. Of course, a cache disk failure means the
array is exposed to the 'write hole' again.

--------------------------------------
write-back mode:

Write-back mode fixes the 'write hole' issue too, since all write data is
cached in the cache disk. But the main goal of the write-back cache is to
speed up writes. If a write covers all RAID disks of a stripe, we call it a
full-stripe write. For a non-full-stripe write, MD must do a
read-modify-write. The extra reads (for data on the other disks) and the
extra write (for parity) introduce a lot of overhead. Writes which are
sequential but not dispatched at the same time suffer from this overhead
too. The write-back cache aggregates such data and only flushes it to the
RAID disks once it has become a full-stripe write. This completely avoids
the overhead, so it's very helpful for some workloads. A typical example is
a workload that does sequential writes followed by fsync.
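
As a rough illustration of the difference, the toy program below compares
member-disk I/O counts. It is not kernel code: the 4+1 disk layout and the
helper names (rmw_ios, full_stripe_ios) are assumptions made up for this
sketch, and it counts the classic read-modify-write variant (read old data
and old parity), while the exact read pattern MD chooses may differ.

/*
 * Toy sketch only -- not kernel code.  It counts member-disk I/Os for a
 * RAID5 stripe of 'data_disks' data chunks plus one parity chunk.
 */
#include <stdio.h>

/* read-modify-write of 'chunks' data chunks in one stripe:
 * read old data + old parity, then write new data + new parity */
static int rmw_ios(int chunks)
{
    return 2 * (chunks + 1);
}

/* full-stripe write: no reads at all, write every data chunk plus parity */
static int full_stripe_ios(int data_disks)
{
    return data_disks + 1;
}

int main(void)
{
    int data_disks = 4;    /* e.g. a 5-disk RAID5: 4 data + 1 parity */

    /* four 1-chunk writes issued separately vs. aggregated by the cache */
    printf("four separate RMW updates:        %d I/Os\n", 4 * rmw_ios(1));
    printf("one aggregated full-stripe write: %d I/Os\n",
           full_stripe_ios(data_disks));
    return 0;
}

Sixteen I/Os collapse into five once the cache has aggregated the four small
writes into one full stripe, which is exactly the win described above.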

In write-back mode, MD reports IO completion to the upper layer (usually a
filesystem) right after the data hits the cache disk. The data is flushed to
the RAID disks later, after specific conditions are met. So a cache disk
failure will cause data loss.

--------------------------------------
The implementation:

The write-through and write-back caches use the same disk format. The cache
disk is organized as a simple write log. The log consists of 'meta data' and
'data' pairs. The meta data describes the data. It also includes a checksum
and a sequence ID for recovery identification. Data can be IO data or parity
data. The data is checksummed too, and its checksum is stored in the meta
data ahead of the data. The checksum is an optimization, because it lets MD
write meta data and data freely without worrying about ordering. The MD
superblock has a field pointing to the valid meta data at the log head.

The log implementation is pretty straightforward. The difficult part is the
order in which MD writes data to the cache disk and the RAID disks.
Specifically, in write-through mode, MD calculates the parity for the IO
data, writes both the IO data and the parity to the log, writes the data and
parity to the RAID disks after they have settled down in the log, and
finally reports the IO as finished. Reads just go to the RAID disks as
usual.

In write-back mode, MD writes the IO data to the log and reports IO
completion. The data is also fully cached in memory at that time, which
means reads must query the memory cache. When certain conditions are met, MD
flushes the data to the RAID disks: it calculates the parity for the data
and writes the parity into the log; after that is finished, it writes both
data and parity to the RAID disks and can then release the memory cache. The
flush conditions can be that the stripe becomes a full-stripe write, that
free cache disk space is low, or that in-kernel memory cache space is low.

After an unclean shutdown, MD does recovery. MD reads all meta data and data
from the log. The sequence ID and checksum help detect corrupted meta data
and data. If MD finds a stripe with data and valid parity (1 parity block
for RAID 4/5 and 2 for RAID 6), MD writes the data and parity to the RAID
disks. If the parity is incomplete, it is discarded. If part of the data is
corrupted, it is discarded too. MD then loads the valid data and writes it
to the RAID disks in the normal way.
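
To make the role of the sequence ID and checksum concrete, here is a small,
purely illustrative sketch; the structure layout, field names and toy
checksum are invented for this example and do not match the real on-disk
format in drivers/md/raid5-cache.c. Recovery replays log entries only while
both checks pass and discards everything after the first mismatch.

#include <stdint.h>
#include <stdio.h>

struct log_meta {      /* hypothetical, simplified metadata block */
    uint64_t seq;      /* must advance by one per metadata block */
    uint32_t csum;     /* checksum of the block (toy version here) */
    uint32_t payload;  /* stands in for the data/parity it describes */
};

/* trivial stand-in checksum, not the CRC the kernel actually uses */
static uint32_t toy_csum(const struct log_meta *m)
{
    return (uint32_t)m->seq ^ m->payload ^ 0x5a5a5a5au;
}

int main(void)
{
    struct log_meta log[4] = {
        { .seq = 10, .payload = 111 },
        { .seq = 11, .payload = 222 },
        { .seq = 12, .payload = 333 },
        { .seq = 99, .payload = 444 },  /* never completed: bad seq and csum */
    };

    for (int i = 0; i < 3; i++)         /* only the first three were written */
        log[i].csum = toy_csum(&log[i]);

    /* "recovery": replay entries until the sequence or checksum mismatches */
    uint64_t expect = 10;
    for (int i = 0; i < 4; i++) {
        if (log[i].seq != expect || log[i].csum != toy_csum(&log[i])) {
            printf("log ends at entry %d; discard the rest\n", i);
            break;
        }
        printf("replay entry %d (seq %llu)\n", i,
               (unsigned long long)log[i].seq);
        expect++;
    }
    return 0;
}

As the text above notes, the real log checksums the data blocks themselves
as well, so corrupted data can be discarded during the same scan.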