Re: Split RAID: Proposal for archival RAID using incremental batch checksum

On 25 November 2014 at 04:20, NeilBrown <neilb@xxxxxxx> wrote:
> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@xxxxxxxxx> wrote:
>
>> On 3 November 2014 at 11:22, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>> > <anshuman.aggarwal@xxxxxxxxx> wrote:
>> >
>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>> >> parity being invalidated for any write to any of the disks (assuming md
>> >> operates at a chunk level)? Also please see my reply below.
>> >
>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>> > operates in units of 1 page (4K).
>>
>> It appears that my requirement may be met by a partitionable md RAID4
>> array where the partitions are each on an individual underlying block
>> device, not striped across the block devices. Is that currently
>> possible with md raid? I don't see how, but such an enhancement could
>> do all that I had outlined earlier.
>>
>> Is this possible to implement using RAID4 and MD already?
>
> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
> Rounding down the size of your drives to match that could waste nearly half
> the space.  However it should work as a proof-of-concept.
>
> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
> RAID4/5/6 would be quite possible.
>
>> Can the
>> partitions be made to write to individual block devices such that
>> parity updates don't require reading all devices?
>
> md/raid4 currently tries to minimize total IO requests when performing
> an update, but prefers spreading the IO over more devices if the total number
> of requests is the same.
>
> So for a 4-drive RAID4, updating a single block can be done by:
>   read old data block, read parity, write data, write parity - 4 IO requests
> or
>   read the other 2 data blocks, write data, write parity - 4 IO requests.
>
> In this case it will prefer the second, which is not what you want.
> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> will be chosen.
> It is quite trivial to flip this default for testing:
>
> -       if (rmw < rcw && rmw > 0) {
> +       if (rmw <= rcw && rmw > 0) {
>
>
> If you had 5 drives, you could experiment with no code changes.
> Make the chunk size the largest power of 2 that fits in the device, and then
> partition to align the partitions on those boundaries.
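
Just to make sure I follow the request counting above, here is my rough
arithmetic as a sketch (my own simplification, not the actual raid5.c logic):

    /* IO requests needed to update one block on an n_disks RAID4,
     * where n_disks includes the parity disk. */
    static int rmw_requests(void)
    {
            /* read old data + read old parity + write data + write parity */
            return 4;
    }

    static int rcw_requests(int n_disks)
    {
            /* read every other data block, then write data + write parity */
            return (n_disks - 2) + 2;
    }

    /* 4 disks: rmw = 4, rcw = 4 -> the tie currently goes to rcw, which
     * touches every disk.
     * 5 disks: rmw = 4, rcw = 5 -> rmw wins and only 2 disks are touched. */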

If the chunk size is almost the same as the device size, I assume the
entire chunk is not invalidated for parity when a single block is
written? i.e. if only one block is updated, only that block's parity
will be read and written, not the parity for the whole chunk? If that's
the case, what purpose does a chunk serve in md raid? If that's not the
case, this wouldn't work, because a single block update would lead to
parity being written for the entire chunk, which is the size of the device.
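
The per-block update I am hoping for is just the classic XOR read-modify-write,
roughly like this sketch (assuming the page-sized units you mentioned; the
names are mine):

    /* Recompute one block's parity in place without reading the other
     * data devices: P_new = P_old ^ D_old ^ D_new */
    static void xor_update_parity(unsigned char *parity,
                                  const unsigned char *old_data,
                                  const unsigned char *new_data,
                                  unsigned int len)
    {
            unsigned int i;

            for (i = 0; i < len; i++)
                    parity[i] ^= old_data[i] ^ new_data[i];
    }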

I do have more than 5 drives, though they are currently in use. I will
create a small testing partition of the same size on each device and
run the test on that, after ensuring that the drives do go to sleep.
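
For the test I plan to work out the chunk size and partition offsets roughly
like this (sizes in sectors; just a sketch of the arithmetic, assuming I have
understood the non-rotated layout correctly):

    /* Largest power-of-2 chunk (in sectors) that fits in each per-drive
     * test partition. */
    static unsigned long long chunk_sectors(unsigned long long part_sectors)
    {
            unsigned long long chunk = 1;

            while (chunk * 2 <= part_sectors)
                    chunk *= 2;
            return chunk;
    }

    /* With no parity rotation and one chunk per device, partition 'idx'
     * of the md array starts at idx * chunk and is one chunk long, so it
     * should map onto a single underlying device. */
    static unsigned long long md_partition_start(unsigned long long chunk,
                                                 unsigned int idx)
    {
            return idx * chunk;
    }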

>
> NeilBrown
>

Thanks,
Anshuman
>
>>
>> To illustrate:
>> ----------------- RAID 4 -----------------
>> Device 1    Device 2    Device 3    Parity
>> A1          B1          C1          P1
>> A2          B2          C2          P2
>> A3          B3          C3          P3
>>
>> Each device gets written to independently (via a layer of block
>> devices), so data on Device 1 is written as A1, A2, A3 contiguous
>> blocks, leading to updates of P1, P2, P3 (without causing any reads on
>> devices 2 and 3, using XOR for the parity).
>>
>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>
>>
>> >
>> >
>> >>
>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@xxxxxxxxx> wrote:
>> >> > Right on most counts but please see comments below.
>> >> >
>> >> > On 29 October 2014 14:35, NeilBrown <neilb@xxxxxxx> wrote:
>> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>> >> >> devices contains an independent filesystem and could be accessed directly if
>> >> >> needed.  Each of the X devices contains some codes so that if at most X
>> >> >> devices in total died, you would still be able to recover all of the data.
>> >> >> If more than X devices failed, you would still get complete data from the
>> >> >> working devices.
>> >> >>
>> >> >> Every update would only write to the particular N device on which it is
>> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> >> than X for the spin-down to be really worth it.
>> >> >>
>> >> >> Am I right so far?
>> >> >
>> >> > Perfectly right so far. I typically have an N to X ratio of 4 (4 data
>> >> > devices to 1 parity device), so spin down is totally worth it for data
>> >> > protection, but more on that below.
>> >> >
>> >> >>
>> >> >> For some reason the writes to X are delayed...  I don't really understand
>> >> >> that part.
>> >> >
>> >> > This delay is basically designed around archival devices which are
>> >> > rarely read from and even more rarely written to. By delaying writes
>> >> > based on two criteria (a designated cache buffer filling up, or a preset
>> >> > time since the last write expiring) we can significantly reduce the
>> >> > writes on the parity device. This assumes that we are OK with losing a
>> >> > movie or two if the parity disk is not totally up to date, but are
>> >> > more interested in device longevity.
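
(Concretely, the flush check I had in mind was something like this rough
sketch; the names and thresholds are made up for illustration:)

    /* Flush the batched parity updates when either the dirty buffer
     * fills up or a preset time has passed since the last write. */
    static int should_flush_parity(unsigned long dirty_bytes,
                                   unsigned long buffer_limit,
                                   unsigned long secs_since_last_write,
                                   unsigned long max_delay_secs)
    {
            return dirty_bytes >= buffer_limit ||
                   secs_since_last_write >= max_delay_secs;
    }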
>> >> >
>> >> >>
>> >> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >> >>   chunksize == devicesize
>> >> > RAID6 would present us with a joint device and currently only allows
>> >> > writes to that directly, yes? Any writes will be striped.
>> >
>> > If the chunksize equals the device size, then you need a very large write for
>> > it to be striped.
>> >
>> >> > In any case would md raid allow the underlying device to be written to
>> >> > directly? Also how would it know that the device has been written to
>> >> > and hence parity has to be updated? What about the superblock which
>> >> > the FS would not know about?
>> >
>> > No, you wouldn't write to the underlying device.  You would carefully
>> > partition the RAID5 so each partition aligns exactly with an underlying
>> > device.  Then write to the partition.
>> >
>> >> >
>> >> > There is also the delayed checksum writing, which would be
>> >> > significant if one of the objectives is to reduce the amount of
>> >> > writes. Can that be delayed in the current RAID6 code? I
>> >> > understand the objective of RAID6 is to ensure data recovery, and we
>> >> > are looking at a compromise in this case.
>> >
>> > "simple matter of programming"
>> > Of course there would be a limit to how much data can be buffered in memory
>> > before it has to be flushed out.
>> > If you are mostly storing movies, then they are probably too large to
>> > buffer.  Why not just write them out straight away?
>> >
>> > NeilBrown
>> >
>> >
>> >
>> >> >
>> >> > If feasible, this can be an enhancement to MD RAID as well, where N
>> >> > devices are presented instead of a single joint device in the case of
>> >> > RAID6 (maybe the parts of the multi-part device can be the individual disks?)
>> >> >
>> >> > It will certainly solve my problem of where to store the metadata. I
>> >> > was hoping to just store it as a configuration file to be read by the
>> >> > initramfs, since in the worst case the checksum goes out of sync and
>> >> > is rebuilt from scratch.
>> >> >
>> >> >>
>> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> >> impartial opinion from me on that topic.
>> >> >
>> >> > I haven't hacked around the kernel internals much so far, so I will
>> >> > have to dig out that history. I would welcome any particular links/mail
>> >> > threads I should look at for guidance (with both yours and opposing
>> >> > points of view).
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



