It works! (At least on a sample 5 MB device with 5 x 1 MB partitions :-) I will find more space on my drives and do a larger test, but I don't see why it shouldn't work.)

Here are the caveats (and questions); a rough sketch of the test setup follows the list:

- Neil, like you pointed out, the power-of-2 chunk size will probably need a code change (in the kernel, or only in the userspace tool?)
- Any performance or other reasons why a terabyte-size chunk may not be feasible?
- Implications of safe_mode_delay
- Would the metadata be updated on the block device being written to, and on the parity device as well?
- If the drive that fails is the same as the drive being written to, would the lack of metadata updates on the other devices affect reconstruction?
- Adding new devices (is it possible to move the parity to the disk being added? How does device addition work for RAID4 ... is it added as a zeroed-out device with the parity disk remaining the same?)
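For reference, the setup for this test looks roughly like the sketch below. Everything here is illustrative - loop devices standing in for real drives, made-up names (/dev/md100, /tmp/raid4test-*.img), and sizes scaled up a bit from the 1 MB partitions so the superblock and data offset don't get in the way - rather than the exact commands used:

    # Illustrative only: 5 loop devices standing in for 5 drives (run as root).
    DEVS=""
    for i in 1 2 3 4 5; do
        truncate -s 256M /tmp/raid4test-$i.img                  # sparse backing file
        DEVS="$DEVS $(losetup -f --show /tmp/raid4test-$i.img)" # attach, remember device name
    done

    # RAID4 = dedicated parity device.  --chunk is in KiB, so 131072 = 128 MiB,
    # the largest power of 2 that still fits a 256 MiB component after the
    # superblock.  Each data component then holds exactly one chunk, which is
    # the whole point of the experiment (and wastes roughly half the space, as
    # Neil notes below).  Whether md accepts far larger chunks is one of the
    # open questions above.
    mdadm --create /dev/md100 --level=4 --raid-devices=5 --chunk=131072 $DEVS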
On 2 December 2014 at 03:16, NeilBrown <neilb@xxxxxxx> wrote:
> On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@xxxxxxxxx> wrote:
>
>> On 1 December 2014 at 21:30, Anshuman Aggarwal
>> <anshuman.aggarwal@xxxxxxxxx> wrote:
>> > On 26 November 2014 at 11:54, Anshuman Aggarwal
>> > <anshuman.aggarwal@xxxxxxxxx> wrote:
>> >> On 25 November 2014 at 04:20, NeilBrown <neilb@xxxxxxx> wrote:
>> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>> >>> <anshuman.aggarwal@xxxxxxxxx> wrote:
>> >>>
>> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@xxxxxxx> wrote:
>> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>> >>>> > <anshuman.aggarwal@xxxxxxxxx> wrote:
>> >>>> >
>> >>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>> >>>> >> parity being invalidated for any write to any of the disks (assuming md
>> >>>> >> operates at a chunk level)? ...also please see my reply below
>> >>>> >
>> >>>> > Operating at a chunk level would be a very poor design choice. md/raid5
>> >>>> > operates in units of 1 page (4K).
>> >>>>
>> >>>> It appears that my requirement may be met by a partitionable md RAID4
>> >>>> array where the partitions are all on individual underlying block
>> >>>> devices, not striped across the block devices. Is that currently
>> >>>> possible with md RAID? I don't see how, but such an enhancement could
>> >>>> do all that I had outlined earlier.
>> >>>>
>> >>>> Is this possible to implement using RAID4 and MD already?
>> >>>
>> >>> Nearly. RAID4 currently requires the chunk size to be a power of 2.
>> >>> Rounding down the size of your drives to match that could waste nearly half
>> >>> the space. However it should work as a proof-of-concept.
>> >>>
>> >>> RAID0 supports non-power-of-2 chunk sizes. Doing the same thing for
>> >>> RAID4/5/6 would be quite possible.
>> >>>
>> >>>> can the
>> >>>> partitions be made to write to individual block devices such that
>> >>>> parity updates don't require reading all devices?
>> >>>
>> >>> md/raid4 currently tries to minimize total IO requests when performing
>> >>> an update, but prefers spreading the IO over more devices if the total number
>> >>> of requests is the same.
>> >>>
>> >>> So for a 4-drive RAID4, updating a single block can be done by:
>> >>>   read old data block, read parity, write data, write parity - 4 IO requests
>> >>> or
>> >>>   read other 2 data blocks, write data, write parity - 4 IO requests.
>> >>>
>> >>> In this case it will prefer the second, which is not what you want.
>> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>> >>> will be chosen.
>> >>> It is quite trivial to flip this default for testing:
>> >>>
>> >>> -       if (rmw < rcw && rmw > 0) {
>> >>> +       if (rmw <= rcw && rmw > 0) {
>> >>>
>> >>>
>> >>> If you had 5 drives, you could experiment with no code changes.
>> >>> Make the chunk size the largest power of 2 that fits in the device, and then
>> >>> partition to align the partitions on those boundaries.
>> >>
>> >> If the chunk size is almost the same as the device size, I assume the
>> >> entire chunk is not invalidated for parity on writing to a single
>> >> block? i.e. if only 1 block is updated, only that block's parity will be
>> >> read and written, and not for the whole chunk? If that's the case, what
>> >> purpose does a chunk serve in md raid? If that's not the case, it
>> >> wouldn't work, because a single block update would lead to parity
>> >> being written for the entire chunk, which is the size of the device.
>> >>
>> >> I do have more than 5 drives, though they are in use currently. I will
>> >> create a small testing partition on each device of the same size and
>> >> run the test on that after ensuring that the drives do go to sleep.
>> >>
>> >>>
>> >>> NeilBrown
>> >>>
>> >
>> > Wouldn't the metadata writes wake up all the disks in the cluster
>> > anyway (defeating the purpose)? This idea will require metadata to
>> > not be written out to each device (is that even possible, or on the
>> > cards?)
>> >
>> > I am about to try out your suggestion with the chunk sizes anyway, but
>> > thought about the metadata being a major stumbling block.
>> >
>>
>> And it seems to be confirmed that the metadata write is waking up the
>> other drives. On any write to a particular drive, the metadata update
>> is accessing all the others.
>>
>> Am I correct in assuming that all metadata is currently written as
>> part of the block device itself, and that the external metadata is
>> still embedded in each of the block devices (only the format of the
>> metadata is defined externally)? I guess to implement this we would
>> need to store metadata elsewhere, which may be a major development
>> effort. Still, that may be a flexibility desired in md raid for other
>> reasons...
>>
>> Neil, your thoughts?
>
> This is exactly why I suggested testing with existing code and seeing how far
> you can get. Thanks.
>
> For a full solution we probably do need some code changes here, but for
> further testing you could:
>  1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
>  2/ set the safe_mode_delay to 0
>       echo 0 > /sys/block/mdXXX/md/safe_mode_delay
>
> Then it won't try to update the metadata until you stop the array, or a
> device fails.
>
> Longer term: it would probably be good to only update the bitmap on the
> devices that are being written to - and to merge all bitmaps when assembling
> the array. Also, when there is a bitmap, the safe_mode functionality should
> probably be disabled.
>
> NeilBrown
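PS: in terms of concrete commands, the two follow-up steps above boil down to roughly the following, applied to the illustrative /dev/md100 from the earlier sketch (again only a sketch, with made-up names):

    # 1/ make sure there is no write-intent bitmap (harmless if none was ever created)
    mdadm --grow --bitmap=none /dev/md100

    # 2/ with safe_mode_delay at 0, md should not mark the array clean after
    #    writes, so no metadata updates until the array is stopped or a device fails
    echo 0 > /sys/block/md100/md/safe_mode_delay

    # A crude check of which components a single write actually touches: wait for
    # the initial resync to finish (watch /proc/mdstat), snapshot the per-device
    # write counters, do one small direct write, and compare.
    grep loop /proc/diskstats
    dd if=/dev/zero of=/dev/md100 bs=4k count=1 seek=1000 oflag=direct
    grep loop /proc/diskstats

If that behaves as hoped, only one data component plus the parity component should show new writes, and the other components should stay idle.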