Neil Brown <neilb@xxxxxxx> writes:

> On Thursday April 9, goswin-v-b@xxxxxx wrote:
>> Neil Brown <neilb@xxxxxxx> writes:
>>
>> > (*) I've been wondering about adding another bitmap which would record
>> > which sections of the array have valid data.  Initially nothing would
>> > be valid and so wouldn't need recovery.  Every time we write to a new
>> > section we add that section to the 'valid' sections and make sure that
>> > section is in-sync.
>> > When a device was replaced, we would only need to recover the parts of
>> > the array that are known to be invalid.
>> > As filesystems start using the new "invalidate" command for block
>> > devices, we could clear bits for sections that the filesystem says are
>> > not needed any more...
>> > But currently it is just a vague idea.
>> >
>> > NeilBrown
>>
>> If you are up for experimenting I would go for a completely new
>> approach. Instead of working with physical blocks and marking where
>> blocks are used and out of sync, how about adding a mapping layer on
>> the device and using virtual blocks? You reduce the reported disk size
>> by maybe 1% to always have some spare blocks, and initially all blocks
>> are unmapped (unused). Then whenever there is a write you pick out an
>> unused block, write to it and change the in-memory mapping of the
>> logical to the physical block. Every X seconds, on a barrier or on a
>> sync, you commit the mapping from memory to disk in such a way that it
>> is synchronized between all disks in the raid. So every committed
>> mapping represents a valid raid set. After the commit of a mapping, all
>> blocks changed between that mapping and the previous one can be marked
>> as free again. Better to use the second-to-last mapping, so there are
>> always 2 valid mappings to choose from after a crash.
>>
>> This would obviously need a lot more space than a bitmap, but space is
>> (relatively) cheap. One benefit, imho, is that sync/barrier would not
>> have to stop all activity on the raid while waiting for the sync/barrier
>> to finish. It just has to finalize the mapping for the commit and can
>> then start a new in-memory mapping while the finalized one is written
>> to disk.
>
> While there is obviously real value in this functionality, I can't
> help thinking that it belongs in the file system, not the block
> device.

I believe it is the only way to actually remove the race conditions
inherent in software raid, and there are some uses that don't work well
with a filesystem. E.g. creating a filesystem with only a swapfile on
it, instead of swapping to the raid device directly, seems a bit stupid.
The same goes for databases that use block devices.

> But then I've always seen logical volume management as an interim hack
> until filesystems were able to span multiple volumes in a sensible
> way.  As time goes on it seems less and less 'interim'.
>
> I may well implement a filesystem that has this sort of
> functionality.  I'm very unlikely to implement it in the md layer.
> But you never know what will happen...

Zfs already does this, and btrfs does it too, but only with raid1.
I find that zfs doesn't really integrate the two, though: it just has
the raid and filesystem layers in a single binary, still as 2 separate
layers. That makes changing the layout inflexible, e.g. you can't grow
from 4 to 5 disks per stripe.

> Thanks for the thoughts.
>
> NeilBrown

MfG
        Goswin
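
A rough userspace sketch of the remapping scheme described above, just to
make the commit/free ordering concrete. All names (remap_dev, remap_write,
remap_commit) and the sizes are invented for illustration; this is not md
code, only one way the two-generation mapping could be arranged.

/*
 * Sketch of the remapping idea: every write goes to a fresh spare
 * physical block, the logical->physical map lives in memory, and a
 * commit writes the map to one of two alternating on-disk slots so
 * that a crash always leaves at least one consistent mapping behind.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define NBLOCKS   1000          /* physical blocks on the device       */
#define NLOGICAL   990          /* ~1% held back as spare blocks       */
#define UNMAPPED  UINT32_MAX

struct remap_dev {
	uint32_t map[NLOGICAL];        /* logical -> physical            */
	uint8_t  in_use[NBLOCKS];      /* physical block allocated?      */
	uint32_t freed_last[NLOGICAL]; /* obsoleted since the last commit */
	int      nfreed_last;
	uint32_t freed_prev[NLOGICAL]; /* obsoleted one commit earlier   */
	int      nfreed_prev;
	int      commit_slot;          /* 0 or 1: which on-disk map copy */
};

static void remap_init(struct remap_dev *d)
{
	memset(d, 0, sizeof(*d));
	for (int i = 0; i < NLOGICAL; i++)
		d->map[i] = UNMAPPED;  /* nothing valid, nothing to resync */
}

static uint32_t alloc_physical(struct remap_dev *d)
{
	for (uint32_t p = 0; p < NBLOCKS; p++)
		if (!d->in_use[p]) {
			d->in_use[p] = 1;
			return p;
		}
	abort();  /* out of spare blocks; a real version would wait or GC */
}

/* A write goes to a fresh physical block; the old block cannot be reused
 * until two commits have passed, so the previously committed map still
 * points at valid data after a crash. */
static void remap_write(struct remap_dev *d, uint32_t logical)
{
	uint32_t old = d->map[logical];
	uint32_t new = alloc_physical(d);

	/* ... write the data (and parity) to physical block 'new' ... */
	d->map[logical] = new;
	if (old != UNMAPPED)
		d->freed_last[d->nfreed_last++] = old;
}

/* Called every X seconds, or on a barrier/sync.  Only the map write has
 * to be ordered; new writes can already build the next in-memory mapping
 * while this one goes to disk. */
static void remap_commit(struct remap_dev *d)
{
	printf("writing map generation to slot %d on all disks\n",
	       d->commit_slot);
	d->commit_slot ^= 1;

	/* Blocks obsoleted before the *previous* commit are no longer
	 * reachable from either on-disk mapping and may be reused. */
	for (int i = 0; i < d->nfreed_prev; i++)
		d->in_use[d->freed_prev[i]] = 0;
	memcpy(d->freed_prev, d->freed_last,
	       d->nfreed_last * sizeof(uint32_t));
	d->nfreed_prev = d->nfreed_last;
	d->nfreed_last = 0;
}

int main(void)
{
	struct remap_dev d;

	remap_init(&d);
	remap_write(&d, 7);
	remap_commit(&d);
	remap_write(&d, 7);   /* old copy of block 7 queued for freeing  */
	remap_commit(&d);
	remap_commit(&d);     /* now the oldest copy may be reused       */
	return 0;
}

The point of keeping two committed mappings is that a crash in the middle
of a commit still leaves the previous map intact on every disk; a physical
block is only recycled once neither on-disk map can reference it.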