Re: Any benefit to write intent bitmaps on Raid1

Neil Brown <neilb@xxxxxxx> writes:

> On Saturday April 11, goswin-v-b@xxxxxx wrote:
>> Neil Brown <neilb@xxxxxxx> writes:
>> 
>> > On Thursday April 9, goswin-v-b@xxxxxx wrote:
>> >> Neil Brown <neilb@xxxxxxx> writes:
>> >> 
>> >> > (*) I've been wondering about adding another bitmap which would record
>> >> > which sections of the array have valid data.  Initially nothing would
>> >> > be valid and so wouldn't need recovery.  Every time we write to a new
>> >> > section we add that section to the 'valid' sections and make sure that
>> >> > section is in-sync.
>> >> > When a device was replaced, we would only need to recover the parts of
>> >> > the array that are known to be invalid.
>> >> > As filesystems start using the new "invalidate" command for block
>> >> > devices, we could clear bits for sections that the filesystem says are
>> >> > not needed any more...
>> >> > But currently it is just a vague idea.
>> >> >
>> >> > NeilBrown
>> >> 
>> >> If you are up for experimenting I would go for a completely new
>> >> approach. Instead of working with physical blocks and marking where
>> >> blocks are used and out of sync, how about adding a mapping layer on
>> >> the device and using virtual blocks? You reduce the reported disk size
>> >> by maybe 1% to always have some spare blocks, and initially all blocks
>> >> are unmapped (unused). Then whenever there is a write you pick out
>> >> an unused block, write to it and change the in-memory mapping of the
>> >> logical to the physical block. Every X seconds, on a barrier or a sync,
>> >> you commit the mapping from memory to disk in such a way that it is
>> >> synchronized between all disks in the raid. So every committed mapping
>> >> represents a valid raid set. After the commit of the mapping, all
>> >> blocks changed between that mapping and the previous one can be marked
>> >> as free again. Better to use the second-to-last, so there are always
>> >> 2 valid mappings to choose from after a crash.
>> >> 
>> >> This would obviously need a lot more space than a bitmap, but space is
>> >> (relatively) cheap. One benefit imho should be that sync/barrier would
>> >> not have to stop all activity on the raid to wait for the sync/barrier
>> >> to finish. It just has to finalize the mapping for the commit and can
>> >> then start a new in-memory mapping while the finalized one is written
>> >> to disk.
>> >
>> > While there is obviously real value in this functionality, I can't
>> > help thinking that it belongs in the file system, not the block
>> > device.
>> 
>> I believe it is the only way to actually remove the race conditions
>> inherent in software raid, and there are some uses that don't work well
>> with a filesystem. E.g. creating a filesystem with only a swapfile on
>> it instead of using a raid device seems a bit stupid. Or databases
>> that use block devices.
>
> I agree that it would remove some races, make resync unnecessary, and
> thus remove the small risk of data loss when a system with a degraded
> raid5 crashes.  I doubt it is the only way, and may not even be a good
> way, though I'm not certain.

OK, not the only way. You could have a journal where you first write
which block is to be updated together with its data, sync, and then write
the data to the actual block. After a crash the journal could simply be
replayed.
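
Roughly what I mean, as a userspace sketch (the layout, offsets and names
here are made up for illustration, not anything md actually implements):

/* journalled update: write the intent first, sync it, then write the
 * real block.  After a crash, replaying the journal re-applies any
 * update that might not have reached its final location.  A real
 * design would also need sequence numbers or checksums to know which
 * journal slots are valid. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE  4096
#define JOURNAL_OFF 0ULL                      /* journal at the front */
#define DATA_OFF    (1024ULL * BLOCK_SIZE)    /* data area behind it  */

struct jentry {
        uint64_t block;                       /* logical block number */
        char     data[BLOCK_SIZE];
};

int journalled_write(int fd, uint64_t slot, uint64_t block, const char *data)
{
        struct jentry e = { .block = block };
        memcpy(e.data, data, BLOCK_SIZE);

        if (pwrite(fd, &e, sizeof(e), JOURNAL_OFF + slot * sizeof(e)) < 0)
                return -1;
        if (fsync(fd) < 0)      /* the intent is stable on disk ...   */
                return -1;
        /* ... so the in-place write can no longer leave an update half
         * applied with no way to redo it */
        if (pwrite(fd, data, BLOCK_SIZE, DATA_OFF + block * BLOCK_SIZE) < 0)
                return -1;
        return 0;
}

int replay_journal(int fd, uint64_t nslots)
{
        struct jentry e;

        for (uint64_t s = 0; s < nslots; s++) {
                if (pread(fd, &e, sizeof(e), JOURNAL_OFF + s * sizeof(e)) < 0)
                        return -1;
                if (pwrite(fd, e.data, BLOCK_SIZE,
                           DATA_OFF + e.block * BLOCK_SIZE) < 0)
                        return -1;
        }
        return fsync(fd);
}

The obvious downside is that every block gets written twice, which is why
doing the journal on a separate fast device, as you suggest below, makes
it more attractive.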

> Your mapping of logical to physical blocks - it would technically need
> to map each sector independently, but let's be generous (and fairly
> realistic) and map each 4K block independently.
> Then with a 1TB device, you have 2**28 entries in the table, each 4
> bytes, so 2**30 bytes, or 1 gigabyte.
> You suggest this table is kept in memory.  While memory is cheap, I
> don't think it is that cheap yet.
> So you would need to make compromises, either not keeping it all in
> memory, or having larger block sizes (and so needing to pre-read for
> updates), or having a more complicated data structure.  Or, more
> likely, all of the above.

Plus, as a plain array, you would need multiple copies of that 1 GB. A
B-tree where only the used parts are kept in memory, or something
similar, would really be necessary. Mapping extents instead of individual
blocks would also be useful, as would a defragmenter that remaps blocks
into larger contiguous segments. But now it really gets complex.
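
Something like this is what I have in mind for the extent mapping
(made-up structures, nothing that exists in md; a real version would keep
the extents in a B-tree and only cache the parts that are in use):

/* one remapped extent: a run of logical blocks living contiguously
 * somewhere else on disk */
#include <stdint.h>
#include <stddef.h>

struct extent {
        uint64_t logical;    /* first logical block of the extent */
        uint64_t physical;   /* where it currently lives          */
        uint64_t len;        /* length in blocks                  */
};

/* look up a logical block; a sorted array stands in for the B-tree */
uint64_t remap(const struct extent *map, size_t n, uint64_t logical)
{
        for (size_t i = 0; i < n; i++) {
                const struct extent *e = &map[i];

                if (logical >= e->logical && logical < e->logical + e->len)
                        return e->physical + (logical - e->logical);
        }
        return UINT64_MAX;   /* unmapped: grab a free block and add an extent */
}

The defragmenter then just replaces many small extents with one big one,
which is also what keeps the table, and the periodic commit of it, small.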

> You could make it work, but there would be a performance hit.
>
> Now look at your cases where a filesystem doesn't work well:
>  1/ Swap.  That is a non-issue.  After a crash, the contents of swap
>     are irrelevant.  Without a crash, the races you refer to are
>     irrelevant.

What about suspend to swap?

>  2/ Database that use block devices directly.   Why do they use the
>     block device directly rather than using O_DIRECT to a
>     pre-allocated file?  Because they believe that the filesystem
>     introduces a performance penalty.  What reason is there to believe
>     that the performance penalty of your remapped-raid would
>     necessarily be less than that of a filesystem?  I cannot see one.

You are assuming we could change the DB to use files instead. :)

> BTW an alternate approach to closing those races (assuming that I am
> understanding you correctly) is to journal all updates to a separate
> device.  Possibly an SSD or battery-backed RAM.  That could have the
> added benefit of reducing latency, though it may impact throughput.
> I'm not sure if that is an approach with a real future either.  But it
> is a valid alternative.

That is what hardware raids do.
 
>> > But then I've always seen logical volume management as an interim hack
>> > until filesystems were able to span multiple volumes in a sensible
>> > way.  As time goes on it seems less and less 'interim'.
>> >
>> > I may well implement a filesystem that has this sort of
>> > functionality.  I'm very unlikely to implement it in the md layer.
>> > But you never know what will happen...
>> 
>> ZFS already does this. Btrfs does it, but only with raid1. But I find
>> that ZFS doesn't really integrate the two; it just has the raid and
>> filesystem layer in a single binary, but still as 2 separate layers.
>> That makes changing the layout inflexible, e.g. you can't grow from 4
>> to 5 disks per stripe.
>
> I thought ZFS was more integrated than that, but I haven't looked
> deeply.
> My vague notion was that when ZFS wanted to write "some data" it
> would break it into sets of N blocks, calculate a parity block for
> each set, then write those N+1 blocks to N+1 different devices,
> wherever there happened to be unused space.  Then the addresses of
> those N+1 blocks would be stored in the file metadata, which would be
> written in a similar way, possibly with a different (smaller) N.
>
> This idea (which might be completely wrong) implies very tight
> integration between the layers.

But first you define a storage pool from a set of X devices with a
certain raid level. The higher level then uses virtual addresses into
that pool. If you want to grow your ZFS you have to add new disks and
create a new pool from them. None of the docs I've seen mention any
support for changing an existing pool.
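
If your model is right, the write side would look roughly like this (a
sketch only: N data blocks plus one XOR parity block written to N+1
devices; I have no idea whether ZFS really does it this way):

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* build the parity block for one set of n data blocks; the n+1 blocks
 * would then be written to n+1 different devices, each at whatever
 * free address that device happens to have */
void make_parity(const uint8_t data[][BLOCK_SIZE], int n,
                 uint8_t parity[BLOCK_SIZE])
{
        memset(parity, 0, BLOCK_SIZE);
        for (int i = 0; i < n; i++)
                for (int j = 0; j < BLOCK_SIZE; j++)
                        parity[j] ^= data[i][j];
}

The addresses of those n+1 blocks would then go into the file metadata,
as you describe below.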

> With this setup you could conceivably change the default N at any
> time.  Old data wouldn't be relocated, but new writes would be written
> with the new N.  If you have a background defragmentation process, it
> could, over a period of time, arrange for the whole filesystem to be
> re-laid out with the new N.

As I understand it the pool provides a virtual->physical mapping and the
higher layers use the virtual addresses. Increasing the number of disks
in a pool would change all the physical addresses, just like when
growing a raid, and the higher layers would have to readjust their
addresses. At least that is my understanding.
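
To illustrate what I mean by "just like when growing a raid": with simple
striping (chunking ignored, and whether ZFS lays data out like this I
don't know), the placement of a virtual block depends directly on the
number of disks:

#include <stdint.h>
#include <stdio.h>

static void place(uint64_t vblock, unsigned ndisks)
{
        printf("vblock %llu -> disk %llu, offset %llu (ndisks=%u)\n",
               (unsigned long long)vblock,
               (unsigned long long)(vblock % ndisks),
               (unsigned long long)(vblock / ndisks),
               ndisks);
}

int main(void)
{
        place(10, 4);   /* disk 2, offset 2 with 4 disks           */
        place(10, 5);   /* disk 0, offset 2 with 5 disks: it moved */
        return 0;
}

So unless there is an extra remapping layer in between, or the data gets
rewritten, adding a disk moves nearly every block.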

> Clearly data would still be recoverable after a single drive failure.
>
> The problem I see with this approach is the cost of recovering to a
> hot-spare after device failure.  Finding which blocks need to be
> written where would require scanning all the metadata on the entire
> filesystem.  And much of this would not be contiguous.  So much
> seeking would be involved.  I wouldn't be surprised if recovering a
> device in a nearly-full filesystem took an order of magnitude longer
> with that approach than with md style raid.

One huge improvement comes from splitting data and metadata into
separate segments, thereby keeping the metadata close together. If one
also takes care to write the parent of a metadata block before its child,
and defragments them frequently, they should stay pretty linear.

And how much metadata is there in the filesystem? My 4.6TB movie
archive has 30000 inodes in use, so even at, say, 256 bytes per inode
that is under 8 MB of metadata. Hardly relevant. For a news spool it
would look different.

> Given that observation: maybe I am wrong about RAID-Z.  However it is
> the only model I can come up with that matches the various snippets I
> have heard about it.
>
> (hmm... maybe a secondary indexing scheme could help... might get it
> down to taking only twice as long, which could be acceptable ....
> maybe I will try implementing that after all and see how it
> works... in my spare time)
>
> NeilBrown

The snippets I've read about ZFS lead me to believe that the raid level
is restricted to the pools. So in effect you just have lots of
internal md devices. Resync speed in ZFS should then be just like for a
normal raid.

Regards,
        Goswin