Re: About the md-bitmap behavior

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Wed, 22 Jun 2022 10:37:29 +0800

On 2022/6/22 10:15, Doug Ledford wrote:
On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote:
On 20/06/2022 08:56, Qu Wenruo wrote:
The write-hole has been addressed with journaling already, and this will
be adding a new and not-needed feature - not saying it wouldn't be nice
to have, but do we need another way to skin this cat?

I'm talking about the BTRFS RAID56, not md-raid RAID56, which is a
completely different thing.

Here I'm just trying to understand how the md-bitmap works, so that I
can do a proper bitmap for btrfs RAID56.

Ah. Okay.

Neil Brown is likely to be the best help here as I believe he wrote a
lot of the code, although I don't think he's much involved with md-raid
any more.

I can't speak to how it is today, but I know it was *designed* to be
sync flush of the dirty bit setting, then lazy, async write out of the
clear bits.  But, yes, in order for the design to be reliable, you must
flush out the dirty bits before you put writes in flight.

Thank you very much for confirming my concern.

So maybe I just haven't read the md-bitmap code carefully enough to see
the full picture.
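
To make the ordering discussed above concrete, here is a rough toy model of
"sync flush when setting a dirty bit, lazy write-out when clearing it". It
is not the md-bitmap code and does not claim to match what
md_bitmap_startwrite() actually does; every name in it (wi_bitmap,
write_begin, write_end, bitmap_lazy_writeout, NR_REGIONS) is hypothetical,
and the "disk" is just a second array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_REGIONS 8            /* hypothetical: bitmap regions tracked */

/* Toy model: an in-memory bitmap plus the copy that has reached disk. */
struct wi_bitmap {
    bool mem[NR_REGIONS];       /* in-memory dirty bits */
    bool disk[NR_REGIONS];      /* bits as persisted on stable storage */
    int inflight[NR_REGIONS];   /* writes in flight per region */
    int sync_flushes;           /* number of synchronous bitmap flushes */
};

/* Stand-in for "write the bitmap and wait for it to hit stable storage". */
static void bitmap_flush_sync(struct wi_bitmap *b)
{
    for (size_t i = 0; i < NR_REGIONS; i++)
        b->disk[i] = b->mem[i];
    b->sync_flushes++;
}

/* Must complete before the data/parity write for this region is submitted. */
static void write_begin(struct wi_bitmap *b, size_t region)
{
    assert(region < NR_REGIONS);
    if (!b->mem[region]) {
        b->mem[region] = true;
        bitmap_flush_sync(b);   /* dirty bit is durable before any data I/O */
    }
    b->inflight[region]++;
    assert(b->disk[region]);    /* invariant: on-disk bit set while in flight */
}

/* Called on write completion; only clears the in-memory bit. */
static void write_end(struct wi_bitmap *b, size_t region)
{
    assert(region < NR_REGIONS && b->inflight[region] > 0);
    if (--b->inflight[region] == 0)
        b->mem[region] = false; /* the on-disk bit stays set for now */
}

/* Lazy, asynchronous write-out of cleared bits (e.g. from a timer). */
static void bitmap_lazy_writeout(struct wi_bitmap *b)
{
    for (size_t i = 0; i < NR_REGIONS; i++)
        if (!b->mem[i] && b->inflight[i] == 0)
            b->disk[i] = false;
}
```

The point of the split is that a stale set bit on disk only costs an
unnecessary resync of that region, while a missing set bit would let a
crash leave an inconsistent stripe unnoticed. So only the set path needs to
be synchronous.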

One thing I'm not sure about though, is that MD RAID5/6 uses fixed
stripes.  I thought btrfs, since it was an allocation filesystem, didn't
have to use full stripes?  Am I wrong about that?

Unfortunately, we only do allocation for the RAID56 chunks. Inside a
RAID56 chunk, the underlying devices still need to follow the regular
RAID56 full stripe scheme.

Thus the btrfs RAID56 is still the same regular RAID56 inside one btrfs
RAID56 chunk, but without bitmap/journal.

Because it would seem
that if your data isn't necessarily in full stripes, then a bitmap might
not work so well since it just marks a range of full stripes as
"possibly dirty, we were writing to them, do a parity resync to make
sure".

The resync part is where btrfs shines: the extra csum (for the untouched
part) and metadata COW ensure that we only see the old untouched data,
and with the extra csum we can safely rebuild the full stripe.

Thus as long as no device is missing, a write-intent-bitmap is enough to
address the write hole in btrfs (at least for COW protected data and all
metadata).
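
As a rough illustration of the resync idea above, the sketch below handles
one full stripe that the bitmap flagged as dirty: verify the blocks that
have a checksum, then recompute parity from the data. It is only a toy
under simplifying assumptions (single XOR parity, no missing device, a
made-up FNV-1a checksum instead of the real btrfs csum), and all names
(full_stripe, toy_csum, resync_stripe, NR_DATA, BLK) are hypothetical, not
btrfs code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_DATA  2              /* hypothetical: data stripes per full stripe */
#define BLK      4              /* hypothetical: bytes per stripe block */

struct full_stripe {
    uint8_t data[NR_DATA][BLK];
    uint8_t parity[BLK];        /* RAID5-style XOR parity */
    bool has_csum[NR_DATA];     /* true for committed (COW-protected) blocks */
    uint32_t csum[NR_DATA];     /* expected checksum for those blocks */
};

/* Toy checksum (FNV-1a), used here only as a stand-in. */
static uint32_t toy_csum(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Resync one stripe flagged dirty by the bitmap. Returns false when a
 * committed block fails its checksum, in which case parity is left
 * untouched rather than recomputed from bad data.
 */
static bool resync_stripe(struct full_stripe *fs)
{
    for (size_t d = 0; d < NR_DATA; d++)
        if (fs->has_csum[d] && toy_csum(fs->data[d], BLK) != fs->csum[d])
            return false;

    memset(fs->parity, 0, BLK);
    for (size_t d = 0; d < NR_DATA; d++)
        for (size_t i = 0; i < BLK; i++)
            fs->parity[i] ^= fs->data[d][i];
    return true;
}
```

Blocks without a checksum here stand for the half-written range that no
committed metadata points to yet, so their content does not matter; parity
only has to match whatever is on disk.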

In any case, Wols is right, probably want to ping Neil on this.  Might
need to ping him directly though.  Not sure he'll see it just on the
list.

Adding Neil into this thread. Any clue on the existing
md_bitmap_startwrite() behavior?

Thanks,
Qu