On Wed, Jun 22, 2022 at 3:33 PM NeilBrown <neilb@xxxxxxx> wrote:
>
> On Wed, 22 Jun 2022, Qu Wenruo wrote:
> >
> > On 2022/6/22 10:15, Doug Ledford wrote:
> > > On Mon, 2022-06-20 at 10:56 +0100, Wols Lists wrote:
> > >> On 20/06/2022 08:56, Qu Wenruo wrote:
> > >>>> The write-hole has been addressed with journaling already, and
> > >>>> this will be adding a new and not-needed feature - not saying it
> > >>>> wouldn't be nice to have, but do we need another way to skin
> > >>>> this cat?
> > >>>
> > >>> I'm talking about the BTRFS RAID56, not md-raid RAID56, which is
> > >>> a completely different thing.
> > >>>
> > >>> Here I'm just trying to understand how the md-bitmap works, so
> > >>> that I can do a proper bitmap for btrfs RAID56.
> > >>
> > >> Ah. Okay.
> > >>
> > >> Neil Brown is likely to be the best help here as I believe he
> > >> wrote a lot of the code, although I don't think he's much involved
> > >> with md-raid any more.
> > >
> > > I can't speak to how it is today, but I know it was *designed* to
> > > be a sync flush of the dirty bit setting, then a lazy, async write
> > > out of the clear bits. But, yes, in order for the design to be
> > > reliable, you must flush out the dirty bits before you put writes
> > > in flight.
> >
> > Thank you very much for confirming my concern.
> >
> > So maybe it's me not checking the md-bitmap code carefully enough to
> > expose the full picture.
> >
> > > One thing I'm not sure about though, is that MD RAID5/6 uses fixed
> > > stripes. I thought btrfs, since it is an allocation filesystem,
> > > didn't have to use full stripes? Am I wrong about that?
> >
> > Unfortunately, we only do allocation at the RAID56 chunk level.
> > Inside a RAID56 chunk the underlying devices still need to follow
> > the regular RAID56 full-stripe scheme.
> >
> > Thus btrfs RAID56 is still the same regular RAID56 inside one btrfs
> > RAID56 chunk, but without a bitmap/journal.
> >
> > > Because it would seem that if your data isn't necessarily in full
> > > stripes, then a bitmap might not work so well, since it just marks
> > > a range of full stripes as "possibly dirty, we were writing to
> > > them, do a parity resync to make sure".
> >
> > The resync part is where btrfs shines, as the extra csum (for the
> > untouched part) and metadata COW ensure we only see the old untouched
> > data, and with the extra csum we can safely rebuild the full stripe.
> >
> > Thus as long as no device is missing, a write-intent bitmap is enough
> > to address the write hole in btrfs (at least for COW-protected data
> > and all metadata).
> >
> > > In any case, Wols is right, you probably want to ping Neil on this.
> > > Might need to ping him directly though. Not sure he'll see it just
> > > on the list.
> >
> > Adding Neil to this thread. Any clue on the existing
> > md_bitmap_startwrite() behavior?
>
> md_bitmap_startwrite() is used to tell the bitmap code that the raid
> module is about to start writing at a location. This may result in
> md_bitmap_file_set_bit() being called to set a bit in the in-memory
> copy of the bitmap, and to mark that page of the bitmap as
> BITMAP_PAGE_DIRTY.
>
> Before raid actually submits the writes to the device it will call
> md_bitmap_unplug(), which will submit the bitmap writes and wait for
> them to complete.
>
> There is a comment at the top of md/raid5.c titled "BITMAP UNPLUGGING"
> which says a few things about how raid5 ensures things happen in the
> right order.
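To make sure I understand the ordering described above, here is a toy,
self-contained C sketch of it. All the toy_* names and data structures
below are made up for illustration only - this is not the real
md-bitmap API, just the "set bit in memory, flush synchronously, submit
the data write, clear lazily" sequence.

/*
 * Toy sketch of the write-intent ordering: set the bit in the
 * in-memory bitmap, flush it to stable storage and wait, then submit
 * the data write.  Clearing the bit afterwards can be lazy/async.
 * Hypothetical names only - not the md-bitmap code.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_REGIONS 16

static bool bitmap_mem[NR_REGIONS];   /* in-memory copy of the bitmap */
static bool bitmap_disk[NR_REGIONS];  /* what is persisted on disk    */

/* Step 1: mark the region dirty in memory before the write starts.   */
static void toy_startwrite(int region)
{
	bitmap_mem[region] = true;
}

/* Step 2: flush dirty bits to disk and wait for completion.          */
static void toy_unplug(void)
{
	for (int i = 0; i < NR_REGIONS; i++)
		bitmap_disk[i] = bitmap_mem[i];   /* "write and wait"  */
}

/* Step 3: only now is it safe to submit the actual data/parity I/O.  */
static void toy_submit_data_write(int region)
{
	printf("writing region %d (bit on disk: %d)\n",
	       region, bitmap_disk[region]);
}

/* Later: clear the bit lazily once the region is known to be clean.  */
static void toy_endwrite(int region)
{
	bitmap_mem[region] = false;
}

int main(void)
{
	int region = 3;

	toy_startwrite(region);        /* set bit in memory           */
	toy_unplug();                  /* sync flush before data I/O  */
	toy_submit_data_write(region);
	toy_endwrite(region);          /* lazy clear after completion */
	return 0;
}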
>
> However I don't think any sort of bitmap can solve the write-hole
> problem for RAID5 - even in btrfs.
>
> The problem is that if the host crashes while the array is degraded
> and while some write requests were in flight, then you might have lost
> data. i.e. to update a block you must write both that block and the
> parity block. If you actually wrote neither or both, everything is
> fine. If you wrote one but not the other then you CANNOT recover the
> data that was on the missing device (there must be a missing device,
> as the array is degraded). Even having checksums of everything is not
> enough to recover that missing block.
>
> You must either:
> 1/ Have a safe duplicate of the blocks being written, so they can be
>    recovered and re-written after a crash. This is what journalling
>    does. Or
> 2/ Only write to locations which don't contain valid data, i.e. always
>    write full stripes to locations which are unused on each device.
>    This way you cannot lose existing data. Worst case: that whole
>    stripe is ignored. This is how I would handle RAID5 in a
>    copy-on-write filesystem.

Thanks Neil for explaining this. I was about to say the same idea, but
couldn't phrase it well.

md raid5 suffers from the write hole because the mapping from array LBA
to component LBA is fixed. As a result, we have to update the data in
place. btrfs already has a file-to-LBA mapping, so it shouldn't be too
expensive to make btrfs free of the write hole (no need to maintain an
extra mapping or add journaling).

Thanks,
Song

>
> However, I see you wrote:
> > Thus as long as no device is missing, a write-intent bitmap is enough
> > to address the write hole in btrfs (at least for COW-protected data
> > and all metadata).
>
> That doesn't make sense. If no device is missing, then there is no
> write hole.
> If no device is missing, all you need to do is recalculate the parity
> blocks on any stripe that was recently written. In md we use the
> write-intent bitmap. In btrfs I would expect that you already have
> some way of knowing where recent writes happened, so you can validate
> the various checksums. That should be sufficient to recalculate the
> parity. I'd be very surprised if btrfs doesn't already do this.
>
> So I'm somewhat confused as to what your real goal is.
>
> NeilBrown
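For reference, here is a toy illustration (made-up values, not btrfs or
md code) of the two cases Neil describes: with a device missing, a
half-completed data/parity update makes the lost block unrecoverable by
any bitmap or checksum, while with no device missing a stripe flagged by
a write-intent bitmap only needs its parity recomputed.

/*
 * Hypothetical 3-device RAID5 stripe: D0, D1, P = D0 ^ D1.
 * Values are made up purely for illustration.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t d0 = 0xAA, d1 = 0x55;
	uint8_t p  = d0 ^ d1;          /* parity of the healthy stripe */

	/* The device holding d1 fails: rebuild its content as d0 ^ p. */
	printf("degraded rebuild of d1: 0x%02X (expect 0x55)\n",
	       (unsigned)(d0 ^ p));

	/*
	 * Crash while updating d0: the new data reached disk, but the
	 * matching parity update did not (or vice versa).
	 */
	d0 = 0xF0;                     /* data written ...             */
	                               /* ... parity p never updated   */

	/* The lost d1 now rebuilds to garbage; no checksum recovers it. */
	printf("rebuild after partial write: 0x%02X (garbage)\n",
	       (unsigned)(d0 ^ p));

	/*
	 * With no device missing, though, d0 and d1 are both readable,
	 * so any stripe flagged by a write-intent bitmap just needs its
	 * parity recomputed.
	 */
	p = d0 ^ d1;
	printf("parity resync with no missing device: 0x%02X\n",
	       (unsigned)p);
	return 0;
}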