On 2022/6/23 11:32, Song Liu wrote:
On Wed, Jun 22, 2022 at 5:39 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
[...]
E.g.
btrfs uses 64KiB as stripe size.
O = Old data
N = New writes
     0       32K     64K
D1   |OOOOOOO|NNNNNNN|
D2   |NNNNNNN|OOOOOOO|
P    |NNNNNNN|NNNNNNN|
In the above case, no matter whether the new write reaches the disks, as
long as the crash happens before we update all the metadata and superblock
(which implies a flush for all involved devices), the fs will only try to
read the old data.
I guess we are using "write hole" for different scenarios. I use "write hole"
for the case that we corrupt data that is not being written to. This happens
with the combination of failed drive and power loss. For example, we have
raid5 with 3 drives. Each stripe has two data and one parity. When D1
failed, read to D1 is calculated based on D2 and P; and write to D1
requires updating D2 and P at the same time. Now imagine we lost
power (or crash) while writing to D2 (and P). When the system comes back
after reboot, D2 and P are out of sync. Now we lost both D2 and D1. Note
that D1 is not being written to before the power loss.
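The scenario above can be replayed with plain XOR parity. This is only a toy sketch: the block contents and variable names are made up for illustration, and it models neither md nor btrfs code.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks (RAID5-style parity)."""
    return bytes(x ^ y for x, y in zip(a, b))

# A consistent stripe on a 3-drive array: P = D1 xor D2.
d1, d2 = b"AAAA", b"BBBB"
p = xor(d1, d2)

# The drive holding D1 fails: D1 is now only reachable via D2 and P.
assert xor(d2, p) == d1

# A write to D2 has to update D2 and P together.
new_d2 = b"CCCC"
new_p = xor(d1, new_d2)  # computed, but it never reaches the disk below

# Power loss in between: the new D2 reached the disk, the new P did not.
d2_on_disk, p_on_disk = new_d2, p

# After reboot, D1 is rebuilt from a mismatched D2 and P.
rebuilt_d1 = xor(d2_on_disk, p_on_disk)
print(rebuilt_d1 == d1)  # False: D1 is corrupted although it was never written
```

Had both D2 and P reached the disk, xor(new_d2, new_p) would still give back the original D1.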
For that power loss + device loss case, a journal is the only way to go,
unless we do extra work to avoid partial writes.
For btrfs, maybe we can avoid write hole by NOT writing to D2 when D1
contains valid data (and the drive is failed). Instead, we can write a new
version of D1 and D2 to a different stripe. If we lose power during the write,
the old data is not corrupted. Does this make sense? I am not sure
whether it is practical in btrfs though.
That makes sense, but that also means the extent allocator needs extra
info, not just which space is available.
And that would make ENOSPC handling even more challenging: what if we
have no space left except partially written stripes?
There are some ideas, like an extra layer for RAID56 that maps logical
addresses to physical addresses, but I'm not yet sure whether we would
run into new (and even more complex) challenges going down that path.
So at this point, reads of the old data are still correct.
But the parity no longer matches, which degrades our ability to tolerate
device loss.
With a write-intent bitmap, we know this full stripe has something out of
sync, so we can re-calculate the parity.
However, all of the above depends on two conditions:
- The new write is CoWed.
It's mandatory for btrfs metadata, so no problem. But for btrfs data,
we can have NODATACOW (which also implies NODATASUM), and in that case,
corruption will be unavoidable.
- The old data should never be changed.
This means the device cannot disappear during the recovery.
If power loss + device missing happens, this will not work at all.
You must either:
1/ have a safe duplicate of the blocks being written, so they can be
recovered and re-written after a crash. This is what journalling
does. Or
Yes, a journal would be the next step, to handle the NODATACOW case and
the device missing case.
2/ Only write to locations which don't contain valid data. i.e. always
write full stripes to locations which are unused on each device.
This way you cannot lose existing data. Worst case: that whole
stripe is ignored. This is how I would handle RAID5 in a
copy-on-write filesystem.
That is something we considered in the past, but considering that even now
we still have space reservation problems sometimes, I suspect such a change
would cause more problems than it solves.
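To make option 2/ and the space concern concrete, here is a minimal in-memory sketch. It assumes a hypothetical FullStripeCow model in which a full stripe counts as "unused" whenever the committed mapping does not reference it; none of these names exist in btrfs or md.

```python
from functools import reduce

def parity(blocks):
    """XOR all data blocks of a full stripe into one parity block."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

class FullStripeCow:
    """Toy model: only ever write whole stripes to unused locations."""

    def __init__(self, nr_stripes):
        self.phys = [None] * nr_stripes  # physical full stripes (None = never written)
        self.map = {}                    # committed metadata: logical -> physical

    def write(self, logical, blocks, crash_before_commit=False):
        # A stripe is free if the committed mapping does not point at it.
        used = set(self.map.values())
        free = [i for i in range(len(self.phys)) if i not in used]
        if not free:
            raise OSError("ENOSPC: no completely unused full stripe")
        target = free[0]
        self.phys[target] = (list(blocks), parity(blocks))
        if crash_before_commit:
            return                       # simulated crash: mapping untouched
        self.map[logical] = target       # commit; the old stripe becomes free

    def read(self, logical):
        return self.phys[self.map[logical]][0]

fs = FullStripeCow(nr_stripes=4)
fs.write("A", [b"old1", b"old2"])
fs.write("A", [b"new1", b"new2"], crash_before_commit=True)
print(fs.read("A"))  # the old stripe is intact; the new one is simply ignored
```

Note that in this model even an overwrite needs one completely free full stripe, which is the kind of ENOSPC pressure mentioned above.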
However, I see you wrote:
Thus as long as no device is missing, a write-intent-bitmap is enough to
address the write hole in btrfs (at least for COW protected data and all
metadata).
That doesn't make sense. If no device is missing, then there is no
write hole.
If no device is missing, all you need to do is recalculate the parity
blocks on any stripe that was recently written.
That's exactly what we need and want to do.
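A rough sketch of that idea, assuming no device is missing: mark the stripe in a write-intent bitmap before touching it, and after a crash recalculate parity only for the marked stripes. ToyRaid5 and its methods are hypothetical names for an in-memory model, not the md-raid or btrfs implementation.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

class ToyRaid5:
    """In-memory 2-data + 1-parity array with a write-intent bitmap."""

    def __init__(self, nr_stripes: int, block: int = 4):
        zero = bytes(block)
        self.stripes = [{"d": [zero, zero], "p": zero} for _ in range(nr_stripes)]
        self.dirty = set()  # stands in for the on-disk write-intent bitmap

    def write(self, stripe_no, idx, data, crash_before_parity=False):
        self.dirty.add(stripe_no)      # 1. record the intent first
        s = self.stripes[stripe_no]
        s["d"][idx] = data             # 2. write the data block
        if crash_before_parity:
            return                     # simulated power loss
        s["p"] = xor(*s["d"])          # 3. write the parity block
        self.dirty.discard(stripe_no)  # 4. clear the intent

    def parity_ok(self, stripe_no):
        s = self.stripes[stripe_no]
        return s["p"] == xor(*s["d"])

    def recover(self):
        """Resync parity only for stripes still marked dirty."""
        resynced = sorted(self.dirty)
        for n in resynced:
            s = self.stripes[n]
            s["p"] = xor(*s["d"])
        self.dirty.clear()
        return resynced

array = ToyRaid5(nr_stripes=1024)
array.write(7, 0, b"OLD1")
array.write(7, 1, b"NEW2", crash_before_parity=True)
ok_before = array.parity_ok(7)  # parity is stale after the crash
resynced = array.recover()      # touches stripe 7 only, not all 1024
```

The point of the bitmap is the last line: recovery work is proportional to the number of recently written stripes, not to the size of the array.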
I guess the goal is to find some files after crash/power loss. Can we
achieve this with file mtime? (Sorry if this is a stupid question...)
There are two problems here:
1. After power loss, we won't see the mtime update at all.
The mtime update is protected by metadata CoW. Since the power loss
happens while the current transaction is not yet committed,
we will only see the old metadata after recovery.
Thus at the next reboot, btrfs cannot see the new mtime at all.
In other words, everything CoWed is updated atomically; we can only see
the last committed transaction.
Although there is something special like the log tree for fsync(), it has
its own limitation (it cannot cross a transaction boundary), so it is
still not suitable for things like random data writes.
2. It will not work for metadata, unless we scrub the whole metadata at
recovery time.
The core problem here is granularity.
The target file can be TiB in size, or the target can be the whole metadata.
Scrubbing such a large range before allowing the user to do any writes can
lead to very unhappy end users.
So for now, as a (kinda) quick solution, I'd like to go with a write-intent
bitmap first, then a journal, just like md-raid.
Thanks,
Qu
Thanks,
Song