Hi Wols

On Tue, Nov 24, 2020 at 10:10:32AM +0000, Wols Lists wrote:
> On 24/11/20 07:20, Mukund Sivaraman wrote:
> > Hi all
> >
> > I am trying to set up a MD RAID-6 array and use the ext4 filesystem
> > in ordered mode (default) on it. The data gets backed up
> > periodically. I want the array to be always available.
> >
> > I prefer not using a write-journal if it is sufficient for my usage.
> > I want to use the write-intent bitmap only. AIUI the write-hole
> > problem occurs when there is a crash or abrupt power off *and* disk
> > failures.
>
> No, I don't think so. I'm not sure, but aiui, there is a critical point
> where the data is partially saved to disk, and should a power failure
> occur at that precise point you have a stripe incompletely saved, and
> therefore corrupt. This is why you need a log to fix it ...

I appreciate that you took time to reply. Thank you.

I am also in the "not sure" group, and we may be well served by an
authoritative answer from someone who is familiar with the code. I also
didn't follow whether you are saying there is a write hole or not. The
answer may be implementation specific too, so I am looking for an answer
from someone who knows the code.

The following may be incorrect as I am a RAID layperson, but AIUI:

(a) With RAID-5, assuming there are 4 member disks A, B, C, D, a write
operation with its data on disk A and the stripe's parity on disk B may
involve:

    1. a read of the stripe
    2. update of data on A
    3. computation and update of parity A^C^D on B

These are not atomic updates. If power is lost between steps 2 and 3,
upon recovery the mismatch between data and parity for the stripe would
be found and the parity can be updated on B. The data chunk written to A
may be incomplete if power is lost during step 2, but the ext4 journal
would return the FS to a consistent state. Moreover, there should not be
any modification/corruption of data in the stripe on disks C and D
(assuming the disks are OK).

(b) With RAID-6, assuming there are 5 member disks A, B, C, D, E, a
write operation with its data on disk A and the stripe's parity on disks
B(p) and C(q) would involve:

    1. a read of the stripe
    2. update of data on A
    3. computation and update of parity on B(p)
    4. update of parity on C(q)

These are not atomic updates. If power is lost between steps 2 and 3,
upon recovery the mismatch of data on A would be found and the data
chunk can be updated on A. The data chunk written to A may be incomplete
if power is lost during step 2, but the ext4 journal would return the FS
to a consistent state. If power is lost between steps 3 and 4, upon
recovery the mismatch of parity between B(p) and C(q) would be found and
the parity can be updated on B(p) and C(q). Mainly, there should not be
any modification/corruption of data in the stripe on disks D and E
(assuming the disks are OK).

The above may be incorrect, so please indicate what happens, and if
there is a write hole, why there is one. We don't mind if data in files
being written to at the time of power loss is partially written; that
can happen with any abrupt power loss. The concern is whether other,
unrelated parts of the filesystem not tracked by the filesystem's
journal get corrupted because other data chunks of a stripe are updated
during recovery.
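To make that concern concrete, here is a toy sketch of the RAID-5 case
in (a) as I picture it (Python, with single-byte "chunks" and made-up
values; this is only my mental model, not the md code):

    # Toy model of (a): data chunks on A, C, D; parity chunk on B.
    # Chunks are single ints here; values and names are made up.

    def xor(*chunks):
        out = 0
        for c in chunks:
            out ^= c
        return out

    # Consistent stripe before the write.
    a, c, d = 0x11, 0x22, 0x33
    b = xor(a, c, d)                    # parity on B

    # Step 2: new data lands on A; power is lost before step 3,
    # so B still holds the old parity.
    a_new = 0x44
    b_stale = b

    # Case 1: no disk failure.  The mismatch is detectable and the
    # parity can be recomputed from what is actually on disk; the
    # chunks on C and D are untouched.
    assert xor(a_new, c, d) != b_stale
    b_fixed = xor(a_new, c, d)          # parity rewritten on B

    # Case 2: disk C fails before the parity is fixed (write hole).
    # C must be reconstructed from A, B, D, but B is stale, so the
    # result is wrong even though C was never written to.
    c_reconstructed = xor(a_new, b_stale, d)
    assert c_reconstructed != c

If that picture is roughly right, then with no disk failure the stale
parity can simply be recomputed during resync, and C's chunk is only at
risk when a disk is also lost before the resync completes - which is
what the questions below are trying to confirm.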
> > * After a crash or abrupt power off, the write-intent bitmap is used
> >   to rewrite parity where necessary. If there is no disk failure
> >   during this period, is the RAID-6 array guaranteed to recover
> >   without corruption?
> >
> >   With RAID-6, will recovery with write-intent bitmap succeed with 1
> >   disk failure during the recovery period without a write-journal?
> >   i.e., is there a possibility of a write hole with 1 disk failure in
> >   a RAID-6 array?
> >
> > * With RAID-6 with write-intent bitmap in use, ext4 in ordered mode,
> >   no disk failures, and abrupt power loss, is there any chance of
> >   data loss in files other than those being written to just before
> >   the power loss?
>
> Probably. Sod's law, you will have other files on the same stripe and
> things could go wrong ... Plus I believe some file systems (including
> ext4?) store small files in the directory, not as their own i-node, so
> there's a whole bunch of other complications possible, plus if you
> corrupt the directory ...
>
> > (Apologies if these are silly questions, but I request answers.)
>
> RULE 0: RAID IS NO SUBSTITUTE FOR BACKUPS.

The data is backed up periodically.

> And if you don't want to lose live data as it is being updated, you
> need a journal. Run the correct horse for the course :-)

It is important that the array is available and not in a failed state or
with a corrupted FS due to a power loss. We would also like to avoid
having to restore from backups as much as possible.

There is a power outage about once a week. The system is powered via an
inverter (a lead-acid battery backed UPS) which switches from mains to
battery power when mains power is lost, within a few tens of
milliseconds, which the server's power supply tolerates. Rarely, the
switchover takes longer, or there is a dip, and the server powers off.
So consider that power outages are somewhat common here, and the array
should survive them without extra work for us, regardless of backups.

The write-journal is a relatively new addition to MD and I feel
conservative about using it for now. I have come across failures
reported on the lists[1], it is not clear if others are using it in
production, and some things, such as how to remove the write journal
from an array, are not documented (a sequence of steps was mentioned in
the commit log of a patch that introduced support[2], but a step was
missing from it, as pointed out in a different mailing list post[3]).
Please don't take these things as criticism - it is just that the
feature appears to be relatively new. Adding an NVMe SSD to hold the
write journal would add another component to the mix, which I want to
avoid. However, if an authoritative answer indicates the write journal
is required in our case and the implementation is mature, we will try
to adopt it.

[1] https://www.spinics.net/lists/raid/msg62646.html
[2] https://marc.info/?l=linux-raid&m=149063896208043
[3] https://www.spinics.net/lists/raid/msg59940.html

		Mukund
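P.S. In the same spirit, here is a toy sketch of the RAID-6 case in (b)
combined with the one-disk-failure question above (Python again; the
GF(2^8) multiply follows the usual RAID-6 Q-parity arithmetic with
polynomial 0x11d and generator 2, but the example is only my mental
model, not the md code):

    # Minimal GF(2^8) multiply, as used for the RAID-6 Q parity.
    # Layout as in (b): data chunks on A, D, E; P on B; Q on C.

    def gf_mul(x, y):
        r = 0
        while y:
            if y & 1:
                r ^= x
            hi = x & 0x80
            x = (x << 1) & 0xff
            if hi:
                x ^= 0x1d           # reduce modulo x^8+x^4+x^3+x^2+1
            y >>= 1
        return r

    def p_of(d0, d1, d2):
        return d0 ^ d1 ^ d2

    def q_of(d0, d1, d2):
        # coefficients g^0, g^1, g^2 with generator g = 2
        return d0 ^ gf_mul(2, d1) ^ gf_mul(4, d2)

    d_a, d_d, d_e = 0x11, 0x22, 0x33    # data chunks on A, D, E
    p_b = p_of(d_a, d_d, d_e)           # P parity on B
    q_c = q_of(d_a, d_d, d_e)           # Q parity on C

    d_a_new = 0x44                      # step 2 of (b): A updated
    # Power is lost before steps 3 and 4: p_b and q_c are both stale.

    # Disk D fails during the recovery window; its chunk has to be
    # reconstructed from A (new data), B (stale P) and E:
    d_d_guess = d_a_new ^ p_b ^ d_e
    assert d_d_guess != d_d        # wrong, although D was never written

    # The stale Q at least exposes the inconsistency ...
    assert q_of(d_a_new, d_d_guess, d_e) != q_c
    # ... but with D missing and both parities stale there is, as far
    # as I can tell, no way to decide which blocks to trust.

If that is what actually happens with only a write-intent bitmap, then
the answer to the one-disk-failure question above would seem to be "no",
and I would like to know that for certain.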