Hello all
(especially Song Liu and Shaohua Li),
there has just been a thread on raid5cache, so I thought I would write.
I have read the raid5cache code a bit and I have a few ideas; call them
bugfix / feature requests.
In order of decreasing importance:
1- Would you fix this?
https://www.spinics.net/lists/raid/msg61331.html
raid5-cache: deeply broken (with write-back?)
Is it fixed by the following patch?
https://www.spinics.net/lists/raid/msg60713.html
If so, it is still not applied upstream as of the latest v4.20.4.
The bug is serious (it leaves the RAID unmountable), all the more so
because the writeback cache can be enormous and partially full, and it
currently does not write back completely during idle times (see point
"Write back during idle times" below).
2- Workaround for liar disks
You know, many disks lie about flush, especially SSDs. This easily
corrupts a RAID array because the various members of the array have a
different idea of the last writes which happened.
Testing for liar disks is very difficult (somewhat feasible with
diskchecker.pl, which the PostgreSQL documentation recommends), and
unfortunately no hardware review website currently does it.
Lying can happen on both the cache disks and the RAID disks.
Lying by the cache disks probably cannot be worked around from here, but
lying by the RAID disks could.
There should be an additional pointer into the log (replay_ptr) which
stays at least XX MB and at least YY seconds (both configurable) behind
the last flush to the RAID disks. The area of the log from replay_ptr to
the last flush should still be considered occupied and must not be
overwritten/reclaimed.
In this way, if there is a power loss, the area from replay_ptr onwards
will be replayed to the RAID disks on recovery, even if the disks had
claimed that those writes were already flushed.
I guess battery backed RAID controllers do something like this, as they
are known to usually fix the liar disks problem.
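
To make the replay_ptr idea concrete, here is a minimal sketch in Python.
It is a toy model, not md code: the class LaggedLog, its fields and
record_flush() are all hypothetical names, and it assumes one reading of
the rule above, namely that the lag in seconds is measured against the
time of the last flush to the RAID disks.

```python
from dataclasses import dataclass, field


@dataclass
class LaggedLog:
    """Toy model of a log whose reclaim point trails the last array flush."""
    min_lag_bytes: int       # the "XX MB" knob, expressed in bytes
    min_lag_seconds: float   # the "YY seconds" knob
    replay_ptr: int = 0      # log offset from which recovery would replay
    flushes: list = field(default_factory=list)  # (log_offset, time) pairs

    def record_flush(self, log_offset: int, now: float) -> None:
        """Note that the log up to log_offset was flushed to the RAID disks."""
        self.flushes.append((log_offset, now))
        last_offset, last_time = self.flushes[-1]
        # Advance replay_ptr to the newest flush point that is far enough
        # behind the last flush in BOTH bytes and seconds.
        for offset, when in self.flushes:
            far_enough = last_offset - offset >= self.min_lag_bytes
            old_enough = last_time - when >= self.min_lag_seconds
            if far_enough and old_enough:
                self.replay_ptr = max(self.replay_ptr, offset)
        # Flush points before replay_ptr are no longer needed.
        self.flushes = [f for f in self.flushes if f[0] >= self.replay_ptr]


log = LaggedLog(min_lag_bytes=100, min_lag_seconds=10.0)
log.record_flush(50, now=0.0)
log.record_flush(200, now=5.0)
log.record_flush(400, now=12.0)   # offset 50 is now 350 bytes and 12 s behind
log.record_flush(450, now=20.0)   # offset 200 is now 250 bytes and 15 s behind
print(log.replay_ptr)             # 200: only the log before 200 may be reclaimed
```

Everything from replay_ptr up to the last flush stays in the log, so a
recovery after power loss can write it to the RAID disks again.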
3- Write back during idle times
It seems to me that with the current code the cache will stay non-empty
forever, even when the amount of writes is low.
As far as I can see, raid5cache does not use idle moments to write
itself back (clean itself) completely to the array.
You might want to leverage those moments, because raid5cache is
apparently not able to coalesce random writes from distant points of the
cache anyway, so there is no point in waiting.
If there are random writes around sector 10000, and then other writes
elsewhere, and then after some time some more random writes around
sector 10000, it seems to me raid5cache is not able to coalesce the two
groups of random writes around sector 10000, so it probably makes sense
to write back the first group of random writes as soon as there is idle
time, no?
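
The policy I have in mind could be sketched like this in Python. Again
this is only a toy model and not md code: IdleWriteback, write() and
tick() are hypothetical names, and "idle" is assumed to mean simply "no
new write for idle_seconds".

```python
class IdleWriteback:
    """Toy cache that writes back everything once writes have paused."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds
        self.dirty = []          # dirty sectors, in arrival order
        self.last_write = None   # time of the most recent write

    def write(self, sector, now):
        """Accept a write into the cache."""
        self.dirty.append(sector)
        self.last_write = now

    def tick(self, now):
        """Called periodically; returns the sectors written back now."""
        if not self.dirty:
            return []
        if now - self.last_write < self.idle_seconds:
            return []            # still busy, keep caching
        flushed, self.dirty = self.dirty, []
        return flushed


wb = IdleWriteback(idle_seconds=5.0)
wb.write(10000, now=0.0)
wb.write(10001, now=1.0)
print(wb.tick(now=3.0))   # []: a write arrived only 2 s ago
print(wb.tick(now=6.0))   # [10000, 10001]: idle for 5 s, cache is cleaned
wb.write(10000, now=20.0)
print(wb.tick(now=26.0))  # [10000]: the later group is written back on its own
```

Since the two groups around sector 10000 would not be merged anyway,
cleaning the first one during the pause costs nothing and leaves the
cache empty.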
The current behaviour makes the loss of the cache disks much worse. I
know that case is normally regarded as catastrophic, but it could be
"less catastrophic" if the cache were mostly clean, and it can happen
even due to a software bug, such as point #1 above.
Thanks for your work
N.B.