Bugfix / feature requests for raid5cache (writeback)

"Nik.Brt." <nik.brt@xxxxxxxxxxxxx> · Thu, 24 Jan 2019 07:49:46 +0100

Hello all
(especially Song Liu and Shaohua Li),
there has just been a thread on raid5cache, so I thought I would write.
I have read some of the raid5cache code and have a few ideas (bugfix /
feature requests, let's say).
In order of decreasing importance:

1- Would you fix this?
https://www.spinics.net/lists/raid/msg61331.html
raid5-cache: deeply broken (with write-back?)
Is it fixed by the following patch?
https://www.spinics.net/lists/raid/msg60713.html
That patch is currently not applied upstream in the latest v4.20.4.
The bug is serious (RAID unmountable), all the more so because the
writeback cache can be enormous and partially full, and currently it is
not written back completely during idle times (see point 3, "Write back
during idle times", below).

2- Workaround for liar disks
You know, many disks lie about flush, especially SSDs. This easily
corrupts a RAID array, because the members of the array end up with
different ideas of which writes happened last.
Testing for liar disks is very difficult (somewhat feasible with
diskchecker.pl, which the PostgreSQL documentation mentions), and
unfortunately no hardware review website currently does that.
Lying can happen on both the cache disks and the RAID disks.
Lying by the cache disks probably cannot be worked around from here, but 
lying by the RAID disks could.
There should be an additional log pointer (replay_ptr) which stays at
least XX MB and at least YY seconds (both configurable) behind the last
flush to the RAID disks. The area of the log from replay_ptr to the last
flush should still be considered occupied and must not be
overwritten/reclaimed.
In this way, if there is a power loss, the area from replay_ptr onwards
will be replayed to the RAID disks again, even though those writes were
already reported as flushed.
I guess battery backed RAID controllers do something like this, as they 
are known to usually fix the liar disks problem.
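
To make the replay_ptr idea concrete, here is a minimal userspace sketch
of one possible reading of it. It is not raid5-cache code: the class
name, the method name and the two lag parameters are hypothetical, log
positions are treated as ever-increasing byte offsets (ring wraparound
is ignored), and both lags are measured against the most recent flush.

```python
from collections import deque


class ReplayPointer:
    """Hypothetical model of the proposed replay_ptr (not kernel code)."""

    def __init__(self, min_lag_bytes, min_lag_seconds):
        self.min_lag_bytes = min_lag_bytes      # the "XX MB" knob, in bytes
        self.min_lag_seconds = min_lag_seconds  # the "YY seconds" knob
        self.replay_ptr = 0                     # log space before this may be reclaimed
        self._flushes = deque()                 # (log_pos, time) of flushes not yet passed

    def record_flush(self, log_pos, when):
        """Note a flush to the RAID disks covering the log up to log_pos."""
        self._flushes.append((log_pos, when))
        last_pos, last_when = self._flushes[-1]
        # Advance replay_ptr over every older flush that is far enough
        # behind the latest one in BOTH bytes and seconds.
        while len(self._flushes) > 1:
            old_pos, old_when = self._flushes[0]
            far_enough = last_pos - old_pos >= self.min_lag_bytes
            old_enough = last_when - old_when >= self.min_lag_seconds
            if not (far_enough and old_enough):
                break
            self.replay_ptr = old_pos
            self._flushes.popleft()
        return self.replay_ptr


rp = ReplayPointer(min_lag_bytes=100, min_lag_seconds=5.0)
rp.record_flush(50, when=0.0)
rp.record_flush(120, when=3.0)
rp.record_flush(200, when=6.0)   # the flush at 50 is now 150 bytes / 6 s behind
print(rp.replay_ptr)
```

In this model reclaim would be allowed to free only log space before
replay_ptr, and recovery after a power loss would replay from replay_ptr
rather than from the last flush.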

3- Write back during idle times
It seems to me that with the current code the cache stays non-empty
forever, even when the write load is low.
As far as I can see, raid5cache does not use idle periods to write
itself back (clean itself) completely to the array.
You might want to use those periods, because raid5cache is apparently
not able to coalesce random writes from distant points of the cache
anyway, so there is no point in waiting.
If there are random writes around sector 10000, and then other writes 
elsewhere, and then after some time some more random writes around 
sector 10000, it seems to me raid5cache is not able to coalesce the two 
groups of random writes around sector 10000, so it probably makes sense 
to write back the first group of random writes as soon as there is idle 
time, no?
The current situation greatly worsens the case of a lost cache disk,
which I know is normally regarded as catastrophic but could be made
"less catastrophic", and which can happen even due to a software bug,
such as point 1 above.
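
As a rough illustration of the policy I have in mind, here is a small
sketch of the decision only. Everything in it is hypothetical (function
name, parameters, default thresholds), and the space-pressure check just
stands in for whatever reclaim trigger the current code uses; I am not
claiming it works this way today.

```python
def should_write_back(dirty_bytes, log_used_bytes, log_size_bytes,
                      idle_seconds, idle_threshold_seconds=5.0,
                      high_watermark=0.75):
    """Decide whether to write cached data back to the RAID disks now."""
    if dirty_bytes == 0:
        return False  # cache already clean, nothing to do
    if log_used_bytes >= high_watermark * log_size_bytes:
        return True   # space pressure: reclaim regardless of load
    # Proposed addition: with no recent I/O, clean the cache completely
    # instead of leaving dirty data in the log indefinitely.
    return idle_seconds >= idle_threshold_seconds


# Lightly used cache, array idle for 10 s: write back now.
print(should_write_back(dirty_bytes=4096, log_used_bytes=1 << 20,
                        log_size_bytes=1 << 30, idle_seconds=10.0))
```

With such a rule a cache that sees only occasional writes would become
clean a few seconds after the last write, so losing the cache disk at
that point would lose nothing.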

Thanks for your work
N.B.