Hi Nik,

On Wed, Jan 23, 2019 at 10:58 PM Nik.Brt. <nik.brt@xxxxxxxxxxxxx> wrote:
>
> Hello all
> (especially Song Liu and Shaohua Li),
> there has just been a thread on raid5cache so I thought about writing.
> I have read the code of raid5cache a bit and I would have a few ideas,
> bugfix / feature requests let's say.
> In order of decreasing importance:
>
> 1- Would you fix this?
> https://www.spinics.net/lists/raid/msg61331.html
> raid5-cache: deeply broken (with write-back?)
> Is it fixed by the following patch?
> https://www.spinics.net/lists/raid/msg60713.html
> but it's currently not applied upstream in latest v4.20.4
> The bug is serious (raid unmountable) also because the writeback cache
> can be enormous and partially full and currently does not write back
> completely during idle times (see point "Write back during idle times"
> below)

Thanks for bringing this back. We didn't make enough progress because
Shaohua was very sick back then. I think the fix would work. I will pick
it up from here.

>
>
> 2- Workaround for liar disks
> You know, many disks lie about flush, especially SSDs. This easily
> corrupts a RAID array because the various members of the array have a
> different idea of the last writes which happened.
> Testing for liar disks is very difficult (somewhat feasible with
> diskchecker.pl from Postgresql) and no hardware review website currently
> does that unfortunately.
> Lying can happen on both the cache disks and the RAID disks.
> Lying by the cache disks probably cannot be worked around from here, but
> lying by the RAID disks could.
> There should be an additional pointer to the log (replay_ptr) which
> stays at least XX MB behind and at least YY seconds (both configurable)
> behind the last flush to the RAID disks, and such area of the log, from
> replay_ptr to the last flush, should still be considered occupied and
> cannot be overwritten/reclaimed.
> In this way, if there is a power loss, the area from the replay_ptr
> onwards will eventually be replayed to the RAID disks.
> I guess battery backed RAID controllers do something like this, as they
> are known to usually fix the liar disks problem.

Could you please be more specific about how the RAID disk lies about
flush? Does it "claim the flush is done before fully flushing its
volatile cache"? If that is the case, I think we really cannot guarantee
the data is secure, as it is hard to determine proper values for XX MB
and YY seconds.

>
>
> 3- Write back during idle times
> It seems to me that with current code the cache will forever stay
> not-empty even in case of low amounts of writes.
> The raid5cache apparently does not leverage moments of idleness to
> writeback (clean) itself completely to the array AFAICS...
> You might want to leverage those moments, because raid5cache is
> apparently not able to coalesce random writes from distant points of the
> cache anyway, so there is no point in waiting.
> If there are random writes around sector 10000, and then other writes
> elsewhere, and then after some time some more random writes around
> sector 10000, it seems to me raid5cache is not able to coalesce the two
> groups of random writes around sector 10000, so it probably makes sense
> to write back the first group of random writes as soon as there is idle
> time, no?
> The current situation greatly worsens the case of cache disks lost,
> which I know is normally regarded as catastrophic, but could be "less
> catastrophic" anyway, and can happen even due to a software bug, such as
> point #1 above.

This is a good idea. Proactive writing back could also benefit p99
latency of random write cases. However, this requires some serious
development work, for which I don't have bandwidth in the short term.
How about we fix #1 first, and see whether #3 is still urgent?

Thanks,
Song

>
>
> Thanks for your work
> N.B.
>
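
For reference, the replay_ptr rule proposed in point 2 could be sketched
roughly as below. This is only an illustration of the idea, not raid5-cache
code: FlushedEntry, replay_ptr(), min_lag_bytes and min_lag_seconds are
hypothetical names standing in for the "XX MB" and "YY seconds" knobs, the
log is modelled as a flat (non-wrapping) range of byte offsets, and "YY
seconds behind" is read as "relative to the time of the last flush", which
is one possible interpretation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FlushedEntry:
    """A log region already written back to the RAID disks (hypothetical model)."""
    end_offset: int    # log offset just past this entry, in bytes
    flushed_at: float  # time the RAID disks acknowledged the flush, in seconds


def replay_ptr(entries: List[FlushedEntry], min_lag_bytes: int,
               min_lag_seconds: float) -> int:
    """Return the log offset below which space may be reclaimed.

    `entries` must be ordered oldest first, with non-decreasing end_offset
    and flushed_at. Everything from the returned offset up to the last
    flush stays occupied, so it can still be replayed after a power loss.
    """
    if not entries:
        return 0
    last = entries[-1]
    ptr = 0
    for entry in entries:
        far_enough = last.end_offset - entry.end_offset >= min_lag_bytes
        old_enough = last.flushed_at - entry.flushed_at >= min_lag_seconds
        if far_enough and old_enough:
            ptr = entry.end_offset  # both lags satisfied: safe to advance
        else:
            break  # later entries are even closer to the last flush
    return ptr


# Example: four flushed regions, with lags of 150 bytes and 4 seconds.
log = [FlushedEntry(100, 0.0), FlushedEntry(200, 5.0),
       FlushedEntry(300, 9.0), FlushedEntry(400, 10.0)]
reclaimable_up_to = replay_ptr(log, min_lag_bytes=150, min_lag_seconds=4.0)
```

In this model the pointer only advances past an entry when both lags hold,
so raising either knob keeps more of the log pinned. As noted in the reply
above, choosing values that are actually safe for a given disk is the hard
part, and the sketch does not address that.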