Hi Song,
On 24/01/2019 19:29, Song Liu wrote:
>>
>> https://www.spinics.net/lists/raid/msg61331.html
>> https://www.spinics.net/lists/raid/msg60713.html
>> but it's currently not applied upstream in latest v4.20.4
>
> Thanks for bringing this back. We didn't make enough progress because
> Shaohua was very sick back then.
>
> I think the fix would work. I will pick it up from here.
Oooh I'm so sorry... I just realized what happened to Shaohua.
That's so bad :-(((((
May he rest in peace.....
>>
>> 2- Workaround for liar disks
>> [...]
>> There should be an additional pointer to the log (replay_ptr) which
>> stays at least XX MB behind and at least YY seconds (both configurable)
>> behind the last flush to the RAID disks
>
> Could you please be more specific on how the RAID disk lies about
> flush? Does it "claim flush is done, before fully flush volatile cache"?
> If this is the case, I think we really cannot guarantee the data is
> secure, as it is hard to determine the proper XX MB and YY seconds
> value.
Yes, so, this is a fairly well-known problem in storage circles.
Because the hardware review websites always benchmark performance and
never test for consistency problems, HDD and SSD vendors have long had
the oh-so-smart idea of lying about the flush: they complete the flush
command before the data is actually on stable media.
Not all disks do that, but the majority do.
It is difficult to tell exactly what they do, because it is
implementation dependent, varies by brand, and is undisclosed.
Between totally ignoring the flush command and fully honouring it
there is a whole spectrum of behaviours.
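Just to fix terminology: by "the flush" I mean whatever
fsync()/fdatasync() (or the kernel's own cache-flush command) boils
down to at the drive. A trivial userspace illustration of the promise
involved:

/* Illustration only: when fsync() returns, the data is supposed to be
 * on stable media; a "lying" drive acknowledges the flush earlier. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "data that must survive a power loss\n";
    if (write(fd, buf, sizeof(buf) - 1) < 0) { perror("write"); return 1; }

    /* this is "the flush": the durability promise the liar disks break */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}

A lying disk breaks exactly that promise: fsync() has returned, yet a
power cut can still lose or reorder the data.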
People who want to do serious storage have traditionally chosen a
hardware RAID controller with a battery-backed cache: things improve a
lot in that case and the problem is basically fixed. The most likely
interpretation is that disks lie only up to a certain extent and will
not postpone/reorder a write indefinitely: for example they might
perform periodic barriers, or internally convert an incoming flush
request into a barrier so that writes are not reordered across it.
The best-known way to check whether a disk respects flushes is
diskchecker.pl, used by the PostgreSQL people. Look it up and you will
find plenty of discussions about the need to test with it before
trusting a disk for serious storage, in particular for PostgreSQL /
databases (and, I would add, for RAID).
That's why I suggested that you make raid5cache behave like a
battery-backed linear cache: write back linearly starting from the
oldest requests, but keep two pointers: one where you are issuing the
writeback, and another further behind called replay_ptr; log space
beyond this point still cannot be reclaimed. The distance between the
two could be measured in seconds and/or in megabytes: I am still not
sure which of the two is better, probably the stricter of the two
values would have to be used. Both values should be configurable
because the margin needed probably depends on the implementation of
the disks.
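To make the idea concrete, here is a rough standalone sketch (all
names are made up, this is not raid5cache code, just an illustration
of the two pointers and the configurable margins):

/* Illustration only, NOT raid5cache code: the log is written back
 * strictly in order from the tail; space is reclaimed only up to
 * replay_ptr, which trails the writeback pointer by configurable
 * margins in bytes and in seconds. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct cache_log {
    uint64_t head;           /* next append position in the log */
    uint64_t writeback_ptr;  /* everything below this has been written
                                back to the RAID disks and flushed */
    uint64_t replay_ptr;     /* everything below this may be reclaimed */
    time_t   last_writeback; /* when writeback_ptr last advanced */

    /* configurable margins (the "liar disk" workaround) */
    uint64_t margin_bytes;   /* stay at least XX MB behind */
    time_t   margin_secs;    /* and at least YY seconds behind */
};

/* called periodically, or whenever writeback_ptr advances */
static void advance_replay_ptr(struct cache_log *log)
{
    uint64_t candidate;

    /* nothing written back less than margin_secs ago may be reclaimed;
     * this simplified model just waits */
    if (time(NULL) - log->last_writeback < log->margin_secs)
        return;

    /* stay at least margin_bytes behind the writeback pointer */
    candidate = log->writeback_ptr > log->margin_bytes ?
                log->writeback_ptr - log->margin_bytes : 0;

    if (candidate > log->replay_ptr)
        log->replay_ptr = candidate;
}

/* only log space strictly below replay_ptr can be reused */
static bool log_space_reclaimable(const struct cache_log *log, uint64_t pos)
{
    return pos < log->replay_ptr;
}

The point of keeping replay_ptr behind is that, after a power loss,
recovery would replay the log starting from replay_ptr, rewriting
whatever the disks may have silently rolled back.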
I see that the raid5cache code goes to great lengths to write back
full stripes first, but I am not sure why it does that: it doesn't
seem to me that you can reclaim log space out of order, since the log
is a linear space, right? So why not simply write back in order,
starting from the oldest request? An in-order writeback is also
simpler to implement. If it is done out of order, my idea is probably
impossible to implement.
SSDs also lie: there are even known cases of a brand which has
supercapacitors and claims power-loss protection for data at rest
(which would be the flushed data), but in fact still lies about the
flush being performed.
There are also some disks which don't lie and don't have
supercapacitors: it appears that the Samsung "PRO" line (still
consumer, no supercap) doesn't lie (EVO unknown, I didn't find
discussions); if you search for "samsung diskchecker.pl" there are a
few people who tested that one. Most SSDs lie. Others known to respect
flushes are the Intel datacenter models *WITH supercapacitors*.
The disks which lie usually lie in a way that does not corrupt
filesystems in the simplest case, a filesystem on a single disk; maybe
they transform the flush into a barrier internally. In RAID, however,
things are very different: upon power loss, two disks of the set can
"roll back" to different points in time, and at that point the RAIDed
data is screwed, even with raid1, even with the bitmap active (the
bitmap is also read from a single disk and would be wrong), and for
raid5/6 the damage would probably be more serious.
Clearly the user needs non-lying disks at least for the backend device
of raid5cache; such a requirement cannot easily be worked around in
code.
And this would fix only raid5/6: there would still be no workaround
for raid1 and raid10.
>>
>> 3- Write back during idle times
>> It seems to me that with current code the cache will forever stay
>> not-empty even in case of low amounts of writes.
>> The raid5cache apparently does not leverage moments of idleness to
>> writeback (clean) itself completely to the array AFAICS...
>> [...]
>
> This is a good idea. Proactive writing back could also benefit p99 latency
> of random write cases. However, this requires some serious development
> work, which I don't have bandwidth in short term. How about we fix #1 first,
> and see whether #3 is still urgent?
Sure, sure, and thanks a lot.
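For the record, for #3 I was only thinking of something very simple,
along these lines (made-up names again, just a sketch of the idle
check):

/* Illustration only: drain the cache completely once no new writes
 * have arrived for idle_secs. */
#include <stdbool.h>
#include <time.h>

struct cache_idle {
    time_t last_write; /* time of the most recent write into the log */
    time_t idle_secs;  /* configurable idleness threshold */
    bool   log_empty;
};

/* hypothetical hook, polled from a timer or worker thread */
static bool should_drain_log(const struct cache_idle *c)
{
    return !c->log_empty && time(NULL) - c->last_write >= c->idle_secs;
}

When that condition holds, the cache would write everything back (in
order) until the log is empty.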
I also wanted to say that a very common use case, when one has SSDs,
is to put the RAID array (including raid5cache) behind bcache. Bcache
takes care of most of the random I/O, so a linear writeback by
raid5cache is probably acceptable.
Thank you
N.B.