Hi Song,
On 24/01/2019 19:29, Song Liu wrote:
>>
>> https://www.spinics.net/lists/raid/msg61331.html
>> https://www.spinics.net/lists/raid/msg60713.html
>> but it's currently not applied upstream in latest v4.20.4
>
> Thanks for bringing this back. We didn't make enough progress because
> Shaohua was very sick back then.
>
> I think the fix would work. I will pick it up from here.
Oooh I'm so sorry... I just realized what happened to Shaohua.
That's so bad :-(((((
May he rest in peace.....
>>
>> 2- Workaround for liar disks
>> [...]
>> There should be an additional pointer to the log (replay_ptr) which
>> stays at least XX MB behind and at least YY seconds (both configurable)
>> behind the last flush to the RAID disks
>
> Could you please be more specific on how the RAID disk lies about
> flush? Does it "claim flush is done, before fully flush volatile cache"?
> If this is the case, I think we really cannot guarantee the data is
> secure, as it is hard to determine the proper XX MB and YY seconds
> value.
Yes, so, this is a fairly well-known problem in storage circles.
Because the hardware review websites always benchmark performance and
never test for consistency problems, HDD and SSD vendors have long had
the oh-so-smart idea of lying about the flush: they complete the flush
command before the data is actually on stable media.
Not all disks do that, but the majority do.
It is difficult to tell exactly what they do, because it is
implementation dependent, varies by brand, and is undisclosed.
Between totally ignoring the flush command and fully honouring it
there is a whole spectrum of behaviours.
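Just to fix terminology: by "the flush" I mean whatever
fsync()/fdatasync() (or the kernel's own cache-flush command) boils
down to at the drive. A trivial userspace illustration of the promise
involved:

/* Illustration only: when fsync() returns, the data is supposed to be
 * on stable media; a "lying" drive acknowledges the flush earlier. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "data that must survive a power loss\n";
    if (write(fd, buf, sizeof(buf) - 1) < 0) { perror("write"); return 1; }

    /* this is "the flush": the durability promise the liar disks break */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}

A lying disk breaks exactly that promise: fsync() has returned, yet a
power cut can still lose or reorder the data.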
People who want to do serious storage have traditionally chosen a
hardware RAID controller with a battery-backed cache: things improve a
lot in that case and the problem is basically fixed. The most likely
interpretation is that disks lie only up to a certain extent and will
not postpone/reorder a write indefinitely: for example they might
perform periodic barriers, or internally convert an incoming flush
request into a barrier so that writes are not reordered across it.
The best-known way to check whether a disk respects flushes is
diskchecker.pl, used by the PostgreSQL people. Look it up and you will
find plenty of discussions about the need to test with it before
trusting a disk for serious storage, in particular for PostgreSQL /
databases (and, I would add, for RAID).
That's why I suggested that you make raid5cache behave like a
battery-backed linear cache: write back linearly starting from the
oldest requests, but keep two pointers: one where you are issuing the
writeback, and another further behind called replay_ptr; log space
beyond this point still cannot be reclaimed. The distance between the
two could be measured in seconds and/or in megabytes: I am still not
sure which of the two is better, probably the stricter of the two
values would have to be used. Both values should be configurable
because the margin needed probably depends on the implementation of
the disks.
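To make the idea concrete, here is a rough standalone sketch (all
names are made up, this is not raid5cache code, just an illustration
of the two pointers and the configurable margins):

/* Illustration only, NOT raid5cache code: the log is written back
 * strictly in order from the tail; space is reclaimed only up to
 * replay_ptr, which trails the writeback pointer by configurable
 * margins in bytes and in seconds. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct cache_log {
    uint64_t head;           /* next append position in the log */
    uint64_t writeback_ptr;  /* everything below this has been written
                                back to the RAID disks and flushed */
    uint64_t replay_ptr;     /* everything below this may be reclaimed */
    time_t   last_writeback; /* when writeback_ptr last advanced */

    /* configurable margins (the "liar disk" workaround) */
    uint64_t margin_bytes;   /* stay at least XX MB behind */
    time_t   margin_secs;    /* and at least YY seconds behind */
};

/* called periodically, or whenever writeback_ptr advances */
static void advance_replay_ptr(struct cache_log *log)
{
    uint64_t candidate;

    /* nothing written back less than margin_secs ago may be reclaimed;
     * this simplified model just waits */
    if (time(NULL) - log->last_writeback < log->margin_secs)
        return;

    /* stay at least margin_bytes behind the writeback pointer */
    candidate = log->writeback_ptr > log->margin_bytes ?
                log->writeback_ptr - log->margin_bytes : 0;

    if (candidate > log->replay_ptr)
        log->replay_ptr = candidate;
}

/* only log space strictly below replay_ptr can be reused */
static bool log_space_reclaimable(const struct cache_log *log, uint64_t pos)
{
    return pos < log->replay_ptr;
}

The point of keeping replay_ptr behind is that, after a power loss,
recovery would replay the log starting from replay_ptr, rewriting
whatever the disks may have silently rolled back.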
I see that the raid5cache code goes to great lengths to write back
full stripes first, but I am not sure why it does that: it doesn't
seem to me that you can reclaim log space out of order, since the log
is a linear space, right? So why not simply write back in order,
starting from the oldest request? An in-order writeback is also
simpler to implement. If it is done out of order, my idea is probably
impossible to implement.
SSDs also lie: there are even known cases of a brand which has
supercapacitors and claims power-loss protection for data at rest
(which would be the flushed data), but in fact still lies about the
flush being performed.
There are also some disks which don't lie and don't have
supercapacitors: it appears that the Samsung "PRO" line (still
consumer, no supercap) doesn't lie (EVO unknown, I didn't find
discussions); if you search for "samsung diskchecker.pl" there are a
few people who tested that one. Most SSDs lie. Others known to respect
flushes are the Intel datacenter models *WITH supercapacitors*.
The disks which lie usually lie in a way that does not corrupt
filesystems in the simplest case, a filesystem on a single disk; maybe
they transform the flush into a barrier internally. In RAID, however,
things are very different: upon power loss, two disks of the set can
"roll back" to different points in time, and at that point the RAIDed
data is screwed, even with raid1, even with the bitmap active (the
bitmap is also read from a single disk and would be wrong), and for
raid5/6 the damage would probably be more serious.
Clearly the user needs non-lying disks at least for the backend device
of raid5cache; such a requirement cannot easily be worked around in
code.
And this would fix only raid5/6: there would still be no workaround
for raid1 and raid10.
>>
>> 3- Write back during idle times
>> It seems to me that with current code the cache will forever stay
>> not-empty even in case of low amounts of writes.
>> The raid5cache apparently does not leverage moments of idleness to
>> writeback (clean) itself completely to the array AFAICS...
>> [...]
>
> This is a good idea. Proactive writing back could also benefit p99 latency
> of random write cases. However, this requires some serious development
> work, which I don't have bandwidth in short term. How about we fix #1 first,
> and see whether #3 is still urgent?
Sure, sure, and thanks a lot.
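For the record, for #3 I was only thinking of something very simple,
along these lines (made-up names again, just a sketch of the idle
check):

/* Illustration only: drain the cache completely once no new writes
 * have arrived for idle_secs. */
#include <stdbool.h>
#include <time.h>

struct cache_idle {
    time_t last_write; /* time of the most recent write into the log */
    time_t idle_secs;  /* configurable idleness threshold */
    bool   log_empty;
};

/* hypothetical hook, polled from a timer or worker thread */
static bool should_drain_log(const struct cache_idle *c)
{
    return !c->log_empty && time(NULL) - c->last_write >= c->idle_secs;
}

When that condition holds, the cache would write everything back (in
order) until the log is empty.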
I also wanted to say that a very common use case, when one has SSDs,
is to put the RAID array (including raid5cache) behind bcache. Bcache
takes care of most of the random I/O, so a linear writeback by
raid5cache is probably acceptable.
Thank you
N.B.