Re: Bugfix / feature requests for raid5cache (writeback)

Hi Song,

On 24/01/2019 19:29, Song Liu wrote:
>>
>> https://www.spinics.net/lists/raid/msg61331.html
>> https://www.spinics.net/lists/raid/msg60713.html
>> but it's currently not applied upstream in latest v4.20.4
>
> Thanks for bringing this back. We didn't make enough progress because
> Shaohua was very sick back then.
>
> I think the fix would work. I will pick it up from here.

Oooh I'm so sorry... I just realized what happened to Shaohua.
That's so bad :-(((((
May he rest in peace.....

>>
>> 2- Workaround for liar disks
>> [...]
>> There should be an additional pointer to the log (replay_ptr) which
>> stays at least XX MB behind and at least YY seconds (both configurable)
>> behind the last flush to the RAID disks
>
> Could you please be more specific on how the RAID disk lies about
> flush? Does it "claim flush is done, before fully flush volatile cache"?
> If this is the case, I think we really cannot guarantee the data is
> secure, as it is hard to determine the proper XX MB and YY seconds
> value.

Yes, so, this is a "kinda known" problem in storage circles.
Because the hardware review websites always test performance and never test for consistency problems, HDD and SSD brands have always had this oh-so-smart idea of lying about the flush being performed.
They complete the flush command before the data is actually on stable media.

Not all disks do that, but the majority do.
It is difficult to tell exactly what they do, because it is implementation dependent, it varies by brand, and it is undisclosed. There are many possibilities between totally ignoring the flush command and totally respecting it.

People who want to do serious storage have traditionally chosen a hardware RAID controller with a battery-backed cache: things seem to improve a lot in that case and the problems are basically fixed. The most likely interpretation is that disks lie only up to a certain extent, and will not postpone/reorder a write indefinitely.

For example they might perform periodic barriers, or convert an incoming flush request into a barrier so that writes are not reordered across it.

The best-known method to check whether the disks respect the flush is diskchecker.pl by the postgresql people. Look it up on the internet and you will find tons of discussions about the need to test with it for serious storage, in particular for postgresql / databases (and, I would add, also for RAID).
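
Just to make the idea concrete, below is a minimal sketch in C of the kind of test diskchecker.pl performs; it is NOT diskchecker.pl itself, and the file name, record layout and reporting are invented for illustration. You run something like this against the device under test, pull the plug mid-run, and then check that every record reported as durable is actually intact on the disk:

/* flushtest.c -- a minimal sketch of the idea behind such a test, NOT
 * diskchecker.pl itself.  Write small records, fsync() after each one,
 * and report the last record the kernel claims is on stable media.
 * diskchecker.pl sends that report to a second machine so it survives
 * the power cut; printing it locally is just the cheap stand-in here.
 * After pulling the plug mid-run, every reported record must be found
 * intact on the disk -- if it is not, the drive acknowledged a flush
 * it had not actually performed. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct record {
	uint64_t seq;		/* monotonically increasing sequence number */
	char     pad[504];	/* pad to 512 bytes: one sector per record  */
};

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <testfile on the device under test>\n",
			argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct record rec;
	memset(&rec, 0, sizeof(rec));

	for (uint64_t seq = 1; ; seq++) {
		rec.seq = seq;
		off_t off = (off_t)(seq % 100000) * sizeof(rec);
		if (pwrite(fd, &rec, sizeof(rec), off) != (ssize_t)sizeof(rec)) {
			perror("pwrite");
			return 1;
		}
		/* fsync() must not return before the data has reached stable
		 * media (including the drive's volatile cache being flushed).
		 * A lying disk returns early here. */
		if (fsync(fd) != 0) {
			perror("fsync");
			return 1;
		}
		/* Report the sequence number the kernel claims is durable. */
		printf("%" PRIu64 "\n", seq);
		fflush(stdout);
	}
}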

That's why I suggested that you make raid5cache behave like a battery-backed linear cache: you write back linearly starting from the oldest requests, but you keep two pointers: one where you are issuing the writeback, and another further behind called replay_ptr; log space ahead of this point still cannot be reclaimed. The distance between the two could be measured in seconds and/or in megabytes: I am still not sure which of the two is better, probably the more conservative of the two values would have to be used. The two values should be configurable, because the value needed probably depends on the implementation of the disks.
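
To make this concrete, here is a rough sketch in C of the reclaim rule I have in mind; it is not the existing md/raid5-cache code, and all names (sketch_log, flush_mark, lag_bytes, lag_seconds, ...) are invented:

/* Sketch of the proposed reclaim rule, not the real md/raid5-cache code.
 * Positions are byte offsets into the (conceptually linear) log. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct flush_mark {		/* "writeback+flush reached 'pos' at 'when'" */
	uint64_t pos;
	time_t   when;
};

struct sketch_log {
	uint64_t wb_pos;	/* writeback pointer: everything before this has
				   been written back and flushed to the RAID
				   member disks                                */
	uint64_t replay_pos;	/* reclaim pointer: only log space before this
				   may be reused; data between replay_pos and
				   wb_pos is kept so it can be replayed after
				   a power loss                                */
	uint64_t lag_bytes;	/* configurable: stay at least this far behind  */
	time_t   lag_seconds;	/* configurable: and at least this long behind  */
};

/* replay_pos may advance to the newest flush mark that is both old enough
 * and far enough behind the current writeback position.  Everything a
 * "liar" disk might still roll back stays in the log and can be replayed. */
static void advance_replay_pos(struct sketch_log *log,
			       const struct flush_mark *marks, int nmarks)
{
	time_t now = time(NULL);

	for (int i = 0; i < nmarks; i++) {
		bool old_enough = (now - marks[i].when) >= log->lag_seconds;
		bool far_enough = (log->wb_pos - marks[i].pos) >= log->lag_bytes;

		if (old_enough && far_enough && marks[i].pos > log->replay_pos)
			log->replay_pos = marks[i].pos;
	}
}

int main(void)
{
	/* Toy example: writeback has reached 900 MB; flushes completed at
	 * 100 MB (an hour ago) and 850 MB (just now); keep >= 128 MB and
	 * >= 60 s of history.  Only the first mark qualifies. */
	struct sketch_log log = {
		.wb_pos      = 900ull << 20,
		.replay_pos  = 0,
		.lag_bytes   = 128ull << 20,
		.lag_seconds = 60,
	};
	struct flush_mark marks[] = {
		{ .pos = 100ull << 20, .when = time(NULL) - 3600 },
		{ .pos = 850ull << 20, .when = time(NULL) },
	};

	advance_replay_pos(&log, marks, 2);
	printf("log space below %llu MB can be reclaimed\n",
	       (unsigned long long)(log.replay_pos >> 20));
	return 0;
}

The reclaim path would then reuse only the log space below replay_pos, and recovery after a power loss replays everything from replay_pos onward, which covers whatever the lying disks may have rolled back.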

I see that the code of raid5cache goes to great lengths to write back the full stripes first, but I'm not sure why it does that, because it doesn't seem to me that you can reclaim log space out of order; that's a linear space, right? So why not simply flush in order, starting from the oldest request? An in-order writeback is also simpler to implement. If you do it out of order, my idea is probably impossible to implement.

SSDs also lie: there are even known cases of a brand which has supercapacitors and claims power-loss data protection for data at rest (which would be the flushed data), but which in fact still lies about the flush being performed.

There are also some disks which don't lie and don't have supercapacitors: it appears that the Samsung "PRO" line (still consumer, no supercap) doesn't lie (EVO unknown, I didn't find discussions); if you search for "samsung diskchecker.pl" there are a few people who tested that one. Most SSDs lie. Others known to respect flushes are the Intel datacenter models *WITH supercapacitors*.

The disks which lie usually lie in a way that does not corrupt filesystems in the simplest case: a filesystem on a single disk. Maybe they transform a flush into a barrier internally. In RAID, however, things are very different, because upon power loss two disks of the set can "roll back" to different points in time, and at that point the RAIDed data is screwed, even with raid1, even with the bitmap active (the bitmap is also read from a single disk and would be wrong), and for raid5/6 the damage would probably be more serious.

Clearly the user needs non-lying disks at least for the backend device for raid5cache. Such a requirement cannot easily be worked around in code.

And this would fix only raid5/6: there would still be no workaround for raid1 and raid10.

>>
>> 3- Write back during idle times
>> It seems to me that with current code the cache will forever stay
>> not-empty even in case of low amounts of writes.
>> The raid5cache apparently does not leverage moments of idleness to
>> writeback (clean) itself completely to the array AFAICS...
>> [...]
>
> This is a good idea. Proactive writing back could also benefit p99 latency
> of random write cases. However, this requires some serious development
> work, which I don't have bandwidth in short term. How about we fix #1 first,
> and see whether #3 is still urgent?

Sure, sure, and thanks a lot

Also, I wanted to say that a very common use case when one has SSDs is to put the RAID array (including raid5cache) behind bcache. Bcache takes care of most of the random I/O, so a linear writeback by raid5cache is probably acceptable.

Thank you

N.B.


