Re: Bugfix / feature requests for raid5cache (writeback)

On Thu, Jan 24, 2019 at 11:03 PM Nik.Brt. <nik.brt@xxxxxxxxxxxxx> wrote:
>
> Hi Song,
>
> On 24/01/2019 19:29, Song Liu wrote:
>  >>
>  >> https://www.spinics.net/lists/raid/msg61331.html
>  >> https://www.spinics.net/lists/raid/msg60713.html
>  >> but it's currently not applied upstream in the latest v4.20.4
>  >
>  > Thanks for bringing this back. We didn't make enough progress because
>  > Shaohua was very sick back then.
>  >
>  > I think the fix would work. I will pick it up from here.
>
> Oooh I'm so sorry... I just realized what happened to Shaohua.
> That's so bad :-(((((
> May he rest in peace.....
>
>  >>
>  >> 2- Workaround for liar disks
>  >> [...]
>  >> There should be an additional pointer to the log (replay_ptr) which
>  >> stays at least XX MB and at least YY seconds (both configurable)
>  >> behind the last flush to the RAID disks
>  >
>  > Could you please be more specific about how the RAID disk lies about
>  > flush? Does it "claim flush is done before fully flushing the volatile
>  > cache"? If this is the case, I think we really cannot guarantee the
>  > data is secure, as it is hard to determine the proper XX MB and YY
>  > second values.
>
> Yes, so, this is a somewhat known problem in storage circles.
> Because the hardware review websites always test performance and never
> test for consistency problems, the HDD and SSD brands have always had
> this oh-so-smart idea of lying about the flush being performed:
> they return the flush before the data is on stable media.
>
> Not all disks do that, but the majority do.
> It is difficult to tell exactly what they do because it is
> implementation dependent, varies by brand, and is undisclosed.
> Between totally ignoring the flush command and fully respecting it
> there are many possibilities in between.
>
> People who want to do serious storage have traditionally chosen a
> hardware RAID controller with battery-backed cache: things seem to
> improve a lot in that case and the problems are basically fixed. The
> most likely interpretation is that disks lie only up to a certain
> extent, and will not postpone/reorder a write indefinitely.
>
> For example, they might perform periodic barriers, or convert an
> incoming flush request into a barrier so that writes are not reordered
> across it.
>
> The best-known method to check whether disks respect the flush is
> diskchecker.pl from the postgresql people. Look it up on the internet
> and you will find plenty of discussions about the need to test with it
> for serious storage, in particular for postgresql / databases (and, I
> would say, also for RAID).

Thanks for this information. I will try diskchecker.pl with our SSDs.

>
> That's why I suggested that you make raid5cache behave like a
> battery-backed linear cache: you write back linearly starting from the
> oldest requests, but you keep two pointers: one where you are issuing
> the writeback, and another further behind called replay_ptr: log space
> ahead of this point still cannot be reclaimed. The distance between
> the two could be measured in seconds and/or in megabytes; I am still
> not sure which of the two is better, probably the stricter of the two
> values would have to be used. The two values should be configurable
> because the needed value probably depends on the implementation of
> the disks.
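>
> To make the idea concrete, here is a minimal sketch (illustration
> only: none of these names exist in raid5cache, and the numbers are
> just placeholders). replay_ptr may only advance while it stays at
> least min_lag_bytes and min_lag_secs behind the writeback position,
> so the logged data remains replayable even if a disk acknowledged a
> flush too early:
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     struct log_window {
>         uint64_t writeback_ptr;  /* log offset already written back to the array */
>         uint64_t replay_ptr;     /* log space before this point may be reclaimed */
>         uint64_t min_lag_bytes;  /* configurable, e.g. 64 MiB */
>         uint64_t min_lag_secs;   /* configurable, e.g. 30 s */
>     };
>
>     /* May replay_ptr advance to 'target', or must it keep trailing? */
>     static bool may_advance_replay_ptr(const struct log_window *w,
>                                        uint64_t target,
>                                        uint64_t secs_since_target_writeback)
>     {
>         bool far_enough = (w->writeback_ptr - target) >= w->min_lag_bytes;
>         bool old_enough = secs_since_target_writeback >= w->min_lag_secs;
>         return far_enough && old_enough; /* the stricter limit wins */
>     }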
>
> I see that the raid5cache code goes to great lengths to write back
> the full stripes first, but I'm not sure why it does that, because it
> doesn't seem to me that you can reclaim log space out of order; that's
> a linear space, right? So why not simply flush in order, starting from
> the oldest request? An in-order writeback is also simpler to
> implement. If you do it out of order, my idea is probably impossible
> to implement.

Full stripe writes give much better performance, so we keep more data
in the cache hoping to get more full stripe writes. Of course, keeping
data longer doesn't always give more full stripe writes. I had some
ideas for a heuristic algorithm to decide how long to keep data in the
cache, but I haven't had time for it.

Out of order flush is not a problem; R5LOG_PAYLOAD_FLUSH is used to
identify which stripes need to be flushed.
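
As a rough illustration of that mechanism (a conceptual sketch, not the
actual raid5-cache code; the helper below is made up), recovery can
skip any stripe for which a flush record is present later in the log,
so the writeback order does not have to match the log order:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Returns true if the stripe still has to be replayed from the log,
     * i.e. no flush record says it already reached the RAID disks. */
    static bool stripe_needs_replay(uint64_t stripe_sector,
                                    const uint64_t *flushed_sectors,
                                    size_t n_flushed)
    {
        for (size_t i = 0; i < n_flushed; i++)
            if (flushed_sectors[i] == stripe_sector)
                return false;   /* already written back, skip it */
        return true;            /* only in the log, must be replayed */
    }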

>
> SSDs also lie. There are even known cases of a brand with
> supercapacitors, which also claims to have power-loss data protection
> for data at rest (which would be the flushed data), but in fact still
> lies about the flush being performed.
>
> There are also some disks which don't lie and don't have
> supercapacitors: it appears that the Samsung "PRO" line (still
> consumer, no supercap) doesn't lie (EVO unknown, I didn't find
> discussions); if you search for "samsung diskchecker.pl" there are a
> few people who tested that one. Most SSDs lie. Others known to respect
> flushes are the Intel drives, *in the datacenter line WITH
> supercapacitors*.
>
> The disks which lie usually lie in a way that does not corrupt
> filesystems in the simplest case: a filesystem on a single disk. Maybe
> they transform a flush into a barrier internally. In RAID, however,
> things are very different, because upon power loss two disks of the
> set can "roll back" to different points in time, and at that point the
> RAIDed data is screwed, even with raid1, even with the bitmap active
> (the bitmap, too, is taken from a single disk and would be wrong), and
> for raid5/6 the damage would probably be more serious.
>
> Clearly the user needs non-lying disks, at least for the backend
> device for raid5cache. Such a requirement cannot easily be worked
> around in code.
>
> And this would fix only raid5/6: there would still be no workaround
> for raid1 and raid10.
>
>  >>
>  >> 3- Write back during idle times
>  >> It seems to me that with the current code the cache will stay
>  >> non-empty forever, even with low amounts of writes.
>  >> The raid5cache apparently does not leverage moments of idleness to
>  >> write back (clean) itself completely to the array AFAICS...
>  >> [...]
>  >
>  > This is a good idea. Proactive writing back could also benefit p99
>  > latency of random write cases. However, this requires some serious
>  > development work, which I don't have bandwidth for in the short
>  > term. How about we fix #1 first, and see whether #3 is still urgent?
>
> Sure, sure, and thanks a lot
>
> Also I wanted to say that a very common use case, when one has SSDs,
> is to put the RAID array (including raid5cache) behind bcache. Bcache
> takes care of most of the random I/O, so a linear writeback by
> raid5cache is probably acceptable.

I think linear write back or out of order write back doesn't matter
that much. We need to be careful with replay, because with a write-back
cache the replay may not be a simple replay; it could also be a
read-modify-write.
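
For instance (a conceptual sketch, not the kernel code), replaying one
logged data block by read-modify-write means reading the old data and
old parity back from the array so the parity can be updated before
anything is written:

    #include <stddef.h>
    #include <stdint.h>

    /* new parity = old parity XOR old data XOR new data */
    static void rmw_replay_block(uint8_t *parity, uint8_t *disk_data,
                                 const uint8_t *logged_data, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            parity[i] ^= disk_data[i] ^ logged_data[i];
            disk_data[i] = logged_data[i];  /* then apply the new data */
        }
    }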

Thanks,
Song

>
> Thank you
>
> N.B.


