On 29/01/2019 12:14, Werner Fischer wrote:
Hello all,
I'd like to ask whether it is necessary to switch the write cache of
HDDs and SSDs (without power-loss-protection) to off when they are used
for mdraid.
I would think it would not work and might very well worsen the situation.
On SSDs, flushing to stable medium is expensive because the FTL
(Flash Translation Layer) data, also called metadata, must also be
written to stable medium, together with the last data writes.
On properly implemented SSDs, this metadata is also flushed when the OS
sends a flush request.
If the SSD lies, you can bet it will also lie about flushing the metadata.
On certain SSD implementations this might still not corrupt a filesystem
that sits on a single disk, but it will most likely corrupt it if it is
on RAID, as I wrote in the other thread.
If you disable the write cache, you are theoretically forcing the SSD to
flush the new metadata every time it writes even a single sector. That
would be crazy: it would greatly reduce performance, greatly amplify
writes and greatly reduce endurance. So I am inclined to think that the
FTL will not be flushed.
Sandisk states this behaviour explicitly for their SSDs:
https://solidstatedisks.co.uk/Downloads/Sandisk_Unexpected_Power_Loss_Protection.pdf
Read this paragraph:
---------------
Disable the Use of SSD Volatile Cache
[...]
Note: Metadata tables stored on the volatile cache are not affected.
[...]
Cons: Cache disabled configuration significantly reduces the overall SSD
performance (device metadata tables are still exposed).
---------------
Kudos to Sandisk for telling us *something* instead of the usual
nothing. Really appreciated.
However, once the write cache is disabled, the OS (I have not checked) or
the disk might assume that flush commands are no longer needed, which
would prevent the flush of the FTL and make the situation even worse
than with the cache enabled.
AFAIR Linux does not issue flushes if the disk reports a writethrough
cache (which basically means no cache), so I would expect it not to
issue flushes on cache-disabled disks either.
As discussed by Nik.Brt. and Song Liu last week, many storage devices
(HDDs/SSDs) "lie" when they indicate that they have written data. The
data is only in the drive's cache, but not on magnetic disc or flash.
"The disk's embedded microcontroller may signal the main computer that a
disk write is complete immediately after receiving the write data,
before the data is actually written to the platter." [1]
This is the correct behaviour. The write is complete when it reaches the
DRAM cache of the disk.
To guarantee that data is on the platters or on stable flash, you need
to issue a flush and wait for the flush to return.
The problem arises when the disk lies about such a flush.
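From an application's point of view, "issue a flush and wait for it to
return" usually means calling fsync() and not treating the data as
durable before it comes back. A minimal sketch, assuming a regular file
on a filesystem that forwards fsync() to the device as a cache flush
(durable_write is a name made up for this example):

```python
import os


def durable_write(path, data):
    """Write data to path and return only after fsync() has returned."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the data to the device and wait for it.
        # A drive that lies about the flush can still lose the data.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(data)
```

This only shows the waiting side of the contract. It cannot detect a
drive that acknowledges the flush without writing to stable medium.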
When used as a single disc, this can be handled with modern file
systems, as they use write barriers. [2][3]
But what I'm not sure about is how this is handled by mdraid in case of
a sudden power loss. In the past I've recommended disabling the drive's
write cache by using "hdparm -W0". This is also the default behavior of
hardware raid controllers. They switch off the drive cache of HDDs as
they use their internal (battery-backed) cache.
So my question is:
Is it safe to keep the cache of HDDs and SSDs (without
power-loss-protection)
"(without power-loss-protection)" : You never know whether an SSD has
power-loss protection or not, at least not from the specs.
The specs lie. At least one brand/model with visible supercapacitors
has been found not to honor the flush when tested with diskchecker.pl.
The specs said something along the lines of "power-loss protection, for
data at rest". "At rest" to me would mean "after a flush", but
apparently not to the engineers of that company.
to on when used with mdraid?
It is safe to keep the cache on if the disk honors the flush.
If it doesn't honor the flush (i.e. it lies), I don't think you can work
around the problem by disabling the cache or in any other software way.
The only way to work around that would be a linear replay log device
that emulates the persistent memory of a battery-backed raid controller
and replays the last writes when the power returns, but with any finite
size of such a replay log it is theoretically still not 100% guaranteed.
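To illustrate why a finite replay log cannot give a full guarantee, here
is a toy simulation. Everything in it is hypothetical (ReplayLog,
simulate, and the model of a disk that silently drops its last N
writes); it is not how mdraid or any existing tool works.

```python
from collections import deque


class ReplayLog:
    """Hypothetical bounded log of recent writes on a trusted device."""

    def __init__(self, capacity):
        # Once the log is full, the oldest entry is dropped on each append.
        self.entries = deque(maxlen=capacity)

    def record(self, sector, data):
        self.entries.append((sector, data))

    def replay(self, disk):
        # Re-apply every logged write, oldest first.
        for sector, data in self.entries:
            disk[sector] = data


def simulate(num_writes, lost_writes, log_capacity):
    """Return True if replay restores every write after a power loss."""
    log = ReplayLog(log_capacity)
    expected = {}
    disk = {}
    for i in range(num_writes):
        data = f"v{i}"
        log.record(i, data)
        expected[i] = data
        # The lying disk acknowledges but loses its last `lost_writes` writes.
        if i < num_writes - lost_writes:
            disk[i] = data
    log.replay(disk)
    return disk == expected
```

With these assumptions, simulate(10, 3, 4) recovers everything, while
simulate(10, 5, 4) does not: the disk lost more writes than the log
could hold.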