Re: Disabling HDD write cache neccessary (hdparm -W0)?

"Nik.Brt." <nik.brt@xxxxxxxxxxxxx> · Wed, 30 Jan 2019 09:24:33 +0100

On 29/01/2019 12:14, Werner Fischer wrote:
Hello all,

I'd like to ask whether it is necessary to switch the write cache of HDDs and
SSDs (without power-loss-protection) to off when they are used for mdraid.

I would think it would not work and might very well worsen the situation.

On SSDs, flushing to stable medium is expensive because the FTL 
(Flash Translation Layer) tables, also called metadata, must also be 
written to stable medium, together with the last data writes.
On properly implemented SSDs, the metadata is flushed as well when the 
OS issues a flush request.
If the SSD lies about flushing data, you can bet it will also lie about 
flushing the metadata.
On certain SSD implementations this might still not corrupt a 
filesystem that sits on a single disk, but it will most likely corrupt 
it if it is in raid, as I wrote on the other thread.

If you disable the write cache, you are theoretically forcing the SSD to 
flush the new metadata every time it writes even a single sector. That 
would be crazy: it would greatly reduce performance, greatly amplify 
the writes and greatly shorten the endurance. So I am inclined to think 
that the FTL will not actually be flushed.

Sandisk declares this behaviour explicitly for their SSDs:
https://solidstatedisks.co.uk/Downloads/Sandisk_Unexpected_Power_Loss_Protection.pdf
Read this paragraph:
---------------
Disable the Use of SSD Volatile Cache
[...]
Note: Metadata tables stored on the volatile cache are not affected.
[...]
Cons: Cache disabled configuration significantly reduces the overall SSD 
performance (device metadata tables are still exposed).
---------------
Kudos to Sandisk for telling us *something* instead of the usual 
nothing. Really appreciated.

However, after the write cache is disabled, the OS (I have not checked) 
or the disk might conclude that flush commands no longer need to be 
sent, which would prevent the flush of the FTL and make the situation 
even worse than with the cache enabled. AFAIR Linux does not issue 
flushes if the disk reports a writethrough cache (which basically means 
no cache), so I would expect it not to issue flushes to cache-disabled 
disks either.

As discussed by Nik.Brt. and Song Liu last week, many storage devices
(HDDs/SSDs) "lie" when they indicate that they have written data. The data is
only in the drive's cache, but not on magnetic disc or flash. "The disk's
embedded microcontroller may signal the main computer that a disk write is
complete immediately after receiving the write data, before the data is actually
written to the platter." [1]

This is the correct behaviour. The write is complete when it reaches the 
DRAM cache of the disk.
To guarantee that data is on the platters / flash stable medium, you 
need to issue a flush and wait for it to return.
The problem arises when the disk lies about such a flush.
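
From the application side, "write, then flush and wait" looks roughly like the sketch below. It assumes a POSIX system where fsync() on a file makes the filesystem push the data down and request a cache flush from the device. The helper name `durable_write` is made up for illustration. If the disk lies about the flush, this code returns successfully all the same, which is exactly the problem.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and return only after fsync() has returned."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write() may write fewer bytes than requested, so loop.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Block until the OS reports the data as flushed to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
```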

When used as a single disc, this can be handled with modern file systems, as
they use write barriers. [2][3]

But what I'm not sure about is how this is handled by mdraid in case of a sudden
power loss. In the past I've recommended to disable the drive's write cache by
using "hdparm -W0". This is also the default behavior of hardware raid
controllers. They switch off the drive cache of HDDs as they use their internal
(battery-backed) cache.

So my question is:
Is it safe to keep the cache of HDDs and SSDs (without power-loss-protection)

"(without power-loss-protection)" : You never know whether an SSD has 
power loss protection, at least not from the specs.
The specs lie. At least 1 brand/model with visible supercapacitors has 
been found not to honor the flush when tested with diskchecker.pl . The 
specs cited something along the lines of "power-loss protection, for 
data at rest" . "At rest" to me would mean "after a flush", but 
apparently not to the engineers of that company.
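
The idea behind such a test (remember what was acknowledged after a flush, cut the power, compare) can be illustrated with a toy model. This is not diskchecker.pl and not a real device API: `ToyDrive`, `honors_flush` and `lost_after_power_cut` are hypothetical names, and the "drive" is just two dictionaries.

```python
class ToyDrive:
    """Toy model of a drive with a volatile write cache (illustration only)."""

    def __init__(self, honors_flush: bool) -> None:
        self.honors_flush = honors_flush
        self.cache = {}  # volatile DRAM cache, lost on power failure
        self.media = {}  # stable medium (platters / flash)

    def write(self, lba: int, data: bytes) -> None:
        # A write is acknowledged as soon as it sits in the cache.
        self.cache[lba] = data

    def flush(self) -> None:
        # An honest drive empties its cache to the medium before returning.
        # A lying drive returns at once and leaves the data in the cache.
        if self.honors_flush:
            self.media.update(self.cache)
            self.cache.clear()

    def power_loss(self) -> None:
        # Everything still in the volatile cache is gone.
        self.cache.clear()


def lost_after_power_cut(drive: ToyDrive, blocks: dict) -> list:
    """Write blocks, flush, cut power; return the LBAs that did not survive."""
    for lba, data in blocks.items():
        drive.write(lba, data)
    drive.flush()
    drive.power_loss()
    return sorted(lba for lba, data in blocks.items()
                  if drive.media.get(lba) != data)


if __name__ == "__main__":
    blocks = {0: b"a", 1: b"b"}
    print("honest drive lost:", lost_after_power_cut(ToyDrive(True), blocks))
    print("lying drive lost: ", lost_after_power_cut(ToyDrive(False), blocks))
```

In the model the honest drive loses nothing, while the lying drive loses every block that was "flushed", and nothing the host does between write and power cut changes that.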

to on when used with mdraid?

It is safe to keep the cache on if the disk honors the flush.
If it doesn't honor the flush (i.e. it lies), I don't think you can 
work around the problem by disabling the cache or in any other software 
way.
The only workaround would be a linear replay log device that emulates 
the persistent memory of a battery-backed raid controller and replays 
the last writes when the power returns, but with any finite size of 
such a replay log it is theoretically still not 100% guaranteed.