Re: Disabling HDD write cache necessary (hdparm -W0)?

On 29/01/2019 12:14, Werner Fischer wrote:
> Hello all,
>
> I'd like to ask whether it is necessary to switch the write cache of HDDs and
> SSDs (without power-loss-protection) to off when they are used for mdraid.

I would think it does not help and might very well worsen the situation.

On SSDs, flushing to stable medium is expensive because the FTL (Flash Translation Layer) mapping tables, i.e. the drive's metadata, must also be written to stable medium together with the last data writes. On properly implemented SSDs, this metadata is also flushed when the OS issues a flush request.
If the SSD lies about flushing data, you can bet it also lies about flushing the metadata.
This might still not corrupt a filesystem on certain SSD implementations if the filesystem is on a single disk, but it will most likely corrupt it if the disk is part of a RAID, as I wrote in the other thread.

If you disable the write cache, you are theoretically forcing the SSD to flush the new metadata every time it writes even a single sector. That would be crazy: it would drop performance greatly, amplify writes greatly, and reduce endurance greatly. So I am inclined to think that the FTL will not actually be flushed.
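To put a rough number on the write amplification, here is a back-of-the-envelope sketch. The sizes are purely hypothetical (real FTL structures vary per vendor and model and are not published); the point is only that flushing a metadata page alongside every small write multiplies the flash traffic.

```python
# Illustrative write-amplification estimate, assuming (hypothetically)
# that disabling the cache forced the SSD to flush an FTL table page
# alongside every host write. Sizes are made up for illustration.

USER_WRITE = 4 * 1024    # one 4 KiB host write
FTL_FLUSH = 64 * 1024    # hypothetical FTL table page flushed with it

amplification = (USER_WRITE + FTL_FLUSH) / USER_WRITE
print(f"write amplification: {amplification:.0f}x")  # -> 17x with these numbers
```

With these made-up numbers every sector-sized write would cost 17x the flash wear, which is why I doubt the firmware really does it.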

Sandisk declares this behaviour explicitly for their SSDs
https://solidstatedisks.co.uk/Downloads/Sandisk_Unexpected_Power_Loss_Protection.pdf
Read this paragraph:
---------------
Disable the Use of SSD Volatile Cache
[...]
Note: Metadata tables stored on the volatile cache are not affected.
[...]
Cons: Cache disabled configuration significantly reduces the overall SSD performance (device metadata tables are still exposed).
---------------
Kudos to Sandisk for telling us *something* instead of the usual nothing. Really appreciated.

However, after disabling the write cache, the OS (I have not checked) or the disk might assume that, since the write cache is disabled, flush commands no longer need to be sent. That would prevent the flush of the FTL, making the situation worse than with the cache enabled. AFAIR Linux does not issue flushes if the disk reports a write-through cache (which basically means no cache), so I would expect it not to issue flushes on cache-disabled disks either.
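You can at least see what cache mode the kernel has recorded for each disk (which says nothing about whether the firmware is honest). A minimal sketch, assuming a Linux system with sysfs mounted and a reasonably recent kernel that exposes the queue's write_cache attribute:

```python
# List the cache mode Linux has recorded per block device.
# "write back" means the kernel will send flush requests;
# "write through" means it assumes no volatile cache and skips them.
import os

def cache_modes(sys_block="/sys/block"):
    modes = {}
    if not os.path.isdir(sys_block):  # e.g. not a Linux system
        return modes
    for dev in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, dev, "queue", "write_cache")
        try:
            with open(path) as f:
                modes[dev] = f.read().strip()
        except OSError:
            pass  # some virtual devices lack this attribute
    return modes

if __name__ == "__main__":
    for dev, mode in cache_modes().items():
        print(f"{dev}: {mode}")
```

If a cache-disabled disk shows up as "write through" here, the kernel will indeed skip flushes for it, which is exactly the risk described above.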

> As discussed by Nik.Brt. and Song Liu last week, many storage devices
> (HDDs/SSDs) "lie" when they indicate that they have written data. The data is
> only in the drive's cache, but not on the magnetic disc or flash. "The disk's
> embedded microcontroller may signal the main computer that a disk write is
> complete immediately after receiving the write data, before the data is
> actually written to the platter." [1]

This is the correct behaviour. The write is complete when it reaches the DRAM cache of the disk. To guarantee that the data is on the platters (or on the flash stable medium), you need to issue a flush and wait for the flush to return.
The problem arises when the disk lies about such a flush.

> When used as a single disc, this can be handled with modern file systems, as
> they use write barriers. [2][3]

> But what I'm not sure is, how this is handled by mdraid in case of a sudden
> power loss. In the past I've recommended to disable the drive's write cache by
> using "hdparm -W0". This is also the default behavior of hardware raid
> controllers. They switch off the drive cache of HDDs as they use their internal
> (battery-backed) cache.

> So my question is:
> Is it safe to keep the cache of HDDs and SSDs (without power-loss-protection)

"(without power-loss-protection)": you never know whether an SSD has power-loss protection, at least not from the specs. The specs lie. At least one brand/model with visible supercapacitors has been found not honoring the flush, tested with diskchecker.pl. Its specs cited something along the lines of "power-loss protection, for data at rest". "At rest" to me would mean "after a flush", but apparently not to the engineers of that company.
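For reference, what diskchecker.pl essentially does is write a stream of sequence numbers, fsync after each one, and, after the power is physically cut, verify that every record acknowledged before the cut is really on disk. Here is a toy sketch of that idea in Python, just to illustrate the logic; it is no replacement for diskchecker.pl, since only a real power cut (and normally a second machine recording the acknowledgements) proves anything:

```python
# Toy illustration of the diskchecker.pl idea: every fsync-acknowledged
# sequence number must survive a power cut if the drive honors flushes.
import os
import struct

def write_sequence(path, count):
    """Append sequence numbers, fsync after each, return the last number
    that was acknowledged as durable."""
    last_acked = -1
    with open(path, "ab") as f:
        for seq in range(count):
            f.write(struct.pack("<Q", seq))
            f.flush()
            os.fsync(f.fileno())  # drive claims this record is durable
            last_acked = seq      # on real hardware: report to a 2nd box
    return last_acked

def verify(path, last_acked):
    """After restart: check that all acknowledged records survived."""
    with open(path, "rb") as f:
        data = f.read()
    seqs = {struct.unpack_from("<Q", data, i)[0]
            for i in range(0, len(data) - 7, 8)}
    return set(range(last_acked + 1)) <= seqs
```

If verify() fails for records the drive had already acknowledged via fsync, the drive lies, regardless of what its datasheet claims.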

> to on when used with mdraid?

It is safe to keep the cache on if the disk honors the flush.
If it doesn't honor the flush (i.e. it lies), I don't think you can work around the problem by disabling the cache or in any other software way. The only workaround I can imagine would be a linear replay-log device which emulates the persistent memory of a battery-backed RAID controller and replays the last writes when the power returns; but with any finite size of such a replay log, it is theoretically still not 100% guaranteed.



