On 2019/3/31 10:37 PM, Hannes Reinecke wrote:
> On 3/31/19 4:17 PM, Qu Wenruo wrote:
>>
>>
>> On 2019/3/31 9:36 PM, Hannes Reinecke wrote:
>>> On 3/31/19 2:00 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2019/3/31 7:27 PM, Alberto Bursi wrote:
>>>>>
>>>>> On 30/03/19 13:31, Qu Wenruo wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering if it's possible that certain physical devices don't
>>>>>> handle flush correctly.
>>>>>>
>>>>>> E.g. does some vendor implement complex logic in their HDD
>>>>>> controller to skip certain flush requests (but not all, obviously)
>>>>>> to improve performance?
>>>>>>
>>>>>> Has anyone seen such reports?
>>>>>>
>>>>>> And if this has happened before, how do we as users detect such a
>>>>>> problem?
>>>>>>
>>>>>> Can we just check the flush time against the writes before the
>>>>>> flush call? E.g. write X random blocks into the device, call
>>>>>> fsync() on it, and check the execution time. Repeat Y times, and
>>>>>> compare the avg/std.
>>>>>> Then change X to 2X/4X/... and repeat the check above.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>
>>>>>
>>>>> AFAIK HDDs and SSDs do lie to fsync()
>>>>
>>>> fsync() on a block device is translated into a FLUSH bio.
>>>>
>>>> If all/most consumer-level SATA HDD/SSD devices are lying, then
>>>> there is no power-loss safety at all for any fs, as most filesystems
>>>> rely on the FLUSH bio to implement barriers.
>>>>
>>>> And filesystems with generation checks should all report metadata
>>>> from the future every time a crash happens, or even worse, gracefully
>>>> unmounting the fs would cause corruption.
>>>>
>>> Please, stop making assumptions.
>>
>> I'm not.
>>
>>>
>>> Disks don't 'lie' about anything, they report things according to the
>>> (SCSI) standard.
>>> And the SCSI standard has two ways of ensuring that things are written
>>> to disk: the SYNCHRONIZE_CACHE command and the FUA (force unit access)
>>> bit in the command.
>>
>> I understand FLUSH and FUA.
>>
>>> The latter provides a way of ensuring that a single command made it to
>>> disk, and the former instructs the drive to:
>>>
>>> "a) perform a write medium operation to the LBA using the logical
>>> block data in volatile cache; or
>>> b) write the logical block to the non-volatile cache, if any."
>>>
>>> which means it's perfectly fine to treat the write cache as a
>>> _non-volatile_ cache if the RAID HBA is battery-backed, and can thus
>>> make sure that outstanding I/O is written back even in the case of a
>>> power failure.
>>>
>>> The FUA handling, OTOH, is another matter, and indeed raises some
>>> eyebrows when compared to the spec. But that's another story.
>>
>> I don't care about FUA as much, since libata still doesn't support FUA
>> by default and emulates it as FLUSH/WRITE/FLUSH, so it doesn't make
>> things worse.
>>
>> I'm more interested in: do all SATA/NVMe disks follow this FLUSH
>> behavior?
>>
> They have to, to be spec compliant.
>
>> For most cases I believe they do, or else, whatever the fs is, whether
>> CoW-based or journal-based, we would be seeing tons of problems; even
>> a gracefully unmounted fs could have corruption if FLUSH is not
>> implemented well.
>>
>> What I'm interested in is: is there some device that doesn't completely
>> follow the regular FLUSH requirement, but does some tricks for certain
>> tested filesystems?
>>
> Not that I'm aware of.

That's great to know.

>
>> E.g. the disk is only tested for a certain fs, and that fs always does
>> something like flush, write, flush, FUA.
>> In that case, if the controller decides to skip the 2nd flush, and only
>> does the first flush and the FUA, and the 2nd write is very small
>> (e.g. a journal), the chance of corruption is pretty low due to the
>> small window.
>>
> Highly unlikely.
> Tweaking flush handling in this way is IMO far too complicated, and
> would only add to the complexity of implementing flush handling in
> firmware in the first place.
> Whereas the whole point of this exercise would be to _reduce_
> complexity in firmware (no-one really cares about the hardware here;
> that's already factored in during manufacturing, and reliability is
> measured in such a broad way that it doesn't make sense for the
> manufacturer to try to 'improve' reliability by tweaking the flush
> algorithm).
> So if someone wanted to save money they'd do away with the entire
> flush handling and not implement a write cache at all.
> That would even save them money on the hardware, too.

If there are no reports for consumer-level HDDs/SSDs, then it should be
fine, and that matches my understanding.

Thanks,
Qu

>
> Cheers,
>
> Hannes
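[Editor's note] The flush-latency probe Qu proposes earlier in the thread (write X random blocks, fsync(), time it, repeat Y times, compare avg/std as X grows) could be sketched roughly as below. This is only an illustrative heuristic, not a definitive misbehavior detector: the device path and block counts are placeholders, and the timing includes page-cache writeback as well as the FLUSH itself.

```python
# Sketch of the proposed probe: write num_blocks random 4 KiB blocks,
# then time the fsync() that triggers a FLUSH on the block device.
# If the mean fsync time barely grows as num_blocks doubles, the device
# *might* be completing FLUSH without actually draining its write cache.
import os
import time
import statistics

def time_fsync(path, num_blocks, block_size=4096, repeats=8):
    """Return (mean, stdev) of fsync latency after num_blocks random writes."""
    samples = []
    fd = os.open(path, os.O_WRONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # device/file size in bytes
        span = max(size - block_size, block_size)
        for _ in range(repeats):
            for _ in range(num_blocks):
                # Pick a random block-aligned offset within the device.
                offset = (int.from_bytes(os.urandom(4), "little")
                          * block_size) % span
                os.pwrite(fd, os.urandom(block_size), offset)
            start = time.monotonic()
            os.fsync(fd)  # flushes dirty pages and issues FLUSH downstream
            samples.append(time.monotonic() - start)
    finally:
        os.close(fd)
    return statistics.mean(samples), statistics.stdev(samples)

# Doubling X as Qu suggests ("/dev/sdX" is a placeholder; run as root,
# and only against a disposable device):
# for x in (64, 128, 256, 512):
#     print(x, time_fsync("/dev/sdX", x))
```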