Re: fsync() errors is unsafe and risks data loss

Martin Steigerwald <martin@xxxxxxxxxxxx> · Tue, 10 Apr 2018 21:47:21 +0200

Hi Theodore, Darrick, Joshua.

CC´d fsdevel as it does not appear to be Ext4 specific to me (and to you as 
well, Theodore).

Theodore Y. Ts'o - 10.04.18, 20:43:
> This isn't actually an ext4 issue, but a long-standing VFS/MM issue.
[…]
> First of all, what storage devices will do when they hit an exception
> condition is quite non-deterministic.  For example, the vast majority
> of SSD's are not power fail certified.  What this means is that if
> they suffer a power drop while they are doing a GC, it is quite
> possible for data written six months ago to be lost as a result.  The
> LBA could potentially be far, far away from any LBA's that were
> recently written, and there could have been multiple CACHE FLUSH
> operations in the time since the LBA in question was last written six
> months ago.  No matter; for a consumer-grade SSD, it's possible for
> that LBA to be trashed after an unexpected power drop.

Guh. I was not aware of this. I knew consumer-grade SSDs often do not have 
power loss protection, but still thought they´d handle garbage collection in an 
atomic way. Sometimes I am tempted to sing an "all hardware is crap" song 
(starting with Meltdown/Spectre, then probably heading over to storage devices 
and so on… including firmware crap like Intel ME).

> Next, the reason why fsync() has the behaviour that it does is that one
> of the most common cases of I/O storage errors in buffered use
> cases, certainly as seen by the community distros, is the user who
> pulls out a USB stick while it is in use.  In that case, if there are
> dirtied pages in the page cache, the question is what can you do?
> Sooner or later the writes will time out, and if you leave the pages
> dirty, then it effectively becomes a permanent memory leak.  You can't
> unmount the file system --- that requires writing out all of the pages
> such that the dirty bit is turned off.  And if you don't clear the
> dirty bit on an I/O error, then they can never be cleaned.  You can't
> even re-insert the USB stick; the re-inserted USB stick will get a new
> block device.  Worse, when the USB stick was pulled, it will have
> suffered a power drop, and see above about what could happen after a
> power drop for non-power fail certified flash devices --- it goes
> double for the cheap sh*t USB sticks found in the checkout aisle of
> Micro Center.

From the original PostgreSQL mailing list thread I did not get how exactly 
FreeBSD differs in behavior, compared to Linux. I am aware of one operating 
system that from a user point of view handles this in almost the right way 
IMHO: AmigaOS.

When you removed a floppy disk from the drive while the OS was writing to it, 
it showed a "You MUST insert volume somename into drive somedrive:" prompt, and 
if you did, it just continued writing. (The part that did not work well was that 
with the original filesystem, if you did not insert it back, the whole disk was 
corrupted, usually to a point beyond repair, so the "MUST" was no joke.)

In my opinion from a user´s point of view this is the only sane way to handle 
the premature removal of removable media. I have read of a GSoC project to 
implement something like this for NetBSD but I did not check on the outcome of 
it. I think MS-DOS had something similar, but MS-DOS is not a multitasking 
operating system as AmigaOS is.

Implementing something like this for Linux would be quite a feat, I think, 
because in addition to the implementation in the kernel, the desktop 
environment or whatever other userspace you use would need to handle it as 
well, so you´d have to adapt udev / udisks / probably systemd. This behavior 
probably also needs to be restricted to media that are really removable. Even 
then, in order to prevent memory exhaustion in case processes continue to write 
to a removed and not yet re-inserted USB hard disk, the kernel would need to 
halt the processes which keep dirtying pages for this device. (I believe this 
is what AmigaOS did. It just blocked all subsequent I/O to the device until it 
was re-inserted. But then the I/O handling in that OS at that time is quite 
different from what Linux does.)
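
The blocking behaviour described above can be modelled in user space roughly 
like this. It is only a toy Python sketch, not kernel code and not a claim about 
how AmigaOS implemented it; BlockingVolume and its methods are hypothetical 
names. The idea is that writers wait until the volume is back instead of piling 
up dirty data in memory.

```python
import threading


class BlockingVolume:
    """Toy model of a removable volume that stalls writers while absent."""

    def __init__(self):
        # The event is set while the medium is present.
        self._present = threading.Event()
        self._present.set()
        self.blocks = []  # stands in for data that reached the medium

    def remove(self):
        """Simulate pulling the medium out of the drive."""
        self._present.clear()

    def reinsert(self):
        """Simulate putting the same medium back."""
        self._present.set()

    def write(self, block, timeout=None):
        # Block the caller until the medium is back, so no unbounded
        # backlog of unwritten data builds up.
        if not self._present.wait(timeout):
            raise TimeoutError("volume still missing")
        self.blocks.append(block)
```

A writer thread that calls write() after remove() simply sleeps until 
reinsert() is called. A real implementation would additionally have to verify 
that the re-inserted medium is the same one and is still consistent.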

> So this is the explanation for why Linux handles I/O errors by
> clearing the dirty bit after reporting the error up to user space.
> And why there is no eagerness to solve the problem simply by "don't
> clear the dirty bit".  For every one Postgres installation that might
> have a better recovery after an I/O error, there's probably a thousand
> clueless Fedora and Ubuntu users who will have a much worse user
> experience after a USB stick pull happens.
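
For applications, the practical consequence of that behaviour can be sketched 
as follows. This is a minimal Python sketch under stated assumptions: 
write_durably and DurabilityLost are made-up names, and the fsync parameter 
exists only so the failure path can be exercised without failing hardware. The 
point is that after a failed fsync() the dirty bit may already be cleared, so a 
later fsync() that succeeds proves nothing about the earlier data.

```python
import os


class DurabilityLost(Exception):
    """Raised when fsync() fails; the on-disk state is then unknown."""


def write_durably(path, data, fsync=os.fsync):
    """Write data to path and call fsync() on it exactly once."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        try:
            fsync(fd)
        except OSError as exc:
            # Do not retry fsync() here: the kernel may already have
            # marked the failed pages clean, so a retry could succeed
            # although the data never reached the device.
            raise DurabilityLost(f"fsync failed on {path}: {exc}") from exc
    finally:
        os.close(fd)
```

A caller would have to treat DurabilityLost as "on-disk state unknown" and 
redo the work from a source it still trusts, for example by replaying a log, 
rather than by calling fsync() again.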

I was not aware that flash based media may be as crappy as you hint at.

From my tests with AmigaOS 4.something or AmigaOS 3.9 + the third-party 
Poseidon USB stack, the above mechanism worked even with USB sticks. However, I 
did not test this often, and I did not check for data corruption after a test.

Thanks,
-- 
Martin