* Theodore Tso:

> The only one that falls into that category is the one about not being
> able to handle failed writes, and the way most failures take place,

Hmm.  What does "not being able to handle failed writes" actually
mean?  AFAICS, there are two possible answers: "all bets are off", or
"we'll tell you about the problem, and all bets are off".

>> Isn't this by design?  In other words, if the metadata doesn't
>> survive non-atomic writes, wouldn't it be an ext3 bug?
>
> Part of the problem here is that "atomic writes" is a confusing term;
> it doesn't mean what many people think it means.  The assumption
> which many naive file system designers make is that writes either
> succeed or they don't.  If they don't succeed, they don't change the
> previously existing data in any way.

Right.  And a lot of database systems make the same assumption.
Oracle Berkeley DB cannot deal with partial page writes at all, and
PostgreSQL assumes that it's safe to flip a few bits in a sector
without proper WAL (it doesn't care if the changes actually hit the
disk, but the write shouldn't make the sector unreadable or put
random bytes there).

> Is that a file system "bug"?  Well, it's better to call it a mismatch
> between the assumptions made about physical devices and those made by
> the file system code.  On Irix, SGI hardware had a power-fail
> interrupt, and a power supply with extra-big capacitors, so that when
> a power-fail interrupt came in, Irix would run around frantically
> shutting down pending DMA transfers to prevent this failure mode from
> causing problems.  PC-class hardware (according to Ted's law) is cr*p
> and doesn't have a power-fail interrupt, so it's not something that
> we have.

The DMA transaction should fail due to ECC errors, though.

> Ext3, ext4, and ocfs2 do physical block journalling, so as long as a
> journal truncate hasn't taken place right before the failure, replay
> of the physical block journal tends to repair most (but not
> necessarily all) cases of "garbage is written right before power
> failure".  People who care about this should really use a UPS, and
> wire up the USB and/or serial cable from the UPS to the system, so
> that the OS can do a controlled shutdown if the UPS is close to
> shutting down due to an extended power failure.

I think the general idea is to protect valuable data with WAL.  You
overwrite pages on disk only after you've made a backup copy into the
WAL.  After a power-loss event, you replay the log and overwrite any
garbage that might be there.  For the WAL itself, you rely on
checksums and sequence numbers.  (A rough sketch of this scheme is
appended after my signature.)  This still doesn't help against write
failures where the system continues running (because the fsync()
during checkpointing isn't guaranteed to report errors), but it
should deal with the power-failure case.

But this assumes that the file system protects its own data
structures in a similar way.  Is that really too much to ask?

Partial failures are extremely difficult to deal with because of
their asynchronous nature.  I've come to accept that, but it's still
disappointing.

-- 
Florian Weimer                <fweimer@xxxxxx>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
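
To make the WAL idea above concrete, here is a rough sketch in C of
what I mean by "backup copy plus checksum plus sequence number".
This is not how PostgreSQL or Berkeley DB actually lay out their
logs; the record format, the 4K page size, and the function names are
invented for illustration, and a simple FNV-1a hash stands in for a
real CRC.  The point is only the ordering: append the page image to
the log, fsync the log, and only then overwrite the page in place.

/* Minimal physical-page WAL sketch.  All names and the record layout
   are invented for illustration; FNV-1a stands in for a proper CRC. */

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

struct wal_record {
    uint64_t seq;                  /* monotonically increasing */
    uint64_t page_no;              /* page index in the data file */
    uint64_t checksum;             /* covers the whole record */
    unsigned char page[PAGE_SIZE]; /* full image of the new page */
};

static uint64_t fnv1a(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 1469598103934665603ULL;

    while (len--) {
        h ^= *p++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Write one page: log it, force the log, then overwrite in place. */
static int wal_write_page(int log_fd, int data_fd, uint64_t seq,
                          uint64_t page_no, const void *page)
{
    struct wal_record rec;

    memset(&rec, 0, sizeof rec);   /* no uninitialized padding bytes */
    rec.seq = seq;
    rec.page_no = page_no;
    memcpy(rec.page, page, PAGE_SIZE);
    rec.checksum = fnv1a(&rec, sizeof rec); /* field was zero above */

    if (write(log_fd, &rec, sizeof rec) != (ssize_t) sizeof rec)
        return -1;
    /* The fsync() return value must be checked, even though, as said
       above, it is not guaranteed to report every write failure. */
    if (fsync(log_fd) != 0)
        return -1;
    /* Only now is it safe to overwrite the page in place; if power
       fails mid-write, recovery restores the page from the log. */
    if (pwrite(data_fd, page, PAGE_SIZE,
               (off_t) (page_no * PAGE_SIZE)) != PAGE_SIZE)
        return -1;
    return fsync(data_fd);
}

/* Replay: stop at the first record whose checksum or sequence number
   is wrong, so a torn write at the log's tail is discarded rather
   than replayed as garbage. */
static void wal_recover(int log_fd, int data_fd)
{
    struct wal_record rec;
    uint64_t expect_seq = 1;
    off_t off = 0;

    while (pread(log_fd, &rec, sizeof rec, off) == (ssize_t) sizeof rec) {
        uint64_t stored = rec.checksum;

        rec.checksum = 0;
        if (fnv1a(&rec, sizeof rec) != stored || rec.seq != expect_seq)
            break;
        /* Re-applying a full page image is idempotent, so replaying
           records whose in-place write already completed is harmless. */
        pwrite(data_fd, rec.page, PAGE_SIZE,
               (off_t) (rec.page_no * PAGE_SIZE));
        expect_seq++;
        off += sizeof rec;
    }
    fsync(data_fd);
}

A real system would also need a checkpoint rule for truncating the
log (which is where Ted's point about the timing of journal
truncation bites), but that is beyond a sketch.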