On Jun 27, 2011, at 12:01, Ted Ts'o wrote:
> On Mon, Jun 27, 2011 at 05:30:11PM +0200, Lukas Czerner wrote:
>>> I've found some. So although data=journal users are a minority, there
>>> are some. That being said, I agree with you that we should do something
>>> about it - either state that we want to fully support data=journal -
>>> and then we should really do better with testing it - or deprecate it
>>> and remove it (which would save us some complications in the code).
>>>
>>> I would be slightly in favor of removing it (code simplicity, fewer
>>> options to configure for admins, fewer options to test for us; some
>>> users I've come across actually were not quite sure why they were
>>> using it - they just thought it looked safer).
>
> Hmm... FYI, I hope to be able to bring online automated testing for
> ext4 later this summer (there's a testing person at Google who has
> signed up to work on setting this up as his 20% project). The test
> matrix that I gave him includes data=journal, so we will be getting
> better testing in the near future.
>
> At least historically, data=journal was the *simpler* case, and
> was the first thing supported by ext4. (data=ordered required revoke
> handling, which didn't land for six months or so.) So I'm not really
> convinced that removing it buys us that much code simplification.
>
> That being said, it is true that data=journal isn't necessarily
> faster. For heavy disk-bound workloads, it can be slower. So I can
> imagine adding some documentation that warns people not to use
> data=journal unless they really know what they are doing, but at
> least personally, I'm a bit reluctant to dispense with a bug report
> like this by saying, "oh, that feature should be deprecated."

I suppose I should chime in here, since I'm the one who (potentially
incorrectly) thinks I should be using data=journal mode.
My basic impression is that "data=journal" can slightly reduce the risk
of serious corruption to some kinds of databases when the application
does not provide appropriate syncs or journalling on its own (e.g.
text-based wiki database files). Please correct me if this is horribly,
horribly wrong:

no journal:
  Nothing is journalled.
  + Very fast.
  + Works well for filesystems that are "mkfs"ed on every boot.
  - Have to fsck after every unclean shutdown.

data=writeback:
  Metadata is journalled; data (to allocated extents) may be written
  before or after the metadata is updated with a new file size.
  + Fast (though not as fast as unjournalled).
  + No need to fsck after a hard power-down.
  - A crash or power failure in the middle of a write could leave stale
    data on disk at the end of a file. If security labeling such as
    SELinux is enabled, this could "contaminate" a file with data from
    a deleted file that was at a higher sensitivity. Log files
    (including binary database replication logs) may be effectively
    corrupted as a result.

data=ordered:
  Data appended to a file will be written before the metadata extending
  the length of the file is written, and in certain cases the data will
  be written before file renames (partial ordering), but the data
  itself is unjournalled and may be only partially written at the time
  of a crash.
  + Does not write data to the media twice.
  + A crash or power failure will not leave stale, uninitialized data
    in files.
  - Data writes to files may only partially complete in the event of a
    crash. This is no problem for log files or self-journalled
    application databases, but other applications may see partial
    writes after a crash and need recovery.

data=journal:
  Data and metadata are both journalled, meaning that a given data
  write will either complete or never occur at all, although the
  precise ordering is not guaranteed. This also implies all of the
  data<=>metadata guarantees of data=ordered.
  + Direct I/O data writes are effectively "atomic", resulting in less
    likelihood of data loss for application databases which do not do
    their own journalling. This means that a power failure or system
    crash will not result in a partially-complete write.
  - Cached writes are not atomic.
  + For small cached file writes (of only a few filesystem pages) there
    is a good chance that kernel writeback will queue the entire write
    as a single I/O, and it will be "protected" as a result. This helps
    reduce the chance of serious damage to some text-based database
    files (such as those for some wikis), but is obviously not a
    guarantee.
  - All data is written to the block device twice (once to the
    filesystem journal and once to the data blocks). This may be
    especially bad for write-limited flash-backed devices.

Cheers,
Kyle Moffett
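For anyone following along, all of the modes above are selected at mount
time with the data= option; a sketch of an fstab entry (the device and
mountpoint names here are made up):

```
# Hypothetical device and mountpoint; note that switching data= modes
# requires a full umount/mount, and using data=journal on the root
# filesystem requires rootflags=data=journal from the bootloader.
/dev/sdb1  /srv/wiki  ext4  data=journal  0 2
```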
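As an aside: an application that wants whole-file update atomicity
regardless of the data= mode can get it with the usual
write-temp/fsync/rename pattern, since POSIX rename() atomically
replaces the target. A minimal sketch (the file name and helper are my
own invention, not from any particular wiki):

```python
import os

def atomic_write(path, data):
    """Replace path with data so that a crash leaves either the old or
    the new contents on disk, never a partial mix."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # force file data to stable storage
    os.rename(tmp, path)          # atomically swap in the new file
    # fsync the containing directory so the rename itself is durable
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

atomic_write("wiki-page.txt", b"new page contents\n")
```

This costs an extra copy of the data per update (much like
data=journal's double write), but it works on every journalling mode.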