On Sun, May 18, 2008 at 10:03:55PM +0200, Andi Kleen wrote:
> Eric Sandeen <sandeen@xxxxxxxxxx> writes:
> >
> > Right, that was the plan.  I wasn't really going to stand there and pull
> > the plug.  :) I'd like to get to "out of $NUMBER power-loss events
> > under this usage, I saw $THIS corruption $THISMANY times ..."
>
> I'm not sure how good such exact numbers would do. Surely power down
> behaviour that would depend on the exact disk/controller/system
> combination? Some might be better at getting data out at
> power less, some might be worse.

Given how rarely people have reported problems, I think it's a really
good idea to understand exactly what our exposure is for
$COMMON_HARDWARE.  And I suspect the biggest question isn't the
hardware, but the workload.  Here are the questions I think are worth
asking:

* How often can we get corruption on a common desktop workload?  Given
  that we're mostly kernel developers, and kernbench is probably a
  worst case for desktops, that's a useful benchmark.

* What is the performance hit on a common desktop workload (let's use
  kernbench for consistency)?

* How often can we get corruption on a hard-core enterprise
  application with lots of fsync()'s (e.g., postmark, et al.)?

* What is the performance hit on an fsync()-heavy workload?

I have a feeling that the likelihood of corruption when running
kernbench is minimal, but the performance hit is probably minimal as
well; and that the potential for corruption is higher for an
fsync-heavy workload, but that's also where we are seeing the
(reported) 30% hit.

The other thing we should consider is that I suspect we can do much
better for ext4, given that we have journal checksums.  As Chris
pointed out, right now, with barriers turned on, we are doing this:

    write log blocks
    flush #1
    write commit block
    flush #2
    write metadata blocks

If we don't mind mixing bh and bio functions, we could change it to
this for ext4 (when journal checksumming is enabled):

    write log blocks
    write commit block
    flush (via submitting an empty barrier block I/O request)
    write metadata blocks

This should hopefully reduce the performance hit by about half, since
we're eliminating one of the two flushes.  (A small userspace sketch
contrasting the two orderings is appended at the end of this message.)
Even more interesting would be delaying the flush until right before
we attempt to write the metadata blocks, and allowing data writes
which don't require metadata updates through in the meantime.  That
should be safe, even in data=ordered mode.

The point is that we should think about ways to optimize barrier mode
for ext4.  If we do this, people may find it interesting to mount ext3
filesystems using ext4, even without making any additional changes,
because of the better speed/safety tradeoffs.

						- Ted
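
P.S.  For anyone who wants to play with the ordering difference from
userspace, here is a minimal sketch.  To be clear, this is *not* the
jbd/jbd2 commit path; pwrite() into an ordinary file and fdatasync()
are just stand-ins for submitting the journal buffers and issuing a
block-layer cache flush, and the file name, fill bytes, and block
numbers are made up for illustration.

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096

/* Write one BLKSZ "block" at the given block number, filled with 'fill'. */
static void write_block(int fd, off_t blocknr, char fill)
{
	char buf[BLKSZ];

	memset(buf, fill, sizeof(buf));
	if (pwrite(fd, buf, BLKSZ, blocknr * BLKSZ) != BLKSZ)
		perror("pwrite");
}

/* Current barrier-on commit sequence: two flushes per transaction. */
static void commit_two_flushes(int fd)
{
	write_block(fd, 0, 'L');	/* log blocks */
	fdatasync(fd);			/* flush #1 */
	write_block(fd, 1, 'C');	/* commit block */
	fdatasync(fd);			/* flush #2 */
	write_block(fd, 2, 'M');	/* metadata write-back */
}

/*
 * Proposed sequence when journal checksums are enabled: the commit
 * block goes out together with the log blocks, and a single flush
 * orders the whole journal write ahead of the metadata write-back.
 */
static void commit_one_flush(int fd)
{
	write_block(fd, 0, 'L');	/* log blocks */
	write_block(fd, 1, 'C');	/* commit block */
	fdatasync(fd);			/* the one remaining flush */
	write_block(fd, 2, 'M');	/* metadata write-back */
}

int main(void)
{
	int fd = open("journal-demo.img", O_RDWR | O_CREAT, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	commit_two_flushes(fd);
	commit_one_flush(fd);
	close(fd);
	return 0;
}

The real change would of course live in the jbd2 commit code, with the
single flush issued as an empty barrier block I/O request as described
above; the sketch only shows the ordering of the writes and flushes.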