On Tue, Jul 07, 2009 at 06:16:23PM +0000, Evan King wrote:
> Hello all,
>
> I'm administering a small computing cluster on new off-the-shelf hardware. The
> configuration is a master-slaves setup with the master serving nfs for the data
> synchronization and performing the data re-assembly process (as well as doing
> some slave work as well).
>
> The workload produces a fairly steady I/O workload, but not particularly heavy.
> While I originally pushed for specialized storage hardware or configurations,
> testing and benchmarking showed that the workload appeared quite manageable for
> a single disk. I expected it might experience a short lifespan, but on the
> order of several months at least. To spare the disk as much thrashing as
> possible, I opted for ext4.
>
> In the first week of active deployment (and while I was on vacation), the master
> experienced a very strange form of catastrophic failure. A job had failed after
> only a couple hours, and serious errors blocked further work. Several core GNU
> tools in /bin were corrupted, such as: mv, rm, uname, hostname, pwd. A couple
> 0-byte files existed in / with scrambled filenames, and plenty of Unicode
> characters splattered across the screen during reboot. The reboot itself
> reached a login prompt, but wouldn't accept any input. But this is where things
> get strange.
>
> I used a liveCD to perform disk checks, and there were no filesystem errors of
> *any* kind. The entire filesystem was and is in pristine condition. While I'm
> aware of discussion and issues surrounding some of the design decisions made for
> ext4 (such as delayed write allocation), it doesn't seem possible that those
> issues could be related to this kind of failure (data written without permission
> or any attempt to do so). The corrupted binaries were in fact corrupted on
> disk, not just in memory (also unreadable by readelf), and larger than the
> originals.
> The software I was using runs from a user-level account and has an
> apache-served web interface with apache dropping permissions to that same user.
> Nothing but the kernel itself had permission to write to the files that were
> corrupted, however the computing software does execute (I think all of) the
> commands that were corrupted.
>
> I have saved copies of several of the corrupted files, but neglected to save any
> system logs before restoring a backup. There are still some strange messages
> appearing during startup, but they fly by too quickly to see, and nothing seems
> amiss in the logs except that /var/log/messages seems extremely verbose with
> startup and has many references to initializing ext4 (but nothing sounds like an
> error). I'm about to tell my users to start using it again and will be
> expecting and watching for a repeat performance. The disk itself appears to be
> fine.
>
> _____
>
> So my questions are these:
>
> - How likely is it that some arcane bug in ext4 is responsible for the failure?

Can you check whether your kernel has this patch?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4

-aneesh