On Mon, Nov 21, 2011 at 04:21:30PM -0800, Gregory Farnum wrote:
> I lied a little bit — turns out an admin restarted the node with
> reboot -fn.

FWIW, that reboot command does this:

       -n     Don't sync before reboot or halt. Note that the kernel and
              storage drivers may still sync.

       -f     Force halt or reboot, don't call shutdown(8).

In other words, your admin told the system to shut down without syncing
the data or running the shutdown scripts that sync data. That is, it
forces an immediate reboot while the system is still active, causing an
unclean shutdown and guaranteed loss of any unwritten data.

> But I've been assured this shouldn't have been able to
> corrupt the filesystem, so troubleshooting continues.

That depends entirely on your hardware. Are you running with barriers
enabled? If you don't have barriers active, then metadata corruption is
entirely possible in this scenario, especially if the hardware does a
drive reset or power cycle during the reboot procedure.

Even with barriers, there are RAID controllers that enable the back-end
drive caches; those caches don't get flushed and hence can cause
corruption on unclean shutdowns.

IOWs, I'd be looking at how your storage is configured and ruling that
out as a cause before even trying to look at the filesystem...

> On Mon, Nov 21, 2011 at 2:13 PM, Ben Myers <bpm@xxxxxxx> wrote:
> > Hey Greg,
> >
> > It might be useful if you can provide an xfs_metadump of the filesystem.
> >
> > xfs_metadump /dev/foo - | bzip2 > /tmp/foo.bz2
>
> Sure. I posted it at ceph.newdream.net/sdg1.bz2
> Thanks!
>
> On Mon, Nov 21, 2011 at 1:52 PM, Emmanuel Florac <eflorac@xxxxxxxxxxxxxx> wrote:
> > xfs_check is mostly useless nowadays, use "xfs_repair -n" instead. At
> > this stage, there's probably not much you can do but an "xfs_repair -L"
> > to zero the log. Hope for the better.
>
> root@cephstore6358:~# xfs_repair -n /dev/sdg1
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
> block (1,7800040-7800040) multiply claimed by cnt space tree, state - 2
> agf_freeblks 80672443, counted 80672410 in ag 1
> sb_icount 64, counted 251840
> sb_ifree 61, counted 66
> sb_fdblocks 462898325, counted 358494731
.....

All these errors are likely to be caused by the fact that log replay
has not completed. The only one that is suspect is the first one:

> block (1,7800040-7800040) multiply claimed by cnt space tree, state - 2

But there's no way the cause of that can be determined after the fact
from a metadump....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
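
A quick way to check the barrier question Dave raises above, as a
sketch (the device name /dev/sdg1 is taken from the repair output in
the thread; hdparm only reports the drive's own cache, not what a RAID
controller does behind it):

    # XFS uses barriers by default; "nobarrier" in the mount options
    # means they were explicitly disabled:
    grep sdg1 /proc/mounts

    # Check whether the drive's volatile write cache is enabled:
    hdparm -W /dev/sdg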
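
For anyone wanting to examine the metadump Greg posted, the companion
tool is xfs_mdrestore; a minimal sketch (file names are illustrative):

    # Rebuild a sparse image containing only the metadata, then run the
    # same read-only repair pass against it:
    bunzip2 -c sdg1.bz2 > sdg1.metadump
    xfs_mdrestore sdg1.metadump sdg1.img
    xfs_repair -n -f sdg1.img

The -f flag tells xfs_repair the target is a regular file rather than a
block device.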
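
On Emmanuel's "xfs_repair -L" suggestion: the usual order of operations,
sketched here, is to let the kernel replay the log by mounting first,
and only zero the log if that fails, since -L discards whatever updates
are still sitting in the log (device and mount point are examples):

    # Mounting replays the log if it is replayable:
    mount /dev/sdg1 /mnt && umount /mnt

    # Then verify:
    xfs_repair -n /dev/sdg1

    # Only if the mount fails: zero the log and repair. Transactions
    # that existed only in the log are lost.
    xfs_repair -L /dev/sdg1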