On Tue, Nov 22, 2011 at 1:53 PM, Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> And this was the first indication of trouble.
>
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.214692]
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.227313] Pid: 11196, comm: ceph-osd Not tainted 3.1.0-dho-00004-g1ffcb5c-dirty #1
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.235056] Call Trace:
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.237530] [<ffffffff811d606e>] ? xfs_free_ag_extent+0x4e3/0x698
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.243717] [<ffffffff811d6b71>] ? xfs_free_extent+0xb6/0xf9
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.249468] [<ffffffff811d3034>] ? kmem_zone_alloc+0x58/0x9e
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.255220] [<ffffffff812095f9>] ? xfs_trans_get_efd+0x21/0x2a
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.261159] [<ffffffff811e2011>] ? xfs_bmap_finish+0xeb/0x160
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.266993] [<ffffffff811f8634>] ? xfs_itruncate_extents+0xe8/0x1d0
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.273361] [<ffffffff811f879f>] ? xfs_itruncate_data+0x83/0xee
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.279362] [<ffffffff811cb0a2>] ? xfs_setattr_size+0x246/0x36c
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.285363] [<ffffffff811cb1e3>] ? xfs_vn_setattr+0x1b/0x2f
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.291031] [<ffffffff810e7875>] ? notify_change+0x16d/0x23e
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.296776] [<ffffffff810d2982>] ? do_truncate+0x68/0x86
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.302172] [<ffffffff810d2b11>] ? sys_truncate+0x171/0x173
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.307846] [<ffffffff8166c07b>] ? system_call_fastpath+0x16/0x1b
>> Nov 17 16:01:01 cephstore6358 kernel: [ 214.314031] XFS (sdg1): xfs_do_force_shutdown(0x8) called from line 3864 of file fs/xfs/xfs_bmap.c. Return address = 0xffffffff811e2046
>
> By here it had shut down, and you were just riding along when
> it went kablooey. Any non-xfs error just before this point?

Nope, nothing from anybody else.

On Tue, Nov 22, 2011 at 2:11 PM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> On Tue, Nov 22, 2011 at 10:47:24AM -0800, Gregory Farnum wrote:
>> Barriers on (at least, nobody turned them off); the RAID card is
>> battery-backed; here are megacli dumps:
>> http://pastebin.com/yTskgzWG
>> http://pastebin.com/ekhczycy
>
> I had a lot of issues with megaraid cards and their unsafe caching
> settings, up to the point that I'd recommend staying away from them
> now. Can you check in the megacli config if the _disk_ write caches
> are enabled? megaraid adapters used to do that a lot, and given that
> the disk cache isn't battery-backed it's fairly fatal.
>
> I think in your dump this one might be the culprit given that SATA
> disks outside of a few niches come with a writeback cache policy:
>
> Disk Cache Policy: Disk's Default
>
> Try changing that to an explicit writethrough mode - and maybe try
> running a crash data integrity test like
>
> http://www.complang.tuwien.ac.at/anton/hdtest/
>
> on this controller.

We're going to look into this in more detail very shortly. Right now
all I can tell you is that none of the drives ever actually lost
power, so unless something is explicitly telling them to clear their
caches, I don't know how the drives could have lost their cache
contents to cause a problem like this.

But for now I'll just see what I can get by zeroing out the log, and
we'll get back to you again if we manage to reproduce this in a
situation where we can tell you more definitively about the caching
and barriers.

Thanks guys,
-Greg

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
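
For anyone repeating the check Christoph asks for, here is a rough sketch of scanning a saved megacli dump for the "Disk Cache Policy" setting. It is only an illustration: the function names are made up for this note, the sample text is hypothetical apart from the "Disk's Default" line quoted above, and treating "Disabled" as the only safe value is an assumption about how the tool reports an explicitly turned-off disk cache.

```python
import re

# Matches lines such as "Disk Cache Policy: Disk's Default" (the line quoted
# in the thread). Whitespace around the colon is tolerated.
_POLICY_RE = re.compile(r"^\s*Disk Cache Policy\s*:\s*(.+?)\s*$", re.MULTILINE)


def disk_cache_policies(dump_text):
    """Return every 'Disk Cache Policy' value found in the dump, in order."""
    return _POLICY_RE.findall(dump_text)


def risky_policies(dump_text, safe_values=("Disabled",)):
    """Return the policy values that are not explicitly listed as safe.

    Assumption: an explicit "Disabled" means the drive write cache is off;
    anything else (including "Disk's Default") leaves it up to the drive.
    """
    return [p for p in disk_cache_policies(dump_text) if p not in safe_values]


if __name__ == "__main__":
    # Hypothetical excerpt; only the "Disk's Default" line comes from the thread.
    sample = (
        "Virtual Drive: 0\n"
        "Disk Cache Policy: Disk's Default\n"
        "Virtual Drive: 1\n"
        "Disk Cache Policy: Disabled\n"
    )
    for policy in risky_policies(sample):
        print("check this setting:", policy)
```

This only flags settings worth a second look; it says nothing about whether the drives actually honor cache flushes, which is what the crash data integrity test linked above is meant to exercise.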