Hello all, Came across my first real "problem" in all the years I've been using Linux today and I'm stumped for an answer. I've recently built a back-up server on top of Centos 5.3 using a 3Ware 9650SE 4 port SATA card and 4 x 1.5TB drives in a RAID 5 array. I went for the 3Ware card as it has driver support already so any kernel updates won't require me to rebuild the drivers each time. My current kernel is 2.6.18-128.2.1.el5 x86_x64. My problems started a few weeks ago with some message issues with the 3Ware card causing it to stop accessing the mounted RAID array. The only way to resolve it was a cold boot of the machine. Then today I log-in to discover the RAIDed partition has been auto un-mounted by XFS itself. So at this point I'm not sure where the problem lies. Is it the card that is faulty/problematic or is it something with the XFS file system? I'll take the XFS problem first as it's the most recent. When I logged in today I found the following in my server logs: Sep 7 21:08:19 backup kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c. Caller 0xffffffff882e356b Sep 7 21:08:19 backup kernel: Sep 7 21:08:19 backup kernel: Call Trace: Sep 7 21:08:19 backup kernel: [<ffffffff882e1c7e>] :xfs:xfs_free_ag_extent+0x19f/0x67f Sep 7 21:08:19 backup kernel: [<ffffffff882e356b>] :xfs:xfs_free_extent+0xa9/0xc9 Sep 7 21:08:19 backup kernel: [<ffffffff882f02ad>] :xfs:xfs_bmap_finish+0xf0/0x169 Sep 7 21:08:19 backup kernel: [<ffffffff8830de46>] :xfs:xfs_itruncate_finish+0x172/0x2b3 Sep 7 21:08:19 backup kernel: [<ffffffff88326f22>] :xfs:xfs_inactive+0x22e/0x821 Sep 7 21:08:19 backup kernel: [<ffffffff8832dd66>] :xfs:xfs_validate_fields+0x24/0x4b Sep 7 21:08:19 backup kernel: [<ffffffff8832e361>] :xfs:xfs_vn_unlink+0x3d/0x4f Sep 7 21:08:19 backup kernel: [<ffffffff80063bb7>] mutex_lock+0xd/0x1d Sep 7 21:08:19 backup kernel: [<ffffffff883308fc>] :xfs:xfs_fs_clear_inode+0xa5/0xec Sep 7 21:08:19 backup kernel: [<ffffffff80022861>] clear_inode+0xd2/0x123 Sep 7 21:08:19 backup kernel: [<ffffffff8002f19a>] generic_delete_inode+0xde/0x143 Sep 7 21:08:19 backup kernel: [<ffffffff8003c1f6>] do_unlinkat+0xd5/0x141 Sep 7 21:08:19 backup kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0 Sep 7 21:08:19 backup kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Sep 7 21:08:19 backup kernel: Sep 7 21:08:19 backup kernel: xfs_force_shutdown(sdb1,0x8) called from line 4267 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_bmap.c. Return address = 0xffffffff882f02ea Sep 7 21:08:19 backup kernel: Filesystem "sdb1": Corruption of in-memory data detected. Shutting down filesystem: sdb1 Sep 7 21:08:19 backup kernel: Please umount the filesystem, and rectify the problem(s) Subsequent recovery was as follows: Sep 8 10:59:40 backup kernel: xfs_force_shutdown(sdb1,0x1) called from line 424 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_rw.c. Return address = 0xffffffff883277f1 Sep 8 10:59:40 backup kernel: xfs_force_shutdown(sdb1,0x1) called from line 424 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_rw.c. Return address = 0xffffffff883277f1 Sep 8 11:00:13 backup kernel: XFS mounting filesystem sdb1 Sep 8 11:00:13 backup kernel: Starting XFS recovery on filesystem: sdb1 (logdev: internal) Sep 8 11:00:14 backup kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1561 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_alloc.c. Caller 0xffffffff882e356b Sep 8 11:00:14 backup kernel: Sep 8 11:00:14 backup kernel: Call Trace: Sep 8 11:00:14 backup kernel: [<ffffffff882e1c7e>] :xfs:xfs_free_ag_extent+0x19f/0x67f Sep 8 11:00:14 backup kernel: [<ffffffff882e356b>] :xfs:xfs_free_extent+0xa9/0xc9 Sep 8 11:00:14 backup kernel: [<ffffffff883180e5>] :xfs:xlog_recover_finish+0x15a/0x244 Sep 8 11:00:14 backup kernel: [<ffffffff8831bb00>] :xfs:xfs_mountfs+0xa24/0xc30 Sep 8 11:00:14 backup kernel: [<ffffffff8000c394>] _atomic_dec_and_lock+0x39/0x57 Sep 8 11:00:14 backup kernel: [<ffffffff88321748>] :xfs:xfs_mount+0x762/0x83b Sep 8 11:00:14 backup kernel: [<ffffffff88330c79>] :xfs:xfs_fs_fill_super+0x0/0x1e3 Sep 8 11:00:14 backup kernel: [<ffffffff88330cf7>] :xfs:xfs_fs_fill_super+0x7e/0x1e3 Sep 8 11:00:14 backup kernel: [<ffffffff8006465d>] __down_write_nested+0x12/0x92 Sep 8 11:00:14 backup kernel: [<ffffffff800e72c1>] get_filesystem+0x12/0x3b Sep 8 11:00:14 backup kernel: [<ffffffff800de9b5>] sget+0x365/0x377 Sep 8 11:00:14 backup kernel: [<ffffffff800de301>] set_bdev_super+0x0/0xf Sep 8 11:00:14 backup kernel: [<ffffffff800de310>] test_bdev_super+0x0/0xd Sep 8 11:00:14 backup kernel: [<ffffffff800df2c4>] get_sb_bdev+0x10a/0x164 Sep 8 11:00:14 backup kernel: [<ffffffff800dec61>] vfs_kern_mount+0x93/0x11a Sep 8 11:00:14 backup kernel: [<ffffffff800ded2a>] do_kern_mount+0x36/0x4d Sep 8 11:00:14 backup kernel: [<ffffffff800e8bfb>] do_mount+0x6a7/0x717 Sep 8 11:00:14 backup kernel: [<ffffffff8002c7ef>] mntput_no_expire+0x19/0x89 Sep 8 11:00:14 backup kernel: [<ffffffff8000e87a>] link_path_walk+0xd3/0xe5 Sep 8 11:00:14 backup kernel: [<ffffffff80066bd2>] do_page_fault+0x4fe/0x830 Sep 8 11:00:14 backup kernel: [<ffffffff800076c0>] find_get_page+0x21/0x50 Sep 8 11:00:14 backup kernel: [<ffffffff80013445>] filemap_nopage+0x188/0x322 Sep 8 11:00:14 backup kernel: [<ffffffff80008b8f>] __handle_mm_fault+0x51d/0xe5c Sep 8 11:00:14 backup kernel: [<ffffffff8002331e>] __user_walk_fd+0x41/0x4c Sep 8 11:00:14 backup kernel: [<ffffffff800c877d>] zone_statistics+0x3e/0x6d Sep 8 11:00:14 backup kernel: [<ffffffff8000f10a>] __alloc_pages+0x65/0x2ce Sep 8 11:00:14 backup kernel: [<ffffffff800e0e13>] sys_readlinkat+0x98/0xa9 Sep 8 11:00:14 backup kernel: [<ffffffff8004bf75>] sys_mount+0x8a/0xcd Sep 8 11:00:14 backup kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Sep 8 11:00:14 backup kernel: Sep 8 11:00:14 backup kernel: Ending XFS recovery on filesystem: sdb1 (logdev: internal) Sep 8 11:03:45 backup kernel: XFS mounting filesystem sdb1 The 3Ware card issue occured about a month ago but hasn't returned since: Aug 9 05:56:32 backup kernel: sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card. Aug 9 05:56:35 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 05:57:37 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence. Aug 9 05:58:19 backup kernel: 3w-9xxx: scsi4: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0. Aug 9 05:58:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 05:59:49 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 05:59:49 backup kernel: sd 4:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card. Aug 9 06:00:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:01:19 backup last message repeated 2 times Aug 9 06:01:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:01:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset. Aug 9 06:01:49 backup kernel: sd 4:0:0:0: scsi: Device offlined - not ready after error recovery Aug 9 06:01:49 backup last message repeated 6 times Aug 9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device Aug 9 06:01:49 backup last message repeated 7 times Aug 9 06:01:49 backup kernel: sd 4:0:0:0: SCSI error: return code = 0x00010000 Aug 9 06:01:49 backup kernel: end_request: I/O error, dev sdb, sector 4394574486 Aug 9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device Aug 9 06:01:49 backup last message repeated 2 times Aug 9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x105efe274 ("xlog_iodone") error 5 buf count 30208 Aug 9 06:01:49 backup kernel: xfs_force_shutdown(sdb1,0x2) called from line 956 of file /home/buildsvn/rpmbuild/BUILD/xfs-kmod-0.4/_kmod_build_/xfs_log.c. Return address = 0xffffffff88313780 Aug 9 06:01:49 backup kernel: Filesystem "sdb1": Log I/O Error Detected. Shutting down filesystem: sdb1 Aug 9 06:01:49 backup kernel: Please umount the filesystem, and rectify the problem(s) Aug 9 06:01:49 backup kernel: sd 4:0:0:0: rejecting I/O to offline device Aug 9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x1446a8 ("xfs_trans_read_buf") error 5 buf count 4096 Aug 9 06:01:49 backup kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x1c16a8 ("xfs_trans_read_buf") error 5 buf count 4096 Aug 9 06:02:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 06:03:04 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 06:03:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:04:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:04:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 06:05:19 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 06:05:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:06:19 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:06:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Aug 9 06:07:34 backup kernel: 3w-9xxx: scsi4: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. Aug 9 06:08:04 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:08:34 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence. Aug 9 06:08:49 backup kernel: 3w-9xxx: scsi4: ERROR: (0x06:0x0010): Microcontroller Error: clearing. repeat last three lines ad infinitum. Can anyone shed any light as to what the issue might be? Is it XFS or is it the card? I went with XFS as all the research I did indicated that it gave the best performance for the millions of small files in the backups we run and doesn't have any many issues with hardlinks as EXT3. - J Share your memories online with anyone you want anyone you want. |
_______________________________________________ CentOS mailing list CentOS@xxxxxxxxxx http://lists.centos.org/mailman/listinfo/centos