Hi,

I'm not quite sure this is fully on-topic, apologies for the disturbance; maybe someone here has experience with this.

We've bought some new database servers with a MegaSAS MR9260-4i controller. Attached are two Seagate ST3600057SS (15k SAS disks) in a RAID-1 configuration (and an SSD as a single-disk RAID-0, but that is not in use here). The controller runs the most recent firmware (2.13).

On one of the systems we noticed absurdly slow write performance and kernel backtraces like:

[ 2041.527947] INFO: task scsi_id:2915 blocked for more than 120 seconds.
[ 2041.527951] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2041.527955] scsi_id D ffffffff81609d40 0 2915 2804 0x00000004
[ 2041.527962] ffff8803536a5b48 0000000000000082 ffff8803536a5a68 ffffffff00000000
[ 2041.527971] ffff8803536a5fd8 ffff8803536a5fd8 ffff8803536a4000 0000000000013780
[ 2041.527975] 0000000000013780 ffff8803536a5fd8 ffffffff81a0b020 ffff8803536a2500
[ 2041.527978] Call Trace:
[ 2041.527986] [<ffffffff81030a04>] ? do_page_fault+0x358/0x394
[ 2041.527990] [<ffffffff814fa80e>] ? common_interrupt+0xe/0x13
[ 2041.527994] [<ffffffff8104314b>] ? mutex_spin_on_owner+0x44/0x78
[ 2041.527998] [<ffffffff814f8f9e>] __mutex_lock_slowpath+0x116/0x18b
[ 2041.528002] [<ffffffff8112a4d8>] ? blkdev_open+0x0/0x6e
[ 2041.528004] [<ffffffff814f8996>] mutex_lock+0x18/0x2f
[ 2041.528007] [<ffffffff81129f5c>] __blkdev_get+0x73/0x348
[ 2041.528009] [<ffffffff8112a4d8>] ? blkdev_open+0x0/0x6e
[ 2041.528012] [<ffffffff8112a3f3>] blkdev_get+0x1c2/0x2a7
[ 2041.528016] [<ffffffff81109d91>] ? do_lookup+0x1da/0x288
[ 2041.528020] [<ffffffff81293035>] ? aufs_permission+0x27d/0x28f
[ 2041.528022] [<ffffffff8112a4d8>] ? blkdev_open+0x0/0x6e
[ 2041.528025] [<ffffffff8112a542>] blkdev_open+0x6a/0x6e
[ 2041.528028] [<ffffffff810fe9aa>] __dentry_open.isra.15+0x1ce/0x2e5
[ 2041.528031] [<ffffffff810ff700>] nameidata_to_filp+0x48/0x4f
[ 2041.528034] [<ffffffff8110b9d7>] finish_open+0xa1/0x155
[ 2041.528037] [<ffffffff8110aa8e>] ? do_path_lookup+0x69/0xcf
[ 2041.528039] [<ffffffff8110bed9>] do_filp_open+0x178/0x609
[ 2041.528043] [<ffffffff810da74f>] ? handle_mm_fault+0x262/0x275
[ 2041.528046] [<ffffffff810dcb18>] ? unmap_region+0x138/0x16d
[ 2041.528049] [<ffffffff811164ce>] ? alloc_fd+0x109/0x11b
[ 2041.528052] [<ffffffff810ff767>] do_sys_open+0x60/0xf9
[ 2041.528054] [<ffffffff810ff820>] sys_open+0x20/0x22
[ 2041.528058] [<ffffffff8100ab82>] system_call_fastpath+0x16/0x1b

mkfs.ext4 would take 20-30 minutes to create a 500GB filesystem, during which iostat showed 100% utilization of the logical disk. The system was also extremely slow to react; executing a program that had not run before (i.e. was not in the page cache) took up to 30 seconds during that mkfs run.

After some unsuccessful experiments with the IO scheduler I'm now reasonably sure that one of the disks is faulty. Rebuilding the volumes in different configurations gave:

RAID-0 out of [252:1]: good
RAID-0 out of [252:2]: bad
RAID-1 out of [252:1,252:2]: bad
^ mark [252:2] offline: good

In this case the detection was reasonably easy because the system wasn't in production yet, but I can't just destroy a volume every time. The problem is that I see no hint anywhere that this particular disk might have a problem:

# ./megacli -PDList -a0
Enclosure Device ID: 252
Slot Number: 2
Device Id: 4
Sequence Number: 9
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
[...]

There is nothing in the event log and nothing visible in -PhyErrorCounters either.
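For completeness, the checks I mean are roughly the following; both come back clean:

# ./megacli -AdpEventLog -GetEvents -f events.log -a0
# ./megacli -PhyErrorCounters -a0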
smartctl -d megaraid,0 /dev/sda does not work on this platform either (INQUIRY failed; smartctl version 2011-06-09 r3365).

What would be the best way to debug such a problem in the future? I have not yet been able to look at the WebBIOS interface because the machine is 50km away and the IP-KVM is broken, but I don't expect to see much there anyway.

Thanks,
Bernhard
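P.S.: One more thing I plan to try once I can reach the box again is iterating smartctl over the device IDs that megacli reports instead of assuming 0 (the suspect disk shows "Device Id: 4" above), something like

# for i in 0 1 2 3 4 5; do smartctl -i -d megaraid,$i /dev/sda; done

though given the INQUIRY failure above I'm not sure the passthrough works at all on this controller/kernel combination.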