I have a large JBOD attached to my server via an LSI SAS2308 PCI card (mpt2sas driver). I've got about 40 drives right now, assembled into 4 Linux software RAID sets, and I am using those RAID volumes as back-end devices for GPFS.

Everything was working fine about a week ago when I had 20 drives and 2 RAID volumes. Then I added 20 new disks, all the same model, and now I am frequently seeing all the devices behind the SAS card report device_blocked, immediately followed by device_unblocked. These events coincide with periods of many seconds with no data throughput, and they happen often enough to cause major throughput problems.

I have seen similar problems in the past, but they were accompanied by some kind of disk-specific error and I could fix the situation by removing the disk. In this case there are no other errors in any log besides the device_blocked and device_unblocked messages on every single device.

This system is not in production yet, so I can blow it all away if I need to, but I really want to understand what is causing this so that if it comes back once we go into production I'll be able to fix it without major disruption. I suspect there is a misbehaving drive, but nothing points to a single drive and I could be completely wrong about that. Does anybody have any clue where to look?

Here is what the error logs look like:

Jun 11 19:29:17 storage003 kernel: sd 6:0:0:0: device_blocked, handle(0x0016)
Jun 11 19:29:17 storage003 kernel: sd 6:0:1:0: device_blocked, handle(0x000b)
Jun 11 19:29:17 storage003 kernel: sd 6:0:2:0: device_blocked, handle(0x000c)
Jun 11 19:29:17 storage003 kernel: ses 6:0:3:0: device_blocked, handle(0x000e)
Jun 11 19:29:17 storage003 kernel: sd 6:0:4:0: device_blocked, handle(0x000f)
Jun 11 19:29:17 storage003 kernel: sd 6:0:5:0: device_blocked, handle(0x0010)
...
Same thing for the rest of the devices on host6

Jun 11 19:29:18 storage003 kernel: sd 6:0:0:0: device_unblocked and set to running, handle(0x0016)
Jun 11 19:29:18 storage003 kernel: sd 6:0:1:0: device_unblocked and set to running, handle(0x000b)
Jun 11 19:29:18 storage003 kernel: sd 6:0:2:0: device_unblocked and set to running, handle(0x000c)
Jun 11 19:29:18 storage003 kernel: ses 6:0:3:0: device_unblocked and set to running, handle(0x000e)
Jun 11 19:29:18 storage003 kernel: sd 6:0:4:0: device_unblocked and set to running, handle(0x000f)
Jun 11 19:29:18 storage003 kernel: sd 6:0:5:0: device_unblocked and set to running, handle(0x0010)
...
Same thing for the rest of the devices again.

Thanks,
Mike Robbert
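To check how long each stall lasts and whether every device is affected each time, the block/unblock messages can be paired up per device. What follows is only a rough sketch, assuming the syslog line format shown in this message; the function names `parse_events` and `summarize_stalls` and the `year` parameter are made up for illustration, and the year is needed only because syslog timestamps do not carry one.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line format, taken from the log excerpt in this message:
# Jun 11 19:29:17 storage003 kernel: sd 6:0:0:0: device_blocked, handle(0x0016)
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"\w+\s+(?P<addr>\d+:\d+:\d+:\d+):\s+"
    r"(?P<event>device_blocked|device_unblocked)\b"
)


def parse_events(lines, year=2000):
    """Yield (timestamp, scsi_address, event) for each block/unblock line.

    `year` is a placeholder because syslog timestamps omit the year.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            yield ts, m["addr"], m["event"]


def summarize_stalls(lines):
    """Pair each device_blocked with the next device_unblocked per device.

    Returns a dict mapping the block timestamp to a list of
    (scsi_address, seconds_blocked) tuples.
    """
    blocked_at = {}
    stalls = defaultdict(list)
    for ts, addr, event in parse_events(lines):
        if event == "device_blocked":
            blocked_at[addr] = ts
        elif addr in blocked_at:
            start = blocked_at.pop(addr)
            stalls[start].append((addr, (ts - start).total_seconds()))
    return dict(stalls)
```

Feeding it the lines of the kernel log (for example, the result of `open(path).readlines()` on whatever file syslog writes to on the system) gives one entry per stall. Comparing the device count and duration of each entry against the throughput gaps should show whether every event hits all devices on the host or only a subset.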