> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> megaraid_sas driver hangs waiting for commands and then the filesystem
> unmounts, leaving the machine in an unusable state until there is a hard
> reboot (the machine is responsive but any access, shell or otherwise, is
> impossible without the filesystem). While I do not have much debugging
> information available, this happens to me about once every 6-7 days in
> my pool of seven machines, so I can probably get debugging info. Since
> the disk is offline and I can't get remote console, I don't have any
> details except something similar to Dave Lloyd's post, below.

Brett, is this still happening to you?

We're seeing this very sporadically, but it does concern us. We've seen
driver updates in 2.6.19 (v00.00.03.05) and a new Dell PERC 5/i firmware:

    Package Version  - 5.0.2-0003
    Firmware Version - 1.00.01-0157
    SASBIOS Version  - MT23
    Ctrl-R Version   - 1.02-007
    MPT Version      - 00.06.71.00-IT

and haven't been able to reproduce it since, but we have no test case
that reliably reproduces the problem, so we can't know that anything was
fixed (this is out of 31 identically configured Dell 2950s with the
PERC 5/i RAID controller, each configured with six 300GB SAS drives in a
RAID 5, most (all?) of them Maxtor Atlas 10k, no hot spare).

Our 2950s do have 16GB of RAM each, so the firmware update (which
mentions that it fixes DMA beyond 8GB) sounds promising, but I would
think that if that were the problem we were experiencing, we would
reproduce this much more often. We are certainly using the RAM for cache
and memory; it's not like we've never touched memory beyond 8GB.

Does anyone have a test case to reproduce this problem reliably, or a
detailed description of what actually happens (at a low level) when this
problem occurs that could help us build a test? We are more interested in
making this reproducible now than in finding a workaround...
if anyone has any tips on how to make this *more* likely to happen, we'd
like to know (so far, I know to try XFS and to enable ReadAhead).

We have seen this correlated with Patrol Reads going on at the same time,
but aren't sure whether that is a red herring, and we haven't been able
to force the issue to happen by enabling Patrol Reads.

We've only ever seen this on two machines - one machine reproduces the
problem in a little over a week, and the other has reproduced it a small
number of times. The machines that reproduce it run an experimental demo
workload, but so far we have not found a test case that reproduces the
problem on demand, which we would need to find or verify solutions.
We're currently swapping out machines to verify that there are no
hardware problems, but the machines diagnose themselves cleanly, and the
workload they run is different enough that the problem is likely
something about the workload that we can't yet synthesize into a test
case.

Thank you!

Joe Malicki
Software Engineer
Metacarta, Inc.
email: jmalicki@xxxxxxxxxxxxx

> The only thing that the machines with these failures seem to have in
> common is the fact that they are almost exclusively writes - they are
> slave database machines with large memory and pretty much just
> replicate. The read/write machines seem to have less failures.
>
> I am happy to help provide debugging information in any reasonable way.
> In the mean time, if there are any known suggestions or workarounds for
> the problem, I would be grateful for the guidance.
>
> Here are what details on the controller.
> If you want additional info, let me know exactly what you need and I
> will do what I can to get it to you.:
>
> Product Name    : PERC 5/i Integrated
> Serial No       : 12345
> FW Package Build: 5.0.1-0030
> FW Version      : 1.00.01-0088
> BIOS Version    : MT23
> Ctrl-R Version  : 1.02-007
>
> B-