Oooh, that ~3 second patch sounds very interesting. I actually think that
the theory about timeouts causing the problem is correct. I didn't realize
that applications/fs calls could stall for that long. My NFS servers have
a timeout themselves of about 10 seconds before they start to try to shut
things down.
--David Dougall

On Thu, 20 Jan 2005, Mark Bellon wrote:

> Gordon Henderson wrote:
>
> >On Thu, 20 Jan 2005, David Dougall wrote:
> >
> >>Perhaps I was asking a stupid question or an obvious one, but I have
> >>received no response.
> >>Maybe if I simplify the question...
> >>
> >>If I am running software raid1 and a disk device starts throwing I/O
> >>errors, is the filesystem supposed to see any indication of this?
> >
> >No..
> >
> >> I
> >>thought software raid would mask all of this and just fail the drive.
> >
> >It should.
> >
> >>I have servers with xfs as the filesystem and xfs will start to throw I/O
> >>errors when a disk starts acting up even with software raid in between.
> >>Please advise on how I can confirm my setup or if this is possibly a bug
> >>how to diagnose further.
> >
> >I've experienced long delays (30 seconds? It seemed longer) in a system
> >when a disk fails for a genuine reason - (I've deliberately run badblocks
> >on an md device when I knew one of the underlying devices had genuine bad
> >blocks) maybe the md code really tries hard to read the block, maybe the
> >underlying device driver tries really hard), but in these cases, I've seen
> >the system more or less freeze (all processes accessing that device
> >anyway) until the raid code decided to kick the device out of the array.
>
> I've seen this too. The worst case can actually last for over 2 minutes.
>
> We've been running with a patch to the RAID 1 driver that handles this
> so critical applications do not hang for too long. Basically it uses
> timers in the RAID 1 driver to force the disk to be treated as actually
> having failed if it doesn't respond within a reasonable time (tunable
> but usually ~3 seconds). It then handles the I/O requests coming back
> async. and does the clean up.
>
> >Maybe XFS has a timer and doesn't like devices to "go away" for a long
> >period of time?
>
> Not that I know of but I would need to look. Any XFS wizard's comments?
>
> mark
>
> >>If it makes a difference, I am running linux-2.4.26
> >
> >I've used 2.4.x for a long time - I did try xfs about a year ago, but
> >wasn't happy with it all (for various reasons).
> >
> >Gordon
> >-
> >To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >the body of a message to majordomo@xxxxxxxxxxxxxxx
> >More majordomo info at http://vger.kernel.org/majordomo-info.html
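
For anyone trying to picture the timer idea Mark describes, here is a rough
user-space sketch in Python. It is not the RAID 1 patch itself (that code
isn't shown in this thread) and has nothing to do with the real md driver
internals. The names Mirror, TimedRaid1 and the "late" list are made up for
illustration, and the timeout is simply a constructor argument. The idea is
only this: give each read a deadline, treat the device as failed if it
misses the deadline, fall back to the other mirror, and deal with the late
completion asynchronously when it finally arrives.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class Mirror:
    """Hypothetical stand-in for one half of a RAID 1 pair."""

    def __init__(self, name, data, delay=0.0):
        self.name = name
        self.data = data      # block number -> contents
        self.delay = delay    # simulated seconds the device takes to answer
        self.failed = False

    def read(self, block):
        time.sleep(self.delay)  # a sick disk would stall here
        return self.data[block]


class TimedRaid1:
    """Reads from the first healthy mirror, with a per-request deadline."""

    def __init__(self, mirrors, timeout=3.0):
        self.mirrors = mirrors
        self.timeout = timeout  # assumed tunable, like the ~3 s in the thread
        self.pool = ThreadPoolExecutor(max_workers=len(mirrors))
        self.late = []          # names of devices whose I/O came back late
        self.lock = threading.Lock()

    def _late_completion(self, mirror):
        # Clean-up hook for a request that finished after its deadline.
        with self.lock:
            self.late.append(mirror.name)

    def read(self, block):
        for mirror in self.mirrors:
            if mirror.failed:
                continue
            future = self.pool.submit(mirror.read, block)
            try:
                return future.result(timeout=self.timeout)
            except FutureTimeout:
                # Deadline missed: treat the device as failed and move on.
                mirror.failed = True
                future.add_done_callback(
                    lambda f, m=mirror: self._late_completion(m)
                )
            except Exception:
                # The device returned an outright error.
                mirror.failed = True
        raise IOError("all mirrors failed")


if __name__ == "__main__":
    slow = Mirror("sda", {0: b"data"}, delay=0.5)
    good = Mirror("sdb", {0: b"data"})
    raid = TimedRaid1([slow, good], timeout=0.1)
    print(raid.read(0), "failed:", [m.name for m in raid.mirrors if m.failed])
    raid.pool.shutdown()
```

The caller waits for at most the timeout on the stalled device rather than
the minutes mentioned above; the price is that a slow but healthy disk can
be kicked out if the timeout is set too low.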