end-to-end error recovery musings

In the IO/FS workshop, one idea we kicked around is the need to provide better and more specific error messages between the IO stack and the file system layer.

My group has been working to stabilize a relatively up-to-date libata + MD based box, so I can try to lay out at least one "appliance-like" typical configuration to help frame the issue. We are working on a relatively large appliance, but you can buy (or build) similar home appliances that use Linux to provide a NAS-in-a-box for end users.

The use case that we have is on an ICH6R/AHCI box with 4 large (500+ GB) drives, with some of the small system partitions on a 4-way RAID1 device. The libata version we have is a backport of 2.6.18 onto SLES10, so the error handling at the libata level is a huge improvement over what we had before.

Each box has a watchdog timer that can be set to fire after at most 2 minutes.
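
For reference, arming that watchdog is just the standard /dev/watchdog character device. A minimal sketch, assuming the stock Linux watchdog interface; the 120-second timeout and 30-second petting interval are illustrative values, not our production daemon:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

int main(void)
{
	int timeout = 120;	/* seconds until the board resets us (example) */
	int fd = open("/dev/watchdog", O_WRONLY);

	if (fd < 0) {
		perror("open /dev/watchdog");
		return 1;
	}
	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) < 0)
		perror("WDIOC_SETTIMEOUT");
	printf("watchdog armed, timeout %d seconds\n", timeout);

	/* Pet the watchdog faster than the timeout or the box reboots. */
	for (;;) {
		ioctl(fd, WDIOC_KEEPALIVE, 0);
		sleep(30);
	}
}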

(We have a second flavor of this box with an ICH5 and P-ATA drives using the non-libata drivers that has a similar use case).

Using the patches that Mark sent around recently for error injection, we inject media errors into one or more drives and try to see how smoothly error handling runs and, importantly, whether or not the error handling will complete before the watchdog fires and reboots the box. If you want to be especially mean, inject errors into the RAID superblocks on 3 out of the 4 drives.
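
I won't reproduce the injection interface itself here, but if you want to aim at the RAID superblocks, the v0.90 superblock sits in the last 64 KiB-aligned 64 KiB chunk of each component device. A small sketch that prints the sector to target, assuming 0.90 metadata and 512-byte sectors:

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>			/* BLKGETSIZE64 */

#define MD_RESERVED_SECTORS	128ULL	/* 64 KiB in 512-byte sectors */

int main(int argc, char **argv)
{
	uint64_t bytes, sectors, sb_sector;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <md component device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, BLKGETSIZE64, &bytes) < 0) {
		perror(argv[1]);
		return 1;
	}
	sectors = bytes / 512;
	/* Same arithmetic MD uses to place the 0.90 superblock: the last
	 * 64 KiB-aligned 64 KiB chunk below the end of the device. */
	sb_sector = (sectors & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS;
	printf("%s: v0.90 superblock starts at sector %llu\n",
	       argv[1], (unsigned long long)sb_sector);
	close(fd);
	return 0;
}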

We still have the following challenges:

(1) Read-ahead often means that we will retry every bad sector at least twice from the file system level. The first time, the fs read-ahead request triggers a speculative read that includes the bad sector (triggering the error handling mechanisms) right before the read issued by the real application does the same thing. Not sure what the answer is here, since read-ahead is obviously a huge win in the normal case (one thing to experiment with is sketched after this list).

(2) The patches that were floating around for correctly handling single-sector errors inside a large IO request are critical. On one hand, we want to combine adjacent IO requests into larger IOs whenever possible. On the other hand, when the combined IO fails, we need to isolate the error to the correct range, avoid reissuing a request that touches that sector again, and communicate up the stack to the file system/MD what really failed. All of this needs to complete in tens of seconds, not multiple minutes (a toy illustration of the isolation idea follows this list).

(3) The timeout values on the failed IOs need to be tuned well (as was discussed in an earlier linux-ide thread). We cannot afford to hang for 30 seconds, especially in the MD case, since you might need to fail more than one device for a single IO. Prompt error propagation (say that 4 times quickly!) can allow MD to mask the underlying errors as you would hope; hanging on too long will almost certainly cause a watchdog reboot... (a sketch for checking and lowering the per-device timeout follows this list).

(4) The newish libata+SCSI stack is pretty good at handling disk errors, but adding in MD actually can reduce the reliability of your system unless you tune the error handling correctly.
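
Re (1): one mitigation we can at least measure (not a fix) is shrinking the block-layer read-ahead window on a drive that is known to be throwing media errors, so the speculative read stops hitting the same bad sector just ahead of the real read. A sketch using the BLKRAGET/BLKRASET ioctls; the device path and the value of 0 are only examples:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* BLKRAGET, BLKRASET */

int main(int argc, char **argv)
{
	long old_ra = 0;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <block device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, BLKRAGET, &old_ra) < 0) {
		perror(argv[1]);
		return 1;
	}
	printf("%s: read-ahead was %ld sectors\n", argv[1], old_ra);

	/* Drop read-ahead to 0 sectors while the drive is sick; write the
	 * old value back the same way once error handling has settled. */
	if (ioctl(fd, BLKRASET, 0UL) < 0)
		perror("BLKRASET");
	close(fd);
	return 0;
}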
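
Re (2): purely to illustrate the "narrow the error down, don't reissue it" idea, here is a user-space sketch that bisects a failed large read to find the first unreadable sector instead of retrying the whole range. The real work has to happen in the block/bio completion path, and page-cache reads are page-granular, so treat this as a cartoon of the algorithm rather than a tool:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR 512

/* Return the byte offset of the first unreadable sector in
 * [off, off + len), or -1 if the whole range reads cleanly.
 * len must be a multiple of SECTOR. */
static off_t first_bad_sector(int fd, void *buf, off_t off, size_t len)
{
	if (len == SECTOR)
		return pread(fd, buf, SECTOR, off) == SECTOR ? -1 : off;
	if (pread(fd, buf, len, off) == (ssize_t)len)
		return -1;	/* the big read worked, nothing to isolate */

	/* The big read failed: bisect instead of reissuing the whole IO. */
	size_t half = (len / 2 / SECTOR) * SECTOR;
	off_t bad = first_bad_sector(fd, buf, off, half);
	return bad >= 0 ? bad : first_bad_sector(fd, buf, off + half, len - half);
}

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <device> <byte offset> <length>\n",
			argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	off_t off = strtoll(argv[2], NULL, 0);
	size_t len = strtoull(argv[3], NULL, 0);
	void *buf = malloc(len);

	if (fd < 0 || !buf) {
		perror(argv[1]);
		return 1;
	}
	off_t bad = first_bad_sector(fd, buf, off, len);
	if (bad < 0)
		printf("range reads cleanly\n");
	else
		printf("first failing sector at byte offset %lld\n",
		       (long long)bad);
	free(buf);
	close(fd);
	return 0;
}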
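
Re (3): the first knob is the per-device SCSI command timeout that the libata/SCSI stack exposes in sysfs (in seconds). A sketch that prints the current value and writes a lower one; the sysfs path and the 7-second value are examples, not a recommendation:

#include <stdio.h>

int main(int argc, char **argv)
{
	int old_timeout = 0, new_timeout = 7;	/* 7 seconds: example only */
	FILE *f;

	if (argc != 2) {
		/* e.g. /sys/block/sda/device/timeout */
		fprintf(stderr, "usage: %s <sysfs timeout file>\n", argv[0]);
		return 1;
	}

	f = fopen(argv[1], "r");
	if (!f || fscanf(f, "%d", &old_timeout) != 1) {
		perror(argv[1]);
		return 1;
	}
	fclose(f);
	printf("%s: SCSI command timeout was %d seconds\n", argv[1], old_timeout);

	f = fopen(argv[1], "w");
	if (!f) {
		perror(argv[1]);
		return 1;
	}
	fprintf(f, "%d\n", new_timeout);
	fclose(f);
	printf("%s: SCSI command timeout now %d seconds\n", argv[1], new_timeout);
	return 0;
}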

We will follow up with specific issues as they arise, but I wanted to lay out a use case that can help frame part of the discussion. I also want to encourage people to inject real disk errors with the Mark patches so we can share the pain ;-)

ric



