Hi Neil ..

"Also sprach Neil Brown:"
> On Wednesday August 16, ptb@xxxxxxxxxxxxxx wrote:
> > 1) I would like raid request retries to be done with exponential
> >    delays, so that we get a chance to overcome network brownouts.
> >
> > 2) I would like some channel of communication to be available
> >    with raid that devices can use to say that they are
> >    OK and would they please be reinserted in the array.
> >
> > The latter is the RFC thing (I presume the former will either not
> > be objectionable or Neil will say "there's no need since you're wrong
> > about the way raid does retries anyway").
>
> There's no need since you're ..... you know the rest :-)

Well, sort of. OK, let's see ...

> When md/raid1 gets a read error it immediately retries the request in
> small (page size) chunks to find out exactly where the error is (it
> does this even if the original read request is only one page).

OK, I didn't know that. But do you mean a read request to the RAID
device, or a read request to the underlying disk device? The latter
might form part of the implementation of a write request to the RAID
device.

(Has the MD blocksize moved up to 4K, then? It was at 1KB for years.)

> When it hits a read error during retry, it reads from another device
> (if it can find one that works) and writes what it got out to the
> 'faulty' drive (or drives).

OK, that mechanism I was aware of.

> If this works: great.
> If not, the write error causes the drive to be kicked.

Yerrs, that's also what I thought.

> I'm not interested in putting any delays in there. It is simply the
> wrong place to put them. If network brownouts might be a problem,
> then the network driver gets to care about that.

I think you might want to reconsider (not that I know the answer).

1) If the network disk device has decided to shut down wholesale
   (temporarily) because of lack of contact over the net, then retries
   and writes are _bound_ to fail for a while, so there is no point in
   sending them now. You'd really do infinitely better to wait a
   while.

2) If the network device instead blocks individual requests for, say,
   10s while waiting for an ack, then times them out, there is more
   chance of everything continuing to work, since the 10s might be
   long enough for the net to recover in. But occasionally a single
   timeout will occur and you will boot the device from the array
   (whereas waiting a bit longer would have been the right thing to
   do, if only we had known). Change 10s to any reasonable length of
   time. You think the device has become unreliable because a write
   failed, but it hasn't ... that's just the net. Try again later! If
   you like, we can set the request's error to -ETIMEDOUT to signal
   it. Real remote write breakage can be signalled with -EIO or
   something. Only boot the device on -EIO.

3) If the network device blocks essentially forever, waiting for a
   reconnect, experience says that users hate that. I believe the md
   array gets stuck somewhere here (from reports), possibly in trying
   to read the superblock of the blocked device.

4) What the network device driver wants to do is to be able to tell
   the difference between primary requests and retries, and to delay
   retries (or repeat them internally) with some reasonable backoff
   scheme, giving them more chance of working in the face of a
   brownout - but it has no way of doing that. You can make the
   problem go away by delaying retries yourself. (Is there a "timedue"
   field in requests, as well as a timeout field? If so, maybe that
   can be used to signal what kind of a request it is and how it
   should be treated.) A sketch of the policy I mean follows.
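To be concrete about (2) and (4): here is a minimal sketch of the
policy, in plain C. Everything named here (submit_once(),
MAX_ATTEMPTS, the simulated brownout) is a stand-in of mine, not md or
driver internals, and in the kernel the wait would of course be a
deferred resubmission rather than a sleep(). The two points are that
only -EIO condemns the device, and that a timeout earns an
exponentially growing delay before the next attempt:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_ATTEMPTS 6

    /* Stand-in for resubmitting a request to the network block
     * device.  Here it simulates a brownout that clears after three
     * attempts. */
    static int submit_once(int attempt)
    {
        return attempt < 3 ? -ETIMEDOUT : 0;
    }

    /* The policy: only -EIO condemns the device; -ETIMEDOUT earns an
     * exponentially growing delay before the retry. */
    static int retry_with_backoff(void)
    {
        unsigned int delay = 1;           /* seconds; doubles each round */
        int attempt, err = -EIO;

        for (attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            err = submit_once(attempt);
            if (err == 0)
                return 0;                 /* net came back: carry on     */
            if (err != -ETIMEDOUT)
                return err;               /* real breakage: kick device  */
            fprintf(stderr, "brownout? backing off %us\n", delay);
            sleep(delay);
            delay *= 2;                   /* 1s, 2s, 4s, 8s, ...         */
        }
        return err;                       /* net never came back         */
    }

    int main(void)
    {
        return retry_with_backoff() ? 1 : 0;
    }

It doesn't much matter whether md does this or the network block
driver does it internally; what matters is that -ETIMEDOUT and -EIO
stay distinguishable, so that only the latter boots the device.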
> Point 2 should be done in user-space.

It's not reliable - we will be under memory pressure at this point,
with all that implies; the raid device might be the very device on
which the file system sits, etc. Pick your poison!

>  - notice device have been ejected from array
>  - discover why. act accordingly.
>  - if/when it seems to be working again, add it back into the array.
>
> I don't see any need for this to be done in the kernel.

Because there might not be any userspace (embedded device), and
userspace might be blocked via subtle or not-so-subtle deadlocks.
There's no harm in making it easy! /proc/mdstat is presently too hard
to parse reliably, I am afraid - minor differences in presentation
arise in it for reasons I don't understand!

> > The way the old FR1/5 code worked was to make available a couple of
> > ioctls.
> >
> > When a device got inserted in an array, the raid code told the device
> > via a special ioctl it assumed the device had that it was now in an
> > array (this triggers special behaviours, such as deliberately becoming
> > more error-prone and less blocky, on the assumption that we have got
> > good comms with raid and can manage our own raid state). Ditto
> > removal.
>
> A bit like BIO_RW_FASTFAIL? Possibly md could make more use of that.

It was a different one, but yes, that would have done. The FR1/5 code
also needed to be told WHICH array it was in, so that it could send
ioctls (HOT_REPAIR, or such) to the right md device later, when it
felt well again. And it needed to be told when it was ejected from the
array, so as not to do that next time ...

> I haven't given it any serious thought yet. I don't even know what
> low level devices recognise it or what they do in response.

As far as I am concerned, any signal is useful. One which tells me
which array I am in is especially useful. And I need to be told when I
leave. Essentially I want some kernel communication channel here.
Ioctls are fine (there is a subtle kernel deadlock involved in calling
an ioctl on a device above you from within, but I got round that once,
and I can do it again).

Thanks for the replies!

Peter
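P.S. For what it's worth, the "add it back" step of the user-space
loop you describe is already doable with the existing ioctls. A
minimal sketch, where HOT_ADD_DISK is the real md ioctl and all the
surrounding policy (noticing the ejection, deciding the device is
healthy again) is assumed to happen elsewhere:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/raid/md_u.h>      /* HOT_ADD_DISK */

    /* Re-add the component at dev_path to the running array at
     * md_path. */
    static int readd_component(const char *md_path, const char *dev_path)
    {
        struct stat st;
        int fd, err;

        if (stat(dev_path, &st) < 0 || !S_ISBLK(st.st_mode))
            return -1;                     /* not a block device */

        fd = open(md_path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* HOT_ADD_DISK takes the component's dev_t as its argument */
        err = ioctl(fd, HOT_ADD_DISK, (unsigned long)st.st_rdev);
        close(fd);
        return err;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3)
            return 2;   /* usage: readd <md-device> <component-device> */
        return readd_component(argv[1], argv[2]) ? 1 : 0;
    }

It's the other direction - the kernel telling the device which array
it is in, and when it leaves - that has no channel at present, and
that's the part I'm asking for.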