"Also sprach Molle Bestefich:" [Charset ISO-8859-1 unsupported, filtering to ASCII...] > Peter T. Breuer wrote: > > 1) I would like raid request retries to be done with exponential > > delays, so that we get a chance to overcome network brownouts. > > > > I presume the former will either not be objectionable > > You want to hurt performance for every single MD user out there, just There's no performance drop! Exponentially staged retries on failure are standard in all network protocols ... it is the appropriate reaction in general, since stuffing the pipe full of immediate retries doesn't allow the would-be successful transactions to even get a look in against that competition. > because things doesn't work optimally under enbd, which is after all a > rather rare use case compared to using MD on top of real disks. Strawman. > Uuuuh.. yeah, no objections there. > > Besides, it seems a rather pointless exercise to try and hide the fact > from MD that the device is gone, since it *is* in fact missing. Well, we don't really know that for sure. As you know, it is impossible to tell in general if the net has gone awol or is simply heavily overloaded (with retry requests). The retry on error is a good thing. I am simply suggesting that if the first retry also fails that we do some back off before trying again, since it is now likely (lacking more knowledge) that the device is having trouble and may well take some time to recover. I would suspect that an interval of 0 1 5 10 30 60s would be appropriate for retries. One can cycle that twice for luck before giving up for good, if you like. The general idea in such backoff protocols is that it avoids filling a fixed bandwidth channel with retries (the sum of a constant times 1 + 1/2 + 1/4 + .. is a finite proportion of the channel bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also there is an _additional_ assumption that the net is likely to have brownouts and so we _ought_ to retry at intervals since retrying immediately will definitely almost always do no good. > Seems > wrong at the least. There is no effect on the normal request path, and the effect is beneficial to successful requests by reducing the competing buildup of failed requests, when they do occur. In "normal " failures there is zero delay anyway. And further, the bitmap takes care of delayed responses in the normal course of events. > > 2) I would like some channel of communication to be available > > with raid that devices can use to say that they are > > OK and would they please be reinserted in the array. > > > > The latter is the RFC thing > > It would be reasonable for MD to know the difference between > - "device has (temporarily, perhaps) gone missing" and > - "device has physical errors when reading/writing blocks", I agree. The problem is that we can't really tell what's happening (even in the lower level device) across a net that is not responding. Enbd generally hides the problem for a short period of time, then gives up and advises md (if only it could nowadays - I mean with the fr1 patch) that the device is down, and then tells md when the device comes back, so that the bitmap can be discharged and the device be caught up. The problem is that at the moment the md layer has no way of being told that the device is OK again (and that it decides on its own account that the device is bad when it sends umpteen retries within a short period of time only to get them all rejected). 
> Seems wrong at the least.

There is no effect on the normal request path, and the effect on
successful requests is beneficial, since it reduces the competing
buildup of failed requests when failures do occur. In "normal"
failures there is zero delay anyway. And further, the bitmap takes
care of delayed responses in the normal course of events.

> > 2) I would like some channel of communication to be available
> > with raid that devices can use to say that they are
> > OK and would they please be reinserted in the array.
> >
> > The latter is the RFC thing
>
> It would be reasonable for MD to know the difference between
> - "device has (temporarily, perhaps) gone missing" and
> - "device has physical errors when reading/writing blocks",

I agree. The problem is that we can't really tell what's happening
(even in the lower level device) across a net that is not responding.
Enbd generally hides the problem for a short period of time, then
gives up and advises md (if only it could nowadays - I mean with the
fr1 patch) that the device is down, and then tells md when the device
comes back, so that the bitmap can be discharged and the device caught
up.

The problem is that at the moment the md layer has no way of being
told that the device is OK again (and that it decides on its own
account that the device is bad when it sends umpteen retries within a
short period of time only to get them all rejected).

> because if MD knew that, then it would be trivial to automatically
> hot-add the missing device once available again. Whereas the faulty
> one would need the administrator to get off his couch.

Yes. The idea is that across the net approximately ALL failures are
temporary ones, to a value of something like 99.99%. The cleaning lady
is usually dusting the on-off switch on the router.

> This would help in other areas too, like when a disk controller dies,
> or a cable comes (completely) loose.
>
> Even if the IDE drivers are not mature enough to tell us which kind of
> error it is, MD could still implement such a feature just to help
> enbd.
>
> I don't think a comm-channel is the right answer, though.
>
> I think the type=(missing/faulty) information should be embedded in
> the I/O error message from the block layer (enbd in your case)
> instead, to avoid race conditions and allow MD to take good decisions
> as early as possible.

That's a possibility. I certainly get two types of error back in the
enbd driver .. remote error or network error. Remote error is when we
are told by the other end that its disk has a problem. Network error
is when we hear nothing at all and time out. I can certainly pass that
on. Any suggestions?
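For instance (a sketch only - the enum and the names here are invented
for illustration, not existing enbd code), the two cases could be
completed with distinct error codes, so that the layer above could
treat one as "faulty" and the other as "missing":

#include <errno.h>

enum enbd_fail {
        ENBD_REMOTE_ERROR,  /* far end replied: its disk has a problem */
        ENBD_NET_ERROR,     /* heard nothing at all: timed out */
};

/* Map the failure mode to a distinct errno on request completion,
 * so md could distinguish a genuinely faulty device (expel it)
 * from a merely missing one (bitmap it and expect it back).
 */
static int enbd_errno(enum enbd_fail f)
{
        switch (f) {
        case ENBD_REMOTE_ERROR:
                return -EIO;        /* real medium error: faulty */
        case ENBD_NET_ERROR:
                return -ETIMEDOUT;  /* transient: probably just missing */
        }
        return -EIO;
}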
> The comm channel and "hey, I'm OK" message you propose doesn't seem
> that different from just hot-adding the disks from a shell script
> using 'mdadm'.

Talking through userspace has subtle deadlock problems. I wouldn't
rely on it in this kind of situation. Blocking a device can lead to a
file system being blocked and processes getting stalled for all kinds
of peripheral reasons, for example. I have seen file descriptor closes
getting blocked, to name the bizarre. I am pretty sure that removal
requests will be blocked while requests are outstanding.

Another problem is that enbd has to _know_ it is in a raid array, and
which one, in order to send the ioctl. That more or less requires that
the md array tell it. One could build this into the mdadm tool, but
one can't guarantee that everyone uses that (same) mdadm tool, so the
md driver gets nominated as the best place for the code that does
that.

> > When the device felt good (or ill) it notified the raid arrays it
> > knew it was in via another ioctl (really just hot-add or hot-remove),
> > and the raid layer would do the appropriate catchup (or start
> > bitmapping for it).
>
> No point in bitmapping. Since with the network down and all the
> devices underlying the RAID missing, there's nowhere to store data.
> Right?

Only one of the two devices in a two-device mirror is generally
networked. The standard scenario is two local disks per network node:
one is the mirror half of a remote raid, the other is the mirror half
of a local raid (whose other half is on the remote node). More
complicated setups can also be built - there are entire grids of such
nodes arranged in a torus, with local redundancy arranged in groups of
three neighbours, each with two local devices and one remote device.
Etc.

> Some more factual data about your setup would maybe be good..

It's not my setup! Invent your own :-).

> > all I can do is make the enbd device block on network timeouts.
> > But that's totally unsatisfactory, since real network outages then
> > cause permanent blocks on anything touching a file system
> > mounted remotely. People don't like that.
>
> If it's just this that you want to fix, you could write a DM module
> which returns I/O error if the request to the underlying device takes
> more than 10 seconds.

I'm not sure that another layer helps. I can time out requests myself
in 10s within enbd if I want to. The problem is that if I take ten
seconds over each one while the net is down, memory will fill with
backed-up requests. The first one that is failed (after 10s) then
triggers an immediate retry from md, which is also held for 10s. We'll
simply get huge pulses of failures - an entire backed-up memory's
worth at a time - spaced 10s apart. :-o

I'm pretty sure from reports that md would error the device offline
after a pulse like that. If it didn't, then enbd would anyway decide
after 30s or so that the remote end was down and take itself offline.
One or the other would cause md to expel it from the array. I could
try hot-add from enbd when the other end comes back, but we need to
know we are in an array (and which) in order to do that.

> Layer that module on top of the RAID, and make your enbd device block
> on network timeouts.

It shifts the problem to no avail, as far as I understand you, and my
understanding is likely faulty. Can you be more specific about how
this attacks the problem?

> Now the RAID array doesn't see missing disks on network outages, and

It wouldn't see them anyway when enbd is in normal mode - it blocks.
The problem is that that behaviour is really bad for user
satisfaction!

Enbd used instead to tell the md device that it was feeling ill, error
all requests, and allow md to chuck it out of the array. Then enbd
would tell the md device when it was feeling well again, and make md
reinsert it in the array. Md would catch up using the bitmap.

Right now, we can't really tell md we're feeling ill (that would be a
HOT_ARRRGH, but md doesn't have that). If we could, then md could
decide on its own account to murder all outstanding requests for us
and chuck us out, with the implicit understanding that we will come
back again soon and the bitmap can then catch us up. We can't do a
HOT_REMOVE while requests are outstanding, as far as I know.

> users get near-instant errors when the array isn't responsive due to a
> network outage.

I agree that the lower level device should report errors quickly up to
md. The problem is that that leads to it being chucked out
unceremoniously, for ever and a day ..

1) md shouldn't chuck us out for a few errors - nets are like that
2) we should be able to chuck ourselves out when we feel the net is weak
3) we should be able to chuck ourselves back in when we feel better
4) for that to happen, we need to have been told by md when we are in
   an array, and which

I simply proposed that (1) has the easy solution of md doing retries
with exponential backoff for a while, instead of chucking us out. The
rest needs discussion. Maybe it can be done in userspace, but be
advised that I think that is remarkably tricky! In particular, it's
almost impossible to test adequately ... which alone would make me aim
for an embedded solution (i.e. driver code).

Peter
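P.S. For concreteness, here is roughly the shape of (2) and (3),
written as ordinary userspace C against the existing HOT_ADD_DISK and
HOT_REMOVE_DISK ioctls. This is a sketch only, under the assumption
that enbd has somehow been told which array it sits in - which is
exactly the missing piece (4) - and header paths may need adjusting on
your system. Recall also that the HOT_REMOVE will fail while requests
are outstanding, which is why I want md's cooperation rather than
doing it all from the outside:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/sysmacros.h>    /* makedev() */
#include <linux/raid/md_u.h>  /* HOT_ADD_DISK, HOT_REMOVE_DISK */

/* Feeling ill: ask md to drop us; md starts bitmapping for us. */
static int enbd_go_offline(const char *md_path, unsigned maj, unsigned min)
{
        int fd = open(md_path, O_RDWR);
        if (fd < 0)
                return -1;
        int rc = ioctl(fd, HOT_REMOVE_DISK,
                       (unsigned long)makedev(maj, min));
        close(fd);
        return rc;
}

/* Feeling well again: ask md to re-add us; the bitmap catches us up. */
static int enbd_go_online(const char *md_path, unsigned maj, unsigned min)
{
        int fd = open(md_path, O_RDWR);
        if (fd < 0)
                return -1;
        int rc = ioctl(fd, HOT_ADD_DISK,
                       (unsigned long)makedev(maj, min));
        close(fd);
        return rc;
}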