Re: remark and RFC

Peter T. Breuer wrote:
> You want to hurt performance for every single MD user out there, just

There's no performance drop!  Exponentially staged retries on failure
are standard in all network protocols ...  it is the appropriate
reaction in general, since stuffing the pipe full of immediate retries
doesn't allow the would-be successful transactions to even get a look in
against that competition.

That's assuming that there even is a pipe, which is something specific
to ENBD / networked block devices, not something that the MD driver
should in general care about.


> because things don't work optimally under enbd, which is after all a
> rather rare use case compared to using MD on top of real disks.

Strawman.

Quah?


> Besides, it seems a rather pointless exercise to try and hide the fact
> from MD that the device is gone, since it *is* in fact missing.

Well, we don't really know that for sure.  As you know, it is
impossible to tell in general if the net has gone awol or is simply
heavily overloaded (with retry requests).

From MD's point of view, if we're unable to complete a request to the
device, then it's either missing or faulty.  If a call to the device
blocks, then it's just very slow.

I don't think it's wise to pollute these simple mechanics with a
"maybe it's sort-of failing due to a network outage, which might just
be a brownout" scenario.  Better to solve the problem in a more
appropriate place, somewhere that knows we're simulating a block
device over a network connection.

Not introducing network-block-device-aware code in MD is a good way to
avoid wrong code paths and weird behaviour for real-block-device
users.

"Missing" vs. "Faulty" is OTOH a pretty simple interface, which maps
fine to both real disks and NBDs.


The retry on error is a good thing.  I am simply suggesting that if the
first retry also fails, we do some backoff before trying again,
since it is now likely (lacking more knowledge) that the device is
having trouble and may well take some time to recover.  I would suspect
that intervals of 0, 1, 5, 10, 30, 60 s would be appropriate for the
retries.

Only for networked block devices.

Not for real disks; there you would just be causing unbearable delays
for users for no good reason whenever that code path is taken.


One can cycle that twice for luck before giving up for good, if you
like.  The general idea in such backoff protocols is that it avoids
filling a fixed-bandwidth channel with retries (the sum of a constant
times 1 + 1/2 + 1/4 + ...  is a finite proportion of the channel
bandwidth, but the sum of 1+1+1+1+1+...  is unbounded), but here there
is also an _additional_ assumption that the net is likely to have
brownouts, and so we _ought_ to retry at intervals, since retrying
immediately will almost certainly do no good.

Since the knowledge that the block device is on a network resides in
ENBD, I think the most reasonable thing would be to implement the
backoff in ENBD.  It should be relatively simple to catch MD retries
in ENBD and block for 0, 1, 5, 10, 30, 60 seconds.  That would keep
the network backoff algorithm in the more appropriate place, namely
the place that knows the device is on a network.
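
Just to make that schedule concrete, here is a rough userspace sketch
of such a staged retry loop (untested; in the real thing the delays
would of course live inside the driver, not in a script):

    #!/bin/sh
    # Retry "$@" on the 0/1/5/10/30/60s schedule discussed above;
    # "$@" stands in for whatever the retried request actually is.
    for delay in 0 1 5 10 30 60; do
        sleep "$delay"
        "$@" && exit 0       # success, stop retrying
    done
    exit 1                   # still failing after the last interval

Run the loop twice if you want the "twice for luck" variant.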


In "normal " failures there is zero delay anyway.

Since the first retry would succeed, or?
I'm not sure what this "normal" failure is, btw.

And further, the bitmap takes care of delayed
responses in the normal course of events.

Mebbe.  Does it?


> It would be reasonable for MD to know the difference between
>  - "device has (temporarily, perhaps) gone missing" and
>  - "device has physical errors when reading/writing blocks",

I agree. The problem is that we can't really tell what's happening
(even in the lower level device) across a net that is not responding.

In the case where requests can't be delivered over the network (or a
SATA cable, whatever), it's a clear case of "missing device".


> because if MD knew that, then it would be trivial to automatically
> hot-add the missing device once available again.  Whereas the faulty
> one would need the administrator to get off his couch.

Yes. The idea is that across the net approximately ALL failures are
temporary ones, to a value of something like 99.99%.  The cleaning lady
is usually dusting the on-off switch on the router.

> This would help in other areas too, like when a disk controller dies,
> or a cable comes (completely) loose.
>
> Even if the IDE drivers are not mature enough to tell us which kind of
> error it is, MD could still implement such a feature just to help
> enbd.
>
> I don't think a comm-channel is the right answer, though.
>
> I think the type=(missing/faulty) information should be embedded in
> the I/O error message from the block layer (enbd in your case)
> instead, to avoid race conditions and allow MD to take good decisions
> as early as possible.

That's a possibility.  I certainly get two types of error back in the
enbd driver: remote error or network error.  Remote error is when we
get told by the other end that the disk has a problem.  Network error
is when we hear nothing and have a timeout.

I can certainly pass that on. Any suggestions?

Let's hear from Neil what he thinks.


> The comm channel and "hey, I'm OK" message you propose doesn't seem
> that different from just hot-adding the disks from a shell script
> using 'mdadm'.

[snip speculations on possible blocking calls]

You could always try and see.
Should be easy to simulate a network outage.

I am pretty sure that removal requests will be blocked when
requests are outstanding.

That in particular should not be a big problem, since MD already kicks
the device for you, right?  A script would only have to hot-add the
device once it's available again.


Another problem is that enbd has to _know_ it is in a raid array, and
which one, in order to send the ioctl.  That leads one to more or less
require that the md array tell it.  One could build this into the mdadm
tool, but one can't guarantee that everyone uses that (same) mdadm tool,
so the md driver gets nominated as the best place for the code that
does that.

It's already in mdadm.

You can only usefully query one way (array --> device):
# mdadm -D /dev/md0 | grep -A100 -E '^    Number'

   Number   Major   Minor   RaidDevice State
      0     253        0        0      active sync   /dev/mapper/sda1
      1     253        1        1      active sync   /dev/mapper/sdb1

That should provide you with enough information though, since devices
stay in that table even after they've gone missing.  (I'm not sure
what happens when a spare takes over a place, though - test needed.)

The optimal thing would be to query the other way, of course.  ENBD
should be able to tell a hotplug shell script (or whatever) about the
name of the device that's just come back.
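
E.g. given that device name, a script could brute-force the
array --> device table from above (untested; it naively assumes the
arrays all appear as /dev/md* and that enbd can hand over its own
device name):

    #!/bin/sh
    # $1 is the device name handed over by enbd
    dev=$1
    for md in /dev/md*; do
        # the component path is the last field of each -D table row
        mdadm -D "$md" 2>/dev/null | grep -q "$dev\$" && echo "$md"
    done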

And you *can* in fact query the other way too, but you won't get a
useful Array UUID or device-name-of-assembled-array out of it:

# mdadm -E /dev/mapper/sda2
[snip blah, no array information :-(]

Expanding -E output to include the Array UUID would be a good feature
in any case.

Expanding -E output to also report which currently running array
device has the corresponding Array UUID would be neat, but I'm sure
that most users would probably misunderstand what this means :-).


Only one of two devices in a two-device mirror is generally networked.

Makes sense.

The standard scenario is two local disks per network node.  One is a
mirror half for a remote raid,

A local cache of sorts?

the other is the mirror half for a local raid
(which has a remote other half on the remote node).

A remote backup of sorts?

More complicated setups can also be built - there are entire grids of
such nodes arranged in a torus, with local redundancy arranged in
groups of three neighbours, each with two local devices and one remote
device. Etc.

Neat ;-).


> > all I can do is make the enbd device block on network timeouts.
> > But that's totally unsatisfactory, since real network outages then
> > cause permanent blocks on anything touching a file system
> > mounted remotely.  People don't like that.
>
> If it's just this that you want to fix, you could write a DM module
> which returns I/O error if the request to the underlying device takes
> more than 10 seconds.

I'm not sure that another layer helps. I can timeout requests myself in
10s within enbd if I want to.

Yeah, okay.
I suggested that further up, but I guess you thought of it before I did :-).


The problem is that if I take ten seconds for each one when the net is
down, memory will fill with backed-up requests.  The first one that is
failed (after 10s) then triggers an immediate retry from md, which
also gets held for 10s.  We'll simply get huge pulses of failures of
the entire backed-up memory, spaced 10s apart.  I'm pretty sure from
reports that md would error the device offline after a pulse like
that.

I don't see where these "huge pulses" come into the picture.

If you block one MD request for 10 seconds, surely there won't be
another before you return an answer to that one?

If it doesn't, then enbd would anyway decide after 30s or so that the
remote end was down and take itself offline.
One or the other would cause md to expel it from the array.  I could
try a hot-add from enbd when the other end comes back, but we need to
know we are in an array (and which one) in order to do that.

I think that's possible using mdadm at least.


> Layer that module on top of the RAID, and make your enbd
> device block on network timeouts.

It shifts the problem to no avail, as far as I understand you, and my
understanding is likely faulty.  Can you be more specific about how this
attacks the problem?

Never was much of a good explainer...

I was under the impression that you wanted an error message to be
propagated quickly to userspace / users, but the MD array to just be
silently paused, whenever a network outage occurred.

Since you've mentioned that there are actually local disk components
in the RAID arrays, I imagine you would want the array NOT to be
paused, since it could reasonably continue operating on one device.
So just forget about that proposal, it won't work in this situation :-).

I guess what will work is either:

A)

Network outage -->
ENBD fails disk -->
MD drops disk -->
Network comes back -->
ENBD brings disk back up -->
Something kicks off /etc/hotplug.d/block-hotplug script -->
Script queries all RAID devices and finds where the disk fits -->
Script hot-adds the disk

Or:

B)

Network outage -->
ENBD fails disk, I/O error type "link error" -->
MD sets disk status to "temporarily missing" -->
Network comes back -->
ENBD brings disk back up -->
MD sees a block device arrival, reintegrates the disk into array


I think the latter is better, because:
* No one has to maintain husky shell scripts
* It sends a nice message to the SATA/PATA/SCSI people that MD would
really like to know whether it's a disk or a link problem.

But then again, shell scripts _are_ the preferred Linux solution to...
Everything.
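
FWIW, the scenario-A script could be almost as dumb as the lookup loop
from further up plus one hot-add (untested, and it leans on the -D
table still listing the device after it went missing, as noted above):

    #!/bin/sh
    # Called when a block device (re)appears; $1 is its name.
    dev=$1
    for md in /dev/md*; do
        if mdadm -D "$md" 2>/dev/null | grep -q "$dev\$"; then
            mdadm "$md" --add "$dev"    # the bitmap handles the catch-up
            break
        fi
    done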


Enbd instead used to tell the md device that it was feeling ill and
error all requests, allowing md to chuck it out of the array.  Then
enbd would tell the md device when it was feeling well again, and make
md reinsert it in the array.  Md would catch up using the bitmap.

Right now, we can't really tell md we're feeling ill (that would be a
HOT_ARRRGH, but md doesn't have that). If we could, then md could
decide on its own to murder all outstanding requests for us and
chuck us out, with the implicit understanding that we will come back
again soon and then the bitmap can catch us up.

We can't do a HOT_REMOVE while requests are outstanding, as far as I
know.

MD should be fixed so HOT_REMOVE won't fail but will just kick the
disk, even if it happens to be blocking on I/O calls.

(If there really is a reason not to kick it, then at least a
HOT_REMOVE_FORCE should be added..)