Re: RFC: multipath IO multiplex

Wouldn't it be practical to bypass MPIO completely and submit your IO to the paths directly instead?
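
Something along these lines, say (a rough userspace sketch; /dev/dm-0, the
one-sector slot at offset 0, and the "fence" message are assumptions for
illustration, but the slaves/ directory in sysfs does list the underlying
path devices of a dm map):

#define _GNU_SOURCE             /* for O_DIRECT */
/* Sketch: send one sector down every path of a dm device directly,
 * bypassing the multipath target entirely. */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        DIR *slaves = opendir("/sys/block/dm-0/slaves");
        struct dirent *d;
        char dev[300];
        void *buf;

        /* O_DIRECT needs a sector-aligned buffer. */
        if (!slaves || posix_memalign(&buf, 512, 512))
                return 1;
        memset(buf, 0, 512);
        strcpy(buf, "fence");                   /* the poison pill */

        while ((d = readdir(slaves)) != NULL) {
                if (d->d_name[0] == '.')
                        continue;
                snprintf(dev, sizeof(dev), "/dev/%s", d->d_name);
                int fd = open(dev, O_WRONLY | O_DIRECT | O_SYNC);
                if (fd < 0) {
                        perror(dev);            /* dead path: fails fast */
                        continue;
                }
                if (pwrite(fd, buf, 512, 0) != 512)
                        perror(dev);
                close(fd);
        }
        closedir(slaves);
        return 0;
}

A failed path then costs you one quick error rather than the full MPIO
failover timeout.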

Cheers,
cvaroqui

----- Original Message -----
> On 2010-11-06T11:51:02, Alasdair G Kergon <agk@xxxxxxxxxx> wrote:
>
> Hi Neil, Alasdair,
>
> thanks for the feedback. Answering your points in reverse order -
>
> > Might it make sense to configure a range of the device where writes
> > always went down all paths?  That would seem to fit with your
> > problem description and might be easiest?
> > Indeed - a persistent property of the device (even another interface
> > with a different minor number) not the I/O.
>
> I'm not so sure that would be required though. The equivalent of our
> "mkfs" tool wouldn't need this. Also, typically, this would be a
> partition (kpartx) on top of a regular MPIO mapping (that we want to be
> managed by multipathd).
>
> Handling this completely differently would complicate setup, no?
>
> > And what is the nature of the data being written, given that I/O to
> > one path might get delayed and arrive long after it was sent,
> > overwriting data sent later.  Successful stale writes will always be
> > recognised as such by readers - how?
>
> The very particular use case I am thinking of is the "poison pill" for
> node-level fencing. Nodes constantly monitor their slot (using direct
> IO, bypassing all caching, etc.), and either read it successfully or
> commit suicide (assisted by a hardware watchdog to protect against
> stalls).
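>
> (The watcher side boils down to something like this minimal sketch;
> the device name, slot index, message format, and 1s interval are
> made-up illustrations, and the watchdog handling itself is left out:)
>
> #define _GNU_SOURCE             /* for O_DIRECT */
> /* Monitor our own slot via direct IO; suicide on a poison pill or
>  * on a failed read. */
> #include <fcntl.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/reboot.h>
>
> #define SLOT_SIZE 512
> #define MY_SLOT   3                             /* assumed slot index */
>
> int main(void)
> {
>         void *buf;
>         int fd = open("/dev/mapper/mpatha", O_RDONLY | O_DIRECT);
>
>         if (fd < 0 || posix_memalign(&buf, SLOT_SIZE, SLOT_SIZE))
>                 return 1;
>
>         for (;;) {
>                 if (pread(fd, buf, SLOT_SIZE,
>                           (off_t)MY_SLOT * SLOT_SIZE) != SLOT_SIZE)
>                         reboot(RB_AUTOBOOT);    /* read error: self-fence */
>                 if (!memcmp(buf, "fence", 5))
>                         reboot(RB_AUTOBOOT);    /* poison pill: suicide */
>                 sleep(1);       /* the watchdog would be petted here; a
>                                    pread that stalls is its job to catch */
>         }
> }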
>
> The writer knows that, once the message has been successfully written,
> the target node will either have read it (and committed suicide), or
> been self-fenced because of a timeout/read error.
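>
> (In numbers, for illustration only: with, say, a 1s read loop and a
> 5s watchdog timeout on the target, the writer has to wait roughly
> loop + watchdog = 6s after a successful write before it may treat the
> node as fenced, and any MPIO failover timeouts push back the point at
> which the write counts as successfully written in the first place.)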
>
> Allowing for the additional timeouts incurred by MPIO here really slows
> this mechanism down to the point of being unusable.
>
> Now, even if a write were delayed (which is not very likely; it's more
> likely that some of the IO would simply fail if one of the paths did go
> down, and this would not be resubmitted to other paths), the worst that
> could happen would be a double fence. That would require the write to
> land after the node has cycled once and cleared its message slot, which
> already implies a significant delay, since servers take a while to boot.
>
> For the 'heartbeat' mechanism and others (if/when we get around to
> adding them), we could ignore the exact contents that have been written
> and just watch for changes; at worst, node death detection will take a
> bit longer.
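>
> (Continuing the watcher sketch above, with cur/last/PEER_SLOT etc. as
> hypothetical names: keep the previous slot contents around and treat
> any change as proof of life,
>
>         /* any change in the peer's slot counts as a heartbeat */
>         pread(fd, cur, SLOT_SIZE, (off_t)PEER_SLOT * SLOT_SIZE);
>         if (memcmp(cur, last, SLOT_SIZE)) {
>                 last_seen = time(NULL);         /* peer is alive */
>                 memcpy(last, cur, SLOT_SIZE);
>         }
>
> so the exact bytes written never matter.)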
>
> Basically, what we need to get around is the possible IO latency in
> MPIO, for things like poison pill fencing ("storage-based death") or
> qdisk-style plugins. I'm open to other suggestions as well.
>
>
>
> Regards,
>        Lars
>
> --
> Architect Storage/HA, OPS Engineering, Novell, Inc.
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
