Re: dm-multipath - IO queue dispatch based on FPIN Congestion/Latency notifications.

Hello Hannes,

Thanks for responding.

On Wed, 2021-03-31 at 09:25 +0200, Hannes Reinecke wrote:
Hi Erwin,

On 3/31/21 2:22 AM, Erwin van Londen wrote:
Hello Muneendra, Benjamin,

The FPIN notifications that have been developed cover a whole plethora of
options and do not merely flag paths as being in a marginal state. The MPIO
layer could utilise the various triggers, like congestion and latency, rather
than using a marginal state as the only decision point. If a path is somewhat
congested, the number of I/Os dispersed over that path could simply be
reduced by a flexible margin, depending on how often, and which, FPINs are
actually received. If, for instance, an FPIN is received indicating that an
upstream port is throwing physical errors, you may exclude that path entirely
from queueing I/Os. If it is a latency-related problem where credit shortages
come into play, you may just need to queue very small I/Os to it; the SCSI
CDB will tell you the size of the I/O. Congestion notifications may be used
to add an artificial delay, reducing the workload on these paths and
scheduling I/Os on others.
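
To make that concrete: a rough sketch of such a per-FPIN-type policy could
look like the code below. The ELS_DTAG_* values mirror what Linux declares
in include/uapi/scsi/fc/fc_els.h; the path_action values and
apply_fpin_policy() are purely hypothetical, not existing dm-multipath code.

/*
 * Hypothetical sketch of a per-FPIN-type path policy. The ELS_DTAG_*
 * values mirror include/uapi/scsi/fc/fc_els.h; everything else is
 * made up for illustration.
 */
#include <stdint.h>
#include <stdio.h>

#define ELS_DTAG_LNK_INTEGRITY  0x00020001  /* physical errors on a port */
#define ELS_DTAG_DELIVERY       0x00020002  /* frames being discarded    */
#define ELS_DTAG_PEER_CONGEST   0x00020003  /* peer port is congested    */
#define ELS_DTAG_CONGESTION     0x00020004  /* congestion on this link   */

enum path_action {
    PATH_EXCLUDE,    /* stop queueing I/O to this path       */
    PATH_SMALL_IO,   /* only queue small transfers           */
    PATH_THROTTLE,   /* insert an artificial dispatch delay  */
    PATH_UNCHANGED,
};

static enum path_action apply_fpin_policy(uint32_t desc_tag)
{
    switch (desc_tag) {
    case ELS_DTAG_LNK_INTEGRITY:
        return PATH_EXCLUDE;   /* physical errors: avoid the path  */
    case ELS_DTAG_DELIVERY:
    case ELS_DTAG_PEER_CONGEST:
        return PATH_SMALL_IO;  /* credit shortage: small I/Os only */
    case ELS_DTAG_CONGESTION:
        return PATH_THROTTLE;  /* congested: slow down dispatch    */
    default:
        return PATH_UNCHANGED;
    }
}

int main(void)
{
    printf("link integrity FPIN -> action %d\n",
           apply_fpin_policy(ELS_DTAG_LNK_INTEGRITY));
    return 0;
}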

As correctly noted, FPINs come with a variety of options.
And I'm not certain we can handle everything correctly; a degraded path is
simple, but for congestion there is only _so_ much we can do.
The typical cause for congestion is, say, a 32G host port talking to a
16G (or even 8G) target port _and_ a 32G target port.
Congestion can also be caused by a change in workload characteristics where, for example, read and write workloads start interfering with each other. The funnel principle would not apply in that case.

So the host cannot 'tune down' its link to 8G; doing so would impact
performance on the 32G target port. (And we would suffer reverse congestion
whenever that target port sends frames.)

And throttling things at the SCSI layer only helps _so_ much, as the
real congestion is due to the speed with which the frames are sequenced
onto the wire, which is not something we can control from the OS.
If you can interleave I/Os with an artificial delay, depending on the type and frequency with which these FPINs arrive, you would be able to prevent latency buildup in the SAN.
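
A very rough sketch of what such an interleave could be based on, scaling a
dispatch delay with how frequently FPINs arrive; all names and constants are
invented for illustration, nothing like this exists in dm-multipath today:

/*
 * Illustrative throttle: the more frequently FPINs arrive on a path,
 * the longer the artificial delay inserted between I/O dispatches.
 * All names and constants are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_DELAY_US 2000ULL  /* cap, so a path is never stalled outright */

struct fpin_rate {
    uint64_t last_fpin_us;  /* timestamp of the previous FPIN   */
    uint64_t avg_gap_us;    /* smoothed FPIN inter-arrival time */
};

/* Exponentially smooth the gap between successive FPINs (alpha = 1/8). */
static void fpin_seen(struct fpin_rate *r, uint64_t now_us)
{
    uint64_t gap = now_us - r->last_fpin_us;

    r->avg_gap_us = r->avg_gap_us - (r->avg_gap_us >> 3) + (gap >> 3);
    r->last_fpin_us = now_us;
}

/* Small average gap (frequent FPINs) -> delay approaches the cap;
 * rare FPINs -> delay shrinks toward zero. */
static uint64_t dispatch_delay_us(const struct fpin_rate *r)
{
    return (MAX_DELAY_US * 1000000ULL) / (r->avg_gap_us + 1000000ULL);
}

int main(void)
{
    struct fpin_rate r = { .last_fpin_us = 0, .avg_gap_us = 1000000 };

    fpin_seen(&r, 100000);  /* an FPIN 100 ms after the previous one */
    printf("artificial delay: %llu us\n",
           (unsigned long long)dispatch_delay_us(&r));
    return 0;
}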

From another POV this is arguably a fabric mis-design; so it _could_ be
alleviated by separating out the ports with lower speeds into their own
zone (or even onto a separate SAN); that would trivially make the
congestion go away.
The entire FPIN concept was designed to provide clients with the option to respond and react to changing behaviour in SANs. A mis-design is often not really the cause; ongoing changes and continuous provisioning are usually the main contributors.

But for that the admin first should be _alerted_, and this really is my
primary goal: having FPINs show up in the message log, to alert the
admin that his fabric is not performing well.
I think the FC drivers already have facilities to do that, or they will have them shortly. dm-multipath is not really required to handle the notifications itself, but it would be useful if actions were taken based on FPINs.
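
For reference, classifying a notification is straightforward: an FPIN ELS
payload (command 0x16) is an 8-byte header followed by a list of TLV
descriptors. A minimal user-space walk might look like the following; the
tag values mirror include/uapi/scsi/fc/fc_els.h, while the sample payload
and the printing are fabricated for illustration.

/*
 * Minimal walk over the TLV descriptors of an FPIN ELS payload.
 * The ELS_DTAG_* values mirror include/uapi/scsi/fc/fc_els.h;
 * the example payload is fabricated.
 */
#include <arpa/inet.h>  /* ntohl() */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ELS_FPIN                0x16
#define ELS_DTAG_LNK_INTEGRITY  0x00020001
#define ELS_DTAG_DELIVERY       0x00020002
#define ELS_DTAG_PEER_CONGEST   0x00020003
#define ELS_DTAG_CONGESTION     0x00020004

static const char *dtag_name(uint32_t tag)
{
    switch (tag) {
    case ELS_DTAG_LNK_INTEGRITY: return "Link Integrity";
    case ELS_DTAG_DELIVERY:      return "Delivery";
    case ELS_DTAG_PEER_CONGEST:  return "Peer Congestion";
    case ELS_DTAG_CONGESTION:    return "Congestion";
    default:                     return "unknown";
    }
}

static void walk_fpin(const uint8_t *buf, uint32_t len)
{
    uint32_t off = 8;  /* skip fpin_cmd, 3 reserved bytes, list length */

    while (off + 8 <= len) {
        uint32_t tag, dlen;

        memcpy(&tag, buf + off, 4);
        memcpy(&dlen, buf + off + 4, 4);
        tag = ntohl(tag);
        dlen = ntohl(dlen);  /* excludes the tag and length fields */
        printf("FPIN descriptor: %s (0x%08x), %u byte value\n",
               dtag_name(tag), tag, dlen);
        off += 8 + dlen;
    }
}

int main(void)
{
    /* Fabricated payload: one Congestion descriptor, 4-byte value. */
    const uint8_t fpin[] = {
        ELS_FPIN, 0x00, 0x00, 0x00,  /* command + reserved      */
        0x00, 0x00, 0x00, 0x0c,      /* descriptor list length  */
        0x00, 0x02, 0x00, 0x04,      /* ELS_DTAG_CONGESTION     */
        0x00, 0x00, 0x00, 0x04,      /* descriptor value length */
        0x00, 0x00, 0x00, 0x00,      /* descriptor value        */
    };

    walk_fpin(fpin, sizeof(fpin));
    return 0;
}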

A second step will be massaging FPINs into DM multipath, and having them
influence the path priority or path status. But it is currently under
discussion how this could best be integrated.
OK
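
For what it's worth, multipath-tools already has tunables in this area: the
marginal path settings in multipath.conf let multipathd test and deprioritise
flaky paths before reinstating them. The option names below exist in current
multipath-tools (marginal_pathgroups since 0.8.4, if I remember correctly);
the values are only illustrative:

defaults {
        # Declare a path marginal if it fails twice within 60 seconds...
        marginal_path_double_failed_time   60
        # ...then sample its I/O for 120 seconds and only reinstate it
        # if it shows fewer than 10 errors per 1000 I/Os.
        marginal_path_err_sample_time      120
        marginal_path_err_rate_threshold   10
        # Wait 600 seconds before re-checking a path that failed the test.
        marginal_path_err_recheck_gap_time 600
        # Keep marginal paths in a lower-priority path group instead of
        # failing them outright.
        marginal_pathgroups                yes
}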

Not really sure what the possibilities are from a DM-Multipath
viewpoint, but I feel that if the OS options are not properly aligned with
what the FC protocol and the HBA drivers are able to provide, we may miss a
good opportunity to optimise the dispersion of I/Os and improve overall
performance.

Looking at the size of the commands is one possibility, but at this time
that presumes too much about how we _think_ FPINs will be generated.
I'd rather do some more tests to figure out under which circumstances we
can expect which type of FPIN, and then start looking at how to
integrate them.
The FC protocol only describes the framework, not the values that need to be adhered to; that depends on the end devices and their capabilities.

Cheers,

Hannes
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel
