Hi Martin,

Below are my replies.

> If there was any discussion, I haven't been involved :-)
> I haven't looked into FPIN much so far. I'm rather sceptical about its
> usefulness for dm-multipath. Being a property of FC-2, FPIN works at least
> 2 layers below dm-multipath. dm-multipath is agnostic to protocol and
> transport properties by design. User space multipathd can cross these
> layers and tune dm-multipath based on lower-level properties, but such
> actions have rather large latencies.
>
> As you know, dm-multipath has 3 switches for routing IO via different paths:
>  1) priority groups,
>  2) path status (good / failed),
>  3) path selector algorithm.
> 1) and 2) are controlled by user space, and have high latency.
> The current "marginal" concept in multipathd watches paths for repeated
> failures, and configures the kernel to avoid using paths that are
> considered marginal, using methods 1) and 2). This is a very-high-latency
> algorithm that changes state on the time scale of minutes.
> There is no concept for "delaying" or "pausing" IO on paths on a short
> time scale.
> The only low-latency mechanism is 3). But it's block level; no existing
> selector looks at transport-level properties.
>
> That said, I can quite well imagine a feedback mechanism based on
> throttling or delays applied in the FC drivers. For example, if a remote
> port was throttled by the driver in response to FPIN messages, its
> bandwidth would decrease, and a path selector like "service-time" would
> automatically assign less IO to such paths. This wouldn't need any changes
> in dm-multipath or multipath-tools; it would work entirely on the FC level.

[Muneendra] Agreed.

> Talking about improving the current "marginal" algorithm in multipathd,
> and knowing that it's slow, FPIN might provide additional data that would
> be good to have. Currently, multipathd only has 2 inputs: "good<->bad"
> state transitions based either on kernel I/O errors or path checker
> results, and failure statistics from multipathd's internal "io_err_stat"
> thread, which only reads sector 0. This could obviously be improved, but
> there may actually be lower-hanging fruit than evaluating FPIN
> notifications (for example, I've pondered utilizing the kernel's blktrace
> functionality to detect unusually long IO latencies or bandwidth drops).
>
> Talking about FPIN, is it planned to notify user space about such fabric
> events, and if yes, how?

[Muneendra] Yes. FC drivers, when receiving FC FPIN ELSes, call a SCSI
transport routine with the FPIN payload. The transport pushes this out as an
"event" via netlink. An application bound to the local address used by the
SCSI transport can receive the event and parse it.

Benjamin has added a marginal_path group (multipath marginal pathgroups) in
dm-multipath:
https://patchwork.kernel.org/project/dm-devel/cover/1564763622-31752-1-git-send-email-bmarzins@xxxxxxxxxx/

One of the intentions of Benjamin's patch (support for marginal paths) is to
support the FPIN events we receive from the fabric. On receiving an FPIN-LI,
our intention is to place all the affected paths into the marginal path
group.

Below are the 4 types of descriptors returned in an FPIN (a rough user-space
sketch of receiving and classifying them follows after the list):

• Link Integrity (LI): some error on a link that affected frames; this is
  the main one for a "flaky path".
• Delivery Notification (DN): something explicitly knew about a dropped
  frame and is reporting it. Usually, things like a CRC error mean you can't
  trust the frame header, so that is reported as an LI error. But if you do
  have a valid frame and drop it, for example because of a fabric edge timer
  (don't queue it for more than 250-600 ms), then it becomes a DN. Could be
  a flaky path, but not necessarily.
• Congestion (CN): the fabric is saying it's congested sending to "your"
  port. If a host receives it, the fabric is saying it has more frames for
  the host than the host is pulling in, so the fabric is backing up. What
  should happen is that the load generated by the host is lowered - but this
  is across all targets, and not all targets are necessarily in the mpio
  path list.
• Peer Congestion (PCN): this goes along with CN in that the fabric is now
  telling the other devices in the zone that send traffic to the congested
  port that this port is backing up, so the idea is that these peers send
  less load to the congested port. It shouldn't really tie into mpio. Some
  of the current thinking is that targets could see this and reduce their
  transmission rate to a host down to the link speed of the host.
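To make the netlink delivery and the descriptor types above a bit more
concrete, here is a hypothetical, minimal user-space sketch (explicitly not
the actual multipathd code): it binds a NETLINK_SCSITRANSPORT socket,
receives the FC transport events, and classifies FPIN descriptors by their
tag. The SCSI_TRANSPORT_MSG value, the fc_nl_event layout and the ELS_DTAG_*
values are assumptions meant to mirror the kernel's uapi scsi_netlink.h,
scsi_netlink_fc.h and fc_els.h (FC-LS-5 descriptor tags); please verify them
against the running kernel before building on this.

/* Hypothetical sketch of a user-space FPIN consumer, not the actual multipathd
 * code. Struct layout and constants are assumed to mirror the kernel's uapi
 * scsi_netlink(_fc).h and fc_els.h headers; verify before relying on them. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>          /* ntohl() */
#include <sys/socket.h>
#include <linux/netlink.h>      /* NETLINK_SCSITRANSPORT, nlmsghdr, NLMSG_* */

#define SCSI_TRANSPORT_MSG      (NLMSG_MIN_TYPE + 1)    /* assumed value */

/* FPIN descriptor tags per FC-LS-5 (kernel: enum fc_ls_tlv_dtag) */
#define ELS_DTAG_LNK_INTEGRITY  0x00020001      /* LI  */
#define ELS_DTAG_DELIVERY       0x00020002      /* DN  */
#define ELS_DTAG_PEER_CONGEST   0x00020003      /* PCN */
#define ELS_DTAG_CONGESTION     0x00020004      /* CN  */

/* Mirrors struct fc_nl_event from scsi_netlink_fc.h (assumed layout) */
struct fc_nl_event {
        uint8_t  snlh[8];               /* struct scsi_nl_hdr */
        uint64_t seconds;
        uint64_t vendor_id;
        uint16_t host_no;
        uint16_t event_datalen;         /* payload length in bytes */
        uint32_t event_num;
        uint32_t event_code;            /* e.g. the FPIN link event code */
        uint8_t  event_data[];          /* raw FPIN ELS frame for FPIN events */
} __attribute__((aligned(sizeof(uint64_t))));

static void classify_fpin(const uint8_t *els, size_t len)
{
        /* FPIN ELS (big-endian): ELS command word, descriptor list length,
         * then descriptors, each led by a 4-byte tag and 4-byte value length. */
        size_t off = 8;

        while (off + 8 <= len) {
                uint32_t tag, dlen;

                memcpy(&tag, els + off, sizeof(tag));
                memcpy(&dlen, els + off + 4, sizeof(dlen));
                tag = ntohl(tag);
                dlen = ntohl(dlen);
                if (dlen > len || off + 8 + dlen > len)
                        break;          /* truncated descriptor */

                switch (tag) {
                case ELS_DTAG_LNK_INTEGRITY:
                        printf("FPIN-LI: link integrity -> marginal path candidate\n");
                        break;
                case ELS_DTAG_DELIVERY:
                        printf("FPIN-DN: delivery notification (dropped frame)\n");
                        break;
                case ELS_DTAG_CONGESTION:
                        printf("FPIN-CN: fabric congestion toward this port\n");
                        break;
                case ELS_DTAG_PEER_CONGEST:
                        printf("FPIN-PCN: peer port congestion\n");
                        break;
                default:
                        printf("FPIN: unknown descriptor tag 0x%08x\n", (unsigned)tag);
                }
                off += 8 + dlen;        /* advance to the next descriptor */
        }
}

int main(void)
{
        struct sockaddr_nl local = {
                .nl_family = AF_NETLINK,
                .nl_groups = ~0U,       /* join all SCSI transport groups */
        };
        uint8_t buf[8192] __attribute__((aligned(NLMSG_ALIGNTO)));
        int fd, len;

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SCSITRANSPORT);
        if (fd < 0 || bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
                return 1;

        while ((len = (int)recv(fd, buf, sizeof(buf), 0)) > 0) {
                for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
                     NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
                        if (nlh->nlmsg_type != SCSI_TRANSPORT_MSG)
                                continue;
                        struct fc_nl_event *ev = NLMSG_DATA(nlh);
                        /* real code would also check ev->event_code for the
                         * FPIN event and bound event_datalen by nlmsg_len */
                        classify_fpin(ev->event_data, ev->event_datalen);
                }
        }
        close(fd);
        return 0;
}

A real consumer such as multipathd would of course feed the FPIN-LI
classification into the marginal path group handling, and the congestion
notifications into whatever throttling mechanism we end up with, instead of
just printing them.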
On receiving the congestion notifications, our intention is to gradually
slow down the workload from the host until it stops receiving congestion
notifications. We still need to work out how such a reduction of the
workload can be achieved with the help of dm-multipath.

As Hannes mentioned in his earlier mail, our primary goal is that the admin
should first be _alerted_: FPINs showing up in the message log tell the
admin that his fabric is not performing well.

Regards,
Muneendra.