Hello Muneendra
On Wed, 2021-03-31 at 16:18 +0530, Muneendra Kumar M wrote:
Hi Martin,
Below are my replies.
If there was any discussion, I haven't been involved :-)
I haven't looked into FPIN much so far. I'm rather sceptical about its
usefulness for dm-multipath. Being a property of FC-2, FPIN works at least
two layers below dm-multipath. dm-multipath is agnostic to protocol and
transport properties by design. User space multipathd can cross these
layers and tune dm-multipath based on lower-level properties, but such
actions have rather large latencies.
As you know, dm-multipath has 3 switches for routing IO via different
paths:
1) priority groups,
2) path status (good / failed),
3) path selector algorithm.
1) and 2) are controlled by user space, and have high latency.
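To illustrate how user space drives 1) and 2): multipathd sends target
messages to the dm-mpath map through libdevmapper, roughly like the sketch
below. The map name "mpatha" and the path device "8:16" are made-up
examples, and error handling is kept minimal; the message strings are the
ones the kernel dm-mpath target understands.

/* build with: gcc -o dm_msg dm_msg.c -ldevmapper */
#include <stdio.h>
#include <libdevmapper.h>

/* send a target message (roughly as multipathd does) to a multipath map */
static int dm_message(const char *map, const char *msg)
{
        struct dm_task *dmt;
        int ret = 0;

        dmt = dm_task_create(DM_DEVICE_TARGET_MSG);
        if (!dmt)
                return 0;

        if (dm_task_set_name(dmt, map) &&
            dm_task_set_sector(dmt, 0) &&
            dm_task_set_message(dmt, msg))
                ret = dm_task_run(dmt);

        dm_task_destroy(dmt);
        return ret;
}

int main(void)
{
        /* lever 2): mark a path failed (or "reinstate_path" to undo) */
        if (!dm_message("mpatha", "fail_path 8:16"))
                fprintf(stderr, "fail_path message failed\n");

        /* lever 1): switch to priority group 2 */
        if (!dm_message("mpatha", "switch_group 2"))
                fprintf(stderr, "switch_group message failed\n");

        return 0;
}

Both levers go through an ioctl round trip into device-mapper and a table
reload/paths rescan in the worst case, which is where the high latency
mentioned above comes from.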
The current "marginal" concept in multipathd watches paths for repeated
failures, and configures the kernel to avoid using paths that are
considered marginal, using methods 1) and 2). This is a very high-latency
algorithm that changes state on a time scale of minutes.
There is no concept of "delaying" or "pausing" IO on paths on a short
time scale.
The only low-latency mechanism is 3). But it works at the block level, and
no existing selector looks at transport-level properties.
That said, I can quite well imagine a feedback mechanism based on
throttling or delays applied in the FC drivers. For example, if a remote
port were throttled by the driver in response to FPIN messages, its
bandwidth would decrease, and a path selector like "service-time"
would automatically assign less IO to such paths. This wouldn't need any
changes in dm-multipath or multipath-tools, it would work entirely on the
FC level.
[Muneendra] Agreed.
I think the only way the FC drivers can respond to this is by delaying the
R_RDY primitives, resulting in fewer credits being available for the remote
side to use. That only works on a link level, not fabric-wide. It cannot
change link speed at all, as that would bounce a port and result in all
sorts of state changes. That being said, this is already the existing
behaviour and not really tied to FPINs. The goal of the FPIN method was to
provide a more proactive mechanism and inform the OS layer of fabric issues
so it could act upon them by adjusting the IO profile.
Talking about improving the current "marginal" algorithm in multipathd,
and knowing that it's slow, FPIN might provide additional data that would
be good to have. Currently, multipathd only has 2 inputs, "good<->bad"
state transitions based either on kernel I/O errors or path checker
results, and failure statistics from multipathd's internal "io_err_stat"
thread, which only reads sector 0. This could obviously be improved, but
there may actually be lower-hanging fruit than evaluating FPIN
notifications (for example, I've pondered utilizing the kernel's blktrace
functionality to detect unusually long IO latencies or bandwidth drops).
Talking about FPIN, is it planned to notify user space about such fabric
events, and if yes, how?
[Muneendra] Yes. FC drivers, when receiving FC FPIN ELSes, call a SCSI
transport routine with the FPIN payload. The transport pushes this as an
"event" via netlink. An app bound to the local address used by the SCSI
transport can receive the event and parse it.
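For anyone who wants to experiment with this, a bare-bones listener looks
roughly like the sketch below. The multicast group value is copied from the
kernel's scsi_netlink.h (SCSI_NL_GRP_FC_EVENTS) and should be verified
against your headers; parsing of struct fc_nl_event and the embedded FPIN
ELS payload is left out.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef NETLINK_SCSITRANSPORT
#define NETLINK_SCSITRANSPORT 18        /* as defined in linux/netlink.h */
#endif
#define FC_EVENTS_GRP (1 << 2)          /* assumed: SCSI_NL_GRP_FC_EVENTS */

int main(void)
{
        struct sockaddr_nl local;
        char buf[8192] __attribute__((aligned(4)));
        int fd, len;
        ssize_t n;

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SCSITRANSPORT);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        memset(&local, 0, sizeof(local));
        local.nl_family = AF_NETLINK;
        local.nl_groups = FC_EVENTS_GRP;        /* join the FC events group */

        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
                perror("bind");
                return 1;
        }

        for (;;) {
                n = recv(fd, buf, sizeof(buf), 0);
                if (n < 0) {
                        perror("recv");
                        break;
                }
                len = (int)n;
                /*
                 * Each FC event is an nlmsghdr-framed message; the payload
                 * is a struct fc_nl_event (scsi_netlink_fc.h), which for
                 * FPIN wraps the raw ELS frame received from the fabric.
                 */
                for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
                     NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len))
                        printf("netlink msg type %u, %u bytes\n",
                               nlh->nlmsg_type, nlh->nlmsg_len);
        }
        close(fd);
        return 0;
}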
Benjamin has added a marginal path group (multipath marginal pathgroups)
in dm-multipath.
One of the intentions of Benjamin's patch (support for marginal paths) is
to support the FPIN events we receive from the fabric.
On receiving an FPIN-LI, our intention is to place all the affected paths
into the marginal path group.
I think this should all be done in kernel space, as we're talking
sub-millisecond timings here when it comes to FPINs and the reaction time
expected. I may be wrong but I'll leave that up to you.
Below are the 4 types of descriptors returned in an FPIN (a rough
classification sketch follows the list):
• Link Integrity (LI): some error on a link that affected frames,
which is the main one for a "flaky path".
• Delivery Notification (DN): something explicitly knew about a
dropped frame and is reporting it. Usually, something like a CRC error
means you can't trust the frame header, so it's an LI error. But if you
do have a valid frame and drop it, e.g. because of a fabric edge timer
(don't queue it for more than 250-600 ms), then it becomes a DN type.
Could be a flaky path, but not necessarily.
• Congestion (CN): the fabric is saying it's congested sending to
"your" port. Meaning, if a host receives it, the fabric is saying it has
more frames for the host than the host is pulling in, so it's backing up
the fabric. What should happen is that the load generated by the host is
lowered - but across all targets, and not all targets are necessarily in
the mpio path list.
• Peer Congestion (PCN): this goes along with CN in that the fabric
is now telling the other devices in the zone sending traffic to that
congested port that the port is backing up. The idea is that these peers
send less load to the congested port. It shouldn't really tie into mpio.
Some of the current thinking is that targets could see this and reduce
their transmission rate to a host to the link speed of the host.
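For illustration, classifying the descriptor tags could look roughly like
the sketch below. The tag values are the ones published in FC-LS-5 and
mirrored by the ELS_DTAG_* definitions in the kernel's fc_els.h (please
verify against your headers); the reactions in the comments are just the
ideas discussed here, nothing is implemented this way yet.

#include <stdint.h>
#include <stdio.h>

/* FPIN descriptor tags (assumed values, cf. ELS_DTAG_* in fc_els.h) */
#define DTAG_LNK_INTEGRITY 0x00020001u  /* Link Integrity (LI) */
#define DTAG_DELIVERY      0x00020002u  /* Delivery Notification (DN) */
#define DTAG_PEER_CONGEST  0x00020003u  /* Peer Congestion (PCN) */
#define DTAG_CONGESTION    0x00020004u  /* Congestion (CN) */

/* hypothetical policy hook: how a daemon might react per descriptor type */
static void handle_fpin_descriptor(uint32_t tag)
{
        switch (tag) {
        case DTAG_LNK_INTEGRITY:
                /* flaky link: candidate paths for the marginal path group */
                printf("LI: move affected paths to the marginal path group\n");
                break;
        case DTAG_DELIVERY:
                /* frame dropped (e.g. fabric edge timer); not necessarily flaky */
                printf("DN: log and correlate with LI before acting\n");
                break;
        case DTAG_CONGESTION:
                /* fabric backing up towards this host: lower the host's load */
                printf("CN: gradually throttle host workload\n");
                break;
        case DTAG_PEER_CONGEST:
                /* a peer port in the zone is congested: mostly informational */
                printf("PCN: informational, little to do for mpio\n");
                break;
        default:
                printf("unknown FPIN descriptor tag 0x%08x\n", tag);
        }
}

int main(void)
{
        handle_fpin_descriptor(DTAG_LNK_INTEGRITY);
        handle_fpin_descriptor(DTAG_CONGESTION);
        return 0;
}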
On receiving the congestion notifications, our intention is to slow down
the workload gradually on the host until it stops receiving congestion
notifications.
We still need to validate how we can achieve this decrease of the
workload with the help of dm-multipath.
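Very roughly, the behaviour we have in mind would look something like the
hypothetical sketch below: cut the host's outstanding-IO budget whenever a
congestion notification arrives in an interval, and recover slowly while
the fabric stays quiet. The structure and constants are invented for
illustration; nothing like this exists in dm-multipath today.

#include <stdbool.h>

/* hypothetical AIMD-style controller for congestion feedback */
struct io_budget {
        unsigned int limit;     /* current cap on outstanding IO */
        unsigned int floor;     /* never throttle below this */
        unsigned int ceiling;   /* normal, unthrottled cap */
};

/* called once per evaluation interval */
void adjust_budget(struct io_budget *b, bool congestion_seen)
{
        if (congestion_seen) {
                /* multiplicative decrease while CNs keep arriving */
                b->limit /= 2;
                if (b->limit < b->floor)
                        b->limit = b->floor;
        } else {
                /* additive increase once the notifications stop */
                if (b->limit < b->ceiling)
                        b->limit++;
        }
}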
Would it be possible to piggyback on the service-time path selector for
this where latency is concerned?
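For reference, the idea behind "service-time" is essentially to send the
next IO to the path with the smallest outstanding-work / throughput ratio,
so anything that lowers a congested path's effective throughput (or its
relative weight) automatically shifts IO away from it. A conceptual sketch
of that criterion, not the in-kernel code:

#include <stddef.h>

struct path_load {
        size_t in_flight;            /* bytes currently outstanding on the path */
        unsigned int rel_throughput; /* relative weight; lower = less preferred */
};

/* prefer a over b if a's in_flight/rel_throughput ratio is smaller
 * (cross-multiplied to avoid division) */
static int prefer_a(const struct path_load *a, const struct path_load *b)
{
        return (unsigned long long)a->in_flight * b->rel_throughput <
               (unsigned long long)b->in_flight * a->rel_throughput;
}

int main(void)
{
        struct path_load healthy   = { .in_flight = 1 << 20, .rel_throughput = 100 };
        struct path_load congested = { .in_flight = 1 << 20, .rel_throughput = 25 };

        /* with equal queues, the de-weighted (congested) path loses */
        return prefer_a(&healthy, &congested) ? 0 : 1;
}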
Another thing is that at some stage the IO queueing decision needs to take
the various FPIN descriptors into account. A remote delivery notification
due to slow-drain behaviour is very different from ISL congestion or any
physical issue.
As Hannes mentioned in his earlier mail, our primary goal is that the
admin is first _alerted_, by having FPINs show up in the message log,
that the fabric is not performing well.
This is a bit of a reactive approach and should be a secondary objective.
Having been in storage/FC support for 20 years, I know that most admins
are not really responsive to this, and taking action based on event log
entries takes a very, very long time. From an operations perspective, any
sort of manual action should be avoided as much as possible.
Regards,
Muneendra.