Hello,
I'm working on device-mapper multipath (dm-multipath).
This patch set adds a new hook to device-mapper for deciding the health of
a multipath path, which helps achieve deterministic application IO throughput.
This patch set has had preliminary testing on active-active storage with 2 paths.
The patch set still needs work and is not ready for inclusion.
I'm posting it because I'd like to get comments on the high-level
design before going further into the details.
This patch set should be applied on top of 3.10.0 #18.
====================================================================
Background
=-=-=-=-=-=
• “Sick but not Dead” MPIO path
‒ The path goes into the Failed state because of a path IO error seen by the DM driver
‒ When the multipath daemon issues a TUR (Test Unit Ready) command and finds the failed path healthy, it moves the same path back to the Active state
‒ The path repeatedly toggles between the Failed and Active states
• DM IO is retried on a path where we are hitting multiple errors
• This causes erratic (non-deterministic) application IO throughput
The existing DM layer does not consider the number of errors when deciding the health of a path.
Since the failed path becomes active as soon as the TUR command succeeds, the end user will
assume that all the paths are in a good state.
When we ran field tests with this scenario, we saw non-deterministic IO throughput.
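As a rough illustration of the toggling described above, here is a small user-space Python model. It is not kernel or multipathd code: the function name simulate_flapping and the simplification that the path checker runs once before every IO are assumptions made only for this sketch.

```python
def simulate_flapping(io_results, tur_ok=True):
    """Return the path state changes for a sequence of IO outcomes.

    io_results: iterable of booleans, True if the IO on this path succeeded.
    tur_ok:     True if the checker's TUR command reports the path healthy.
    Assumption: the checker runs once before every IO (real polling is timed).
    """
    state = "active"
    history = []
    for io_ok in io_results:
        if state == "failed" and tur_ok:
            # Checker sees a healthy TUR and reinstates the path at once.
            state = "active"
            history.append(state)
        if not io_ok:
            # DM fails the path as soon as it sees an IO error.
            state = "failed"
            history.append(state)
    return history


# A path that passes TUR but keeps returning IO errors flaps forever.
print(simulate_flapping([False, True, False, False]))
```

With a TUR that always succeeds, every IO error produces one more Failed/Active round trip, which is the erratic behaviour the Background section describes.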
=====================================================================
Design Overview
=-=-=-=-=-=-=-=-=
• Deterministically bring the path to the “Faulty” state
‒ Configure per-DM device data with
• an IO error threshold and a time window within which the threshold must be hit
‒ Declare a path Faulty when the error threshold is hit within the configured time window
‒ Keep the path in the failed state for a predefined time, configured by the administrator in the config file
‒ Even if the multipath daemon validates the path with a TUR command that succeeds
and tries to reinstate the path, ignore the reinstate request for the predefined time if the error threshold was hit
• Give the administrator time to correct the “Sick but not Dead” path and bring the path back to Active
• Automatically return a Faulty path to the Active state after a fixed duration (given as config data for each DM device)
‒ The admin can set the deterministic MPIO behavior on a per-DM device basis
- This implies the failed path will be reinstated either by the admin or when the timeout expires
• The above configs will be made persistent across server reboots
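The policy in the list above can be sketched as a user-space Python model. This is only an illustration under stated assumptions, not the patch itself: the class PathHealth and the parameters err_threshold, err_window and hold_time are hypothetical names, and the methods merely borrow the names fail_path and reinstate_path from dm-mpath.c without reproducing their kernel signatures.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PathHealth:
    """User-space model of the proposed per-path error accounting."""
    err_threshold: int   # hypothetical: IO errors that mark the path Faulty
    err_window: float    # hypothetical: seconds in which the threshold must be hit
    hold_time: float     # hypothetical: seconds a Faulty path stays failed
    active: bool = True
    faulty_until: float = 0.0
    _errors: deque = field(default_factory=deque, init=False, repr=False)

    def fail_path(self, now: float) -> None:
        """Record an IO error at time `now` and fail the path."""
        self.active = False
        self._errors.append(now)
        # Forget errors that have fallen out of the time window.
        while self._errors and now - self._errors[0] > self.err_window:
            self._errors.popleft()
        if len(self._errors) >= self.err_threshold:
            # Threshold hit inside the window: hold the path down.
            self.faulty_until = now + self.hold_time
            self._errors.clear()

    def reinstate_path(self, now: float, by_admin: bool = False) -> bool:
        """Handle a reinstate request, e.g. after a successful TUR."""
        if not by_admin and now < self.faulty_until:
            return False  # request ignored while the hold-down is running
        self.faulty_until = 0.0
        self.active = True
        return True


path = PathHealth(err_threshold=3, err_window=10.0, hold_time=60.0)
for t in (0.0, 2.0):
    path.fail_path(now=t)
    path.reinstate_path(now=t + 1.0)   # below threshold: reinstated as today
path.fail_path(now=4.0)                # third error within 10 s: path is Faulty
print(path.reinstate_path(now=5.0))    # False, the TUR-driven reinstate is ignored
print(path.reinstate_path(now=64.0))   # True, the hold time has expired
```

In this model an administrator can still bring the path back early by calling reinstate_path with by_admin=True, matching the "either by the admin or when the timeout expires" point above.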
Expected benefits:
- Deterministic application IO throughput.
- The administrator gets time to analyze the path failure and recover the path.
- User-space tools need minimal changes.
The above feature will be enabled only if the corresponding variables are defined in multipath.conf.
Since these changes are independent of the underlying algorithms used in the dm layer,
they are applied in dm.c and dm-mpath.c.
alloc_dev(), reinstate_path(), parse_path() and fail_path() are the functions that will be changed.
We would appreciate more comments on this, as we have started testing and the results look deterministic.
Regards,
Muneendra.
On Thu, Dec 15, 2016 at 3:00 PM, muneendra kumar <muneendra737@xxxxxxxxx> wrote:
Hello,
This is the area I am currently working on.
On Mon, Dec 5, 2016 at 9:35 PM, muneendra kumar <muneendra737@xxxxxxxxx> wrote:
Thanks a lot for sharing the info.
I will discuss the problem in detail in my earlier mail,
Regards,
Muneendra.
On Mon, Dec 5, 2016 at 5:45 PM, Zdenek Kabelac <zkabelac@xxxxxxxxxx> wrote:
On 5.12.2016 at 07:29, muneendra kumar wrote:
Hi,
This is a general question.
If I make changes in both the multipath tools and the dm driver (kernel),
how do I push my changes into mainline?
Can someone explain the process to me? It would help me a lot.
Hi
You propose your changes here on the list - you get a review and
if the patches are found useful - the maintainer of the dm subsystem
will accept them.
Note - it's usually better to ask and discuss 'ahead' what your problem is
and how you want to improve/fix it.
That way you avoid losing time implementing an unacceptable patch.
Regards
Zdenek
-- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel