Re: [PATCH] ceph: make osd_request_timeout changable online in debugfs

Dongsheng Yang <dongsheng.yang@xxxxxxxxxxxx> · Mon, 28 May 2018 17:25:27 +0800

Hi Ilya,

On 05/28/2018 05:13 PM, Ilya Dryomov wrote:
On Sat, May 26, 2018 at 3:21 AM, Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
[Resending because of an SMTP error; please ignore this if you have
already received it.]

Hi Ilya,
     I think there is no conflict between this patch and -o full-force.
We can use them in different use cases.

(1) This patch is simple.
        When we need to fix the problem of unmounting a filesystem and
unmapping a device while the ceph cluster is unavailable in production,
we want the logic to be as simple as possible, so that the chance of
introducing a regression is very small. This matters especially when we
need to backport commits to stable branches.
This patch is simple only because it doesn't handle all cases.  As you
acknowledged, it doesn't deal with requests that are stuck on exclusive
lock at all.  Once you make it do that, it won't be any simpler than
-o full-force patch.

The fundamental problem with this patch is that it introduces a timeout
at the wrong level.  If we were to add such a timeout, it would need to
work at the rbd level and not within libceph, because there are things
at the rbd level that need handling when the timeout is fired.

(2) When we don't want to change the original logic of user applications.
        Consider the work needed in higher-level applications: if we use
full-force to solve the problem, we need to change the user applications.
For example, k8s unmounts the fs first and then detaches the disk, and it
is not easy to change that framework.

(3) When we don't want to implement another "timeout and retry with
full-force"
        As we discussed regarding full-force, IIUC, we don't have to use
full-force at first; we should try the normal way, and retry with
full-force when it times out. For example, as you mentioned, we can retry
when we get a specified signal while systemd is shutting down. But in some
other use cases, we have to implement this timeout-and-retry mechanism.
-o full-force is a mechanism; "try the normal way, and retry with
full-force when it times out" is a policy.  One may want to do something
else before forcing, leave it up to the user, etc. Or, if the cluster
is known to be permanently unavailable, force without a timeout.
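
To illustrate the mechanism/policy split, here is a minimal userspace
sketch of one possible policy layered on top of a force mechanism.
Everything in it is hypothetical: `normal` and `forced` stand in for
whatever actually performs the unmount/unmap (they are not real ceph or
rbd interfaces), and `cluster_known_dead` models the "permanently
unavailable" case.

```python
def release_with_fallback(normal, forced, timeout_s, cluster_known_dead=False):
    """Try the normal teardown path, then fall back to the forced one.

    `normal` and `forced` are hypothetical callables supplied by the
    caller. Each takes a timeout in seconds (or None for "no timeout")
    and raises TimeoutError if it does not finish in time.
    Returns "normal" or "forced" depending on which path completed.
    """
    if cluster_known_dead:
        # Policy choice: no point waiting on a cluster that is gone.
        forced(None)
        return "forced"
    try:
        normal(timeout_s)
        return "normal"
    except TimeoutError:
        # Policy choice: retry with force once the normal path times out.
        forced(None)
        return "forced"
```

A different user could keep the same two callables and swap in another
policy (ask the operator, run cleanup first, never force), which is the
reason for keeping the policy out of the mechanism.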

And yes, there are some other cases where this patch is not suitable, for
example, when the system doesn't have debugfs mounted.

So I think we can merge this patch upstream, and continue to implement
full-force.
I'm not opposed to wiring up an rbd level timeout, but in order to be
merged the code must handle all cases.  The reason I suggested to wait
for -o full-force is that it should take care of the hard stuff and
make implementing a proper timeout handler much easier.

Okay, I agree with this point. Let's wait for full-force and
then decide whether it is necessary to implement an rbd level timeout.

Thanx :)

Thanks,

                 Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
