Re: [PATCH] ceph: make osd_request_timeout changable online in debugfs

Ilya Dryomov <idryomov@xxxxxxxxx> · Mon, 28 May 2018 11:13:23 +0200

On Sat, May 26, 2018 at 3:21 AM, Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
> [Resent because of an SMTP error; please ignore this if you have already
> received it.]
>
> Hi Ilya,
>     I think there is no conflict between this patch and -o full-force. We
> can use them in different use cases.
>
> (1) This patch is simple.
>     When we fix the problem of unmounting the fs and unmapping the device
> while the ceph cluster is unavailable in production, we want the logic to
> be as simple as possible, so that there is very little chance of
> introducing a regression. This matters especially when we need to backport
> commits to stable branches.

This patch is simple only because it doesn't handle all cases.  As you
acknowledged, it doesn't deal with requests that are stuck on exclusive
lock at all.  Once you make it do that, it won't be any simpler than
the -o full-force patch.

The fundamental problem with this patch is that it introduces a timeout
at the wrong level.  If we were to add such a timeout, it would need to
work at the rbd level and not within libceph, because there are things
at the rbd level that need handling when the timeout fires.

>
> (2) When we don't want to change the original logic of user applications.
>     Compare the work needed in higher-level applications: if we use
> full-force to solve the problem, we need to change the user applications.
> For example, k8s unmounts the fs first and then detaches the disk, and it
> is not easy to change that framework.
>
> (3) When we don't want to implement another "timeout and retry with
> full-force".
>     As we discussed regarding full-force, IIUC, we don't have to use
> full-force at first, but we should try it with normal way, and retry with
> full-force when a timedout. For example, you mentioned that we can retry
> when we get a specific signal during systemd shutdown. But in some other
> use cases, we would have to implement this timeout and retry mechanism
> ourselves.

-o full-force is a mechanism, "try it with normal way, and retry with
full-force when a timedout" is a policy.  One may want to do something
else before forcing, leave it up to the user, etc. Or, if the cluster
is known to be permanently unavailable, force without a timeout.

>
> And yes, there are some other cases where this patch is not suitable, for
> example, when the system doesn't have debugfs mounted.
>
> So I think we can merge this patch into upstream, but continue to implement
> full-force.

I'm not opposed to wiring up an rbd level timeout, but in order to be
merged the code must handle all cases.  I suggested waiting for
-o full-force because it should take care of the hard stuff and make
implementing a proper timeout handler much easier.

Thanks,

                Ilya