Re: [PATCH] ceph: make osd_request_timeout changeable online in debugfs

On 24/05/2018 22:55, Ilya Dryomov wrote:
On Thu, May 24, 2018 at 4:19 PM, 杨东升 <dongsheng.yang@xxxxxxxxxxxx> wrote:
Hi Ilya,

At 2018-05-24 21:17:55, "Ilya Dryomov" <idryomov@xxxxxxxxx> wrote:
On Thu, May 24, 2018 at 11:15 AM, Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
On 05/24/2018 04:23 PM, Ilya Dryomov wrote:
On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
The default value of osd_request_timeout is 0, which means requests
never time out. We can set this value when mapping an rbd device with
-o "osd_request_timeout=XX", but we can't change it online.
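
For illustration, the intended usage looks roughly like this (the
osd_request_timeout debugfs file is what this patch proposes, not a
merged interface; the <fsid>.client<id> directory is the per-client
debugfs directory libceph already creates):

    # Map with the default (no timeout): requests block forever if the
    # cluster becomes unreachable.
    rbd map rbd/myimage

    # Today the timeout can only be set at map time:
    rbd map -o osd_request_timeout=60 rbd/myimage

    # With this patch it could be changed online, e.g.:
    echo 60 > /sys/kernel/debug/ceph/<fsid>.client<id>/osd_request_timeout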
Hi Dongsheng,

Changing just osd_request_timeout won't do anything about outstanding
requests waiting for the acquisition of the exclusive lock.  This is an
rbd problem and should be dealt with in rbd.c.
Yes, we are using images without exclusive-lock.

And yes, on -ETIMEDOUT the code will currently try to acquire the lock
again and again; we need to handle that properly.
Well, any change that is pushed upstream needs to work with the full set
of features.
Yes, agreed. I will fix it later, after our discussion below.

[Question 1]: Why do we need to set osd_request_timeout?
When we are going to reboot a node that has krbd devices mapped, even
with rbdmap.service enabled, the shutdown will block if the ceph
cluster is not working.

In particular, if we have three controller nodes running as ceph mons,
and at the same time some k8s pods with krbd devices on those nodes,
then we can't shut down the last controller node when we want to shut
down all the nodes, because by the time we get to the last controller
node, the ceph cluster is no longer reachable.
Why can't rbd images be unmapped in a proper way before the cluster is
shut down?
That's an option, and it's another solution in our plan. But it can't
solve the problem when the ceph cluster is broken and we can't recover
it soon: in that case we can't reboot the nodes, and all the I/O
threads are stuck in D state.
That is (hopefully!) an extreme case.  Rebooting nodes with I/O threads
in D state is a sure way to lose data.

Wouldn't implementing proper unmapping somewhere in the orchestration
layer make this problem pretty much go away?
It's possible, and we are doing that work. But this patch is actually
aimed at fixing the last problem you mentioned.........

[Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"?
We don't want osd_request_timeout set for the whole lifecycle of the
rbd device: a networking problem or a cluster recovery could then make
requests time out, which would turn the fs read-only and bring the
application down.

[Question 3]: How does this patch solve these problems?
With this patch, we can map the rbd device with the default value of
osd_request_timeout, meaning never time out, which solves the problem
mentioned in Question 2.

At the same time, we can set osd_request_timeout to whatever we need
while the system is shutting down, for example from rbdmap.service.
Then we can be sure the host can shut down or reboot normally whether
the ceph cluster is working well or not. This solves the problem
mentioned in Question 1.
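
As a rough sketch of that shutdown sequence (the osd_request_timeout
debugfs file is the hypothetical one this patch would add, and 60
seconds is only an example value):

    # Cap outstanding OSD requests for every ceph client on this host,
    # so blocked I/O fails instead of hanging forever.
    for c in /sys/kernel/debug/ceph/*.client*; do
        echo 60 > "$c/osd_request_timeout"   # file added by this patch
    done

    # Now the usual teardown can make progress even with the cluster down.
    umount /mnt/rbd0
    rbd unmap /dev/rbd0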
The plan is to add a new unmap option, so one can do "rbd unmap -o
full-force", as noted in https://tracker.ceph.com/issues/20927.  This
is an old problem but it's been blocked on various issues in libceph.
Most of them look like they will be resolved in 4.18, bringing better
support for "umount -f" in kcephfs.  We should be able to reuse that
work for "rbd unmap -o full-force".
But even if we have a full-force unmap for the rbd device, we need to
umount the fs first, e.g. ext4. If we umount the fs before rbd unmap,
the umount will block and the umount process will be in D state.
I think there are two separate issues here.  The first is that there is
no way to say "I don't care about possible data loss, unmap this device
no matter what".  This is what -o full-force is targeted at.  As I said
in the github PR, you wouldn't need to umount before unmapping with it.
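
For illustration, teardown with the proposed option would collapse to
a single command (hedged, since full-force is only proposed in the
tracker issue above and does not exist yet):

    # Proposed: fail all outstanding requests and detach the device,
    # no prior umount required.
    rbd unmap -o full-force /dev/rbd0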

The second issue is what to do on reboot if the cluster is unavailable,
i.e. _whether_ and when to raise the axe and fail outstanding requests.
The problem is that there is no way to distinguish between a transient
hold-up and a full outage.  I'm not sure whether a default timeout will
suit everybody, because people get really unhappy when they lose data.
..... Yes, this is what we want to solve.  Our design is to set the
timeout from an rbdmap.service setting and then do the umount and unmap.

I agree that it's difficult to decide on a default timeout, so we give
the user an option to decide it, and set the default to 0.

https://github.com/yangdongsheng/ceph/commit/8dceacac1e9c707f011157b433f0cbd1c7053f1e#diff-59215bb23b07b89b90f1cd0e02abb8d9R10
Your github PR was setting a default timeout of 60 seconds for
everybody -- that is what I reacted to.

HA, yes, that was not a good idea.

Then I think it at least won't make things worse, and we'd have an
optional way to solve this kind of problem if wanted. What do you think?
I won't object to a new rbdmap.service configurable, but I think it
should be implemented in terms of -o full-force, with the timeout logic
based on systemd's mechanisms.  IIRC if the service doesn't stop in
time, systemd will send a SIGTERM followed by a SIGKILL after another
timeout.  I think it could be as simple as "unmap everything with -o
full-force" on SIGKILL or something like that.

Let's wait for -o full-force and then reevaluate.

Okay, let's wait for full-force.

thanx

Thanks,

                 Ilya