On Thu, May 24, 2018 at 4:19 PM, 杨东升 <dongsheng.yang@xxxxxxxxxxxx> wrote:
> Hi Ilya,
>
> At 2018-05-24 21:17:55, "Ilya Dryomov" <idryomov@xxxxxxxxx> wrote:
>>On Thu, May 24, 2018 at 11:15 AM, Dongsheng Yang
>><dongsheng.yang@xxxxxxxxxxxx> wrote:
>>> On 05/24/2018 04:23 PM, Ilya Dryomov wrote:
>>>>
>>>> On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang
>>>> <dongsheng.yang@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> The default value of osd_request_timeout is 0, which means never
>>>>> time out, and we can set this value when mapping rbd with -o
>>>>> "osd_request_timeout=XX". But we can't change this value online.
>>>>
>>>> Hi Dongsheng,
>>>>
>>>> Changing just osd_request_timeout won't do anything about outstanding
>>>> requests waiting for the acquisition of the exclusive lock.  This is
>>>> an rbd problem and should be dealt with in rbd.c.
>>>
>>> Yes, we are using images without exclusive-lock.
>>>
>>> And yes, on -ETIMEDOUT it currently tries to acquire the lock again
>>> and again; we need to handle that properly.
>>
>>Well, any change that is pushed upstream needs to work with the full set
>>of features.
>
> Yes, agreed. Will fix it later after our discussion below.
>>
>>>>> [Question 1]: Why do we need to set osd_request_timeout?
>>>>> When we are going to reboot a node which has krbd devices mapped,
>>>>> even with rbdmap.service enabled, we will be blocked in shutting
>>>>> down if the ceph cluster is not working.
>>>>>
>>>>> Especially, if we have three controller nodes which are running as
>>>>> ceph mons, but at the same time there are some k8s pods with krbd
>>>>> devices on these nodes, then we can't shut down the last controller
>>>>> node when we want to shut down all nodes, because by the time we
>>>>> shut down the last controller node, the ceph cluster is no longer
>>>>> reachable.
>>>>
>>>> Why can't rbd images be unmapped in a proper way before the cluster
>>>> is shut down?
>>>
>>> That's optional, that's another solution in our plan. But that
>>> solution can't solve the problem if the ceph cluster is not working
>>> and we can't recover it soon: then we can't reboot the nodes and all
>>> the I/O threads are in D state.
>>
>>That is (hopefully!) an extreme case.  Rebooting nodes with I/O threads
>>in D state is a sure way to lose data.
>>
>>Wouldn't implementing proper unmapping somewhere in the orchestration
>>layer make this problem pretty much go away?
>
> It's possible and we are doing that work. But this patch is actually
> going to fix the last problem you mentioned.........
>>
>>>>> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"?
>>>>> We don't want to set osd_request_timeout for the whole lifecycle of
>>>>> the rbd device; some networking problem or cluster recovery could
>>>>> make requests time out, which would make the fs read-only and take
>>>>> the application down.
>>>>>
>>>>> [Question 3]: How does this patch solve these problems?
>>>>> With this patch, we can map the rbd device with the default value
>>>>> of osd_request_timeout, which means never time out; that solves the
>>>>> problem mentioned in Question 2.
>>>>>
>>>>> At the same time we can set osd_request_timeout to whatever we need
>>>>> while the system is shutting down, for example from rbdmap.service.
>>>>> Then we can make sure we can shut down or reboot the host normally
>>>>> no matter whether the ceph cluster is working or not. This solves
>>>>> the problem mentioned in Question 1.
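
For reference, the map-time option being discussed is used roughly like
this today (the pool and image names are placeholders):

    # map with the default, osd_request_timeout=0: requests never time out
    rbd map rbd/myimage

    # map with a 60-second OSD request timeout for the whole lifetime of
    # the mapping; requests that exceed it are failed with -ETIMEDOUT
    rbd map rbd/myimage -o osd_request_timeout=60

The patch under discussion keeps the first form at map time and only
introduces a timeout at shutdown.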
>>>>
>>>> The plan is to add a new unmap option, so one can do "rbd unmap -o
>>>> full-force", as noted in https://tracker.ceph.com/issues/20927.  This
>>>> is an old problem but it's been blocked on various issues in libceph.
>>>> Most of them look like they will be resolved in 4.18, bringing better
>>>> support for "umount -f" in kcephfs.  We should be able to reuse that
>>>> work for "rbd unmap -o full-force".
>>>
>>> But even if we have a full-force unmap for the rbd device, we need to
>>> umount the fs first, e.g. ext4. When we umount the fs before rbd
>>> unmap, we will be blocked and the umount process will be in D state.
>>
>>I think there are two separate issues here.  The first is that there is
>>no way to say "I don't care about possible data loss, unmap this device
>>no matter what".  This is what -o full-force is targeted at.  As I said
>>in the github PR, you wouldn't need to umount before unmapping with it.
>>
>>The second issue is what to do on reboot if the cluster is unavailable,
>>i.e. _whether_ and when to raise the axe and fail outstanding requests.
>>The problem is that there is no way to distinguish between a transient
>>holdup and a full outage.  I'm not sure whether a default timeout will
>>suit everybody, because people get really unhappy when they lose data.
>
> ..... Yes, this is what we want to solve. We plan to set the timeout
> from the rbdmap.service settings and then do a umount and unmap.
>
> I agree that it's difficult to decide on a default timeout for it. So
> we give the user an option to decide it, and set the default to 0.
>
> https://github.com/yangdongsheng/ceph/commit/8dceacac1e9c707f011157b433f0cbd1c7053f1e#diff-59215bb23b07b89b90f1cd0e02abb8d9R10

Your github PR was setting a default timeout of 60 seconds for everybody
-- that is what I reacted to.

> Then I think that will not make things worse at least, but we have an
> optional way to solve this kind of problem if you want. What do you
> think?

I won't object to a new rbdmap.service configurable, but I think it
should be implemented in terms of -o full-force, with the timeout logic
based on systemd's mechanisms.  IIRC if the service doesn't stop in time,
systemd will send a SIGTERM followed by a SIGKILL after another timeout.
I think it could be as simple as "unmap everything with -o full-force" on
SIGKILL or something like that.

Let's wait for -o full-force and then reevaluate.

Thanks,

                Ilya
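
To make the systemd side of that concrete, here is a rough sketch of what
an rbdmap.service drop-in could look like once -o full-force exists.  It
is purely illustrative: -o full-force is only a proposed option at this
point, and the drop-in path, helper script name and timeout value below
are made up, not part of any existing patch.

    # /etc/systemd/system/rbdmap.service.d/override.conf (hypothetical)
    [Service]
    # bound the graceful "umount + rbd unmap" stop path; past this,
    # systemd escalates (SIGTERM, then SIGKILL after another timeout)
    TimeoutStopSec=60
    # cleanup that systemd runs even if the normal stop commands failed
    # or timed out: a wrapper that would do "rbd unmap -o full-force" on
    # any device still mapped, failing outstanding requests so shutdown
    # can proceed
    ExecStopPost=/usr/local/sbin/rbd-unmap-all --full-force

However the exact wiring ends up looking, the point is that the timeout
policy would live in systemd rather than in a per-device
osd_request_timeout.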