On Thu, May 24, 2018 at 4:19 PM, 杨东升 <dongsheng.yang@xxxxxxxxxxxx> wrote:
> Hi Ilya,
>
> At 2018-05-24 21:17:55, "Ilya Dryomov" <idryomov@xxxxxxxxx> wrote:
>>On Thu, May 24, 2018 at 11:15 AM, Dongsheng Yang
>><dongsheng.yang@xxxxxxxxxxxx> wrote:
>>> On 05/24/2018 04:23 PM, Ilya Dryomov wrote:
>>>>
>>>> On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang
>>>> <dongsheng.yang@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> The default value of osd_request_timeout is 0, which means never
>>>>> time out, and we can set this value when mapping rbd with -o
>>>>> "osd_request_timeout=XX". But we can't change this value online.
>>>>
>>>> Hi Dongsheng,
>>>>
>>>> Changing just osd_request_timeout won't do anything about outstanding
>>>> requests waiting for the acquisition of the exclusive lock.  This is
>>>> an rbd problem and should be dealt with in rbd.c.
>>>
>>> Yes, we are using images without exclusive-lock.
>>>
>>> And yes, on -ETIMEDOUT it currently tries to acquire the lock again
>>> and again; we need to handle that properly.
>>
>>Well, any change that is pushed upstream needs to work with the full set
>>of features.
>
> Yes, agreed. Will fix it later after our discussion below.
>>
>>>>> [Question 1]: Why do we need to set osd_request_timeout?
>>>>> When we are going to reboot a node which has krbd devices mapped,
>>>>> even with rbdmap.service enabled, we will be blocked in shutting
>>>>> down if the ceph cluster is not working.
>>>>>
>>>>> Especially, if we have three controller nodes which are running as
>>>>> ceph mons, but at the same time there are some k8s pods with krbd
>>>>> devices on these nodes, then we can't shut down the last controller
>>>>> node when we want to shut down all nodes, because by the time we
>>>>> shut down the last controller node, the ceph cluster is no longer
>>>>> reachable.
>>>>
>>>> Why can't rbd images be unmapped in a proper way before the cluster
>>>> is shut down?
>>>
>>> That's optional, that's another solution in our plan. But that
>>> solution can't solve the problem if the ceph cluster is not working
>>> and we can't recover it soon: then we can't reboot the nodes and all
>>> the I/O threads are in D state.
>>
>>That is (hopefully!) an extreme case.  Rebooting nodes with I/O threads
>>in D state is a sure way to lose data.
>>
>>Wouldn't implementing proper unmapping somewhere in the orchestration
>>layer make this problem pretty much go away?
>
> It's possible and we are doing that work. But this patch is actually
> going to fix the last problem you mentioned.........
>>
>>>>> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"?
>>>>> We don't want to set osd_request_timeout for the whole lifecycle of
>>>>> the rbd device; some networking problem or cluster recovery could
>>>>> make requests time out, which would make the fs read-only and take
>>>>> the application down.
>>>>>
>>>>> [Question 3]: How does this patch solve these problems?
>>>>> With this patch, we can map the rbd device with the default value
>>>>> of osd_request_timeout, which means never time out; that solves the
>>>>> problem mentioned in Question 2.
>>>>>
>>>>> At the same time we can set osd_request_timeout to whatever we need
>>>>> while the system is shutting down, for example from rbdmap.service.
>>>>> Then we can make sure we can shut down or reboot the host normally
>>>>> no matter whether the ceph cluster is working or not. This solves
>>>>> the problem mentioned in Question 1.
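
For reference, the map-time option being discussed is used roughly like
this today (the pool and image names are placeholders):

    # map with the default, osd_request_timeout=0: requests never time out
    rbd map rbd/myimage

    # map with a 60-second OSD request timeout for the whole lifetime of
    # the mapping; requests that exceed it are failed with -ETIMEDOUT
    rbd map rbd/myimage -o osd_request_timeout=60

The patch under discussion keeps the first form at map time and only
introduces a timeout at shutdown.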
>>>>
>>>> The plan is to add a new unmap option, so one can do "rbd unmap -o
>>>> full-force", as noted in https://tracker.ceph.com/issues/20927.  This
>>>> is an old problem but it's been blocked on various issues in libceph.
>>>> Most of them look like they will be resolved in 4.18, bringing better
>>>> support for "umount -f" in kcephfs.  We should be able to reuse that
>>>> work for "rbd unmap -o full-force".
>>>
>>> But even if we have a full-force unmap for the rbd device, we need to
>>> umount the fs first, e.g. ext4. When we umount the fs before rbd
>>> unmap, we will be blocked and the umount process will be in D state.
>>
>>I think there are two separate issues here.  The first is that there is
>>no way to say "I don't care about possible data loss, unmap this device
>>no matter what".  This is what -o full-force is targeted at.  As I said
>>in the github PR, you wouldn't need to umount before unmapping with it.
>>
>>The second issue is what to do on reboot if the cluster is unavailable,
>>i.e. _whether_ and when to raise the axe and fail outstanding requests.
>>The problem is that there is no way to distinguish between a transient
>>holdup and a full outage.  I'm not sure whether a default timeout will
>>suit everybody, because people get really unhappy when they lose data.
>
> ..... Yes, this is what we want to solve. We plan to set the timeout
> from the rbdmap.service settings and then do a umount and unmap.
>
> I agree that it's difficult to decide on a default timeout for it. So
> we give the user an option to decide it, and set the default to 0.
>
> https://github.com/yangdongsheng/ceph/commit/8dceacac1e9c707f011157b433f0cbd1c7053f1e#diff-59215bb23b07b89b90f1cd0e02abb8d9R10

Your github PR was setting a default timeout of 60 seconds for everybody
-- that is what I reacted to.

> Then I think that will not make things worse at least, but we have an
> optional way to solve this kind of problem if you want. What do you
> think?

I won't object to a new rbdmap.service configurable, but I think it
should be implemented in terms of -o full-force, with the timeout logic
based on systemd's mechanisms.  IIRC if the service doesn't stop in time,
systemd will send a SIGTERM followed by a SIGKILL after another timeout.
I think it could be as simple as "unmap everything with -o full-force" on
SIGKILL or something like that.

Let's wait for -o full-force and then reevaluate.

Thanks,

                Ilya
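
To make the systemd side of that concrete, here is a rough sketch of what
an rbdmap.service drop-in could look like once -o full-force exists.  It
is purely illustrative: -o full-force is only a proposed option at this
point, and the drop-in path, helper script name and timeout value below
are made up, not part of any existing patch.

    # /etc/systemd/system/rbdmap.service.d/override.conf (hypothetical)
    [Service]
    # bound the graceful "umount + rbd unmap" stop path; past this,
    # systemd escalates (SIGTERM, then SIGKILL after another timeout)
    TimeoutStopSec=60
    # cleanup that systemd runs even if the normal stop commands failed
    # or timed out: a wrapper that would do "rbd unmap -o full-force" on
    # any device still mapped, failing outstanding requests so shutdown
    # can proceed
    ExecStopPost=/usr/local/sbin/rbd-unmap-all --full-force

However the exact wiring ends up looking, the point is that the timeout
policy would live in systemd rather than in a per-device
osd_request_timeout.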