On 9/1/22 4:29 PM, zxcs wrote:
Thanks a lot, xiubo!!!
This time we still restarted the MDS to fix it, because a user urgently needed
to list /path/to/A/. I will try to capture the MDS debug log if we hit it again.
Also, we haven't tried flushing the MDS journal before; are there any side
effects to doing this? This cephfs cluster is a production environment, so we
need to be very careful with anything we do.
No side effects. The MDS will do the journal flush sooner or later anyway.
But I am afraid this won't help with this issue; still, it is worth a try.
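For reference, the journal of an active MDS can be flushed at runtime from the admin socket or via ceph tell; mds.a below is only a placeholder for the actual daemon name or rank:

    # flush the journal of a running MDS (replace mds.a with your daemon name or rank)
    ceph tell mds.a flush journal
    # the same command is also available on the local admin socket of the MDS host
    ceph daemon mds.a flush journal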
I also read the details of this bug, https://tracker.ceph.com/issues/50840;
the previous mail mentions it is fixed in 15.2.17. We use ceph-deploy to
deploy (and upgrade) cephfs, and it seems the latest ceph version it offers
is 15.2.16? I will report back whether the upgrade fixes it.
We will try the flush mds journal option when we hit this bug next time (if
no user urgently needs to list the directory). It seems to be 100%
reproducible these days. Thanks all!
If possible, please enable 'debug_mds = 10' and 'debug_ms = 1'.
Thanks!
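If it helps, these debug levels can usually be raised at runtime without restarting anything; mds.a below is again just a placeholder for the daemon name:

    # raise the debug levels for all MDS daemons via the mon config database
    ceph config set mds debug_mds 10
    ceph config set mds debug_ms 1
    # or inject them into a single running daemon only
    ceph tell mds.a injectargs '--debug_mds 10 --debug_ms 1'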
Thanks,
zx
On Aug 31, 2022, at 15:23, Xiubo Li <xiubli@xxxxxxxxxx> wrote:
On 8/31/22 2:43 PM, zxcs wrote:
Hi, experts
We have a cephfs (15.2.13) cluster with kernel mounts. When 2000+ processes
read from one ceph path (call it /path/to/A/), all of those processes hang
and ls -lrth /path/to/A/ gets stuck, while listing other directories
(e.g. /path/to/B/) still works fine.
ceph health detail keeps reporting that the MDS has slow requests, and we
then have to restart the MDS to fix the issue.
How can we fix this without restarting the MDS (a restart always impacts
other users)?
Any suggestions are welcome! Thanks a ton!
From this dump_ops_in_flight output:
"description": "client_request(client.100807215:2856632 getattr
AsLsXsFs #0x200978a3326 2022-08-31T09:36:30.444927+0800
caller_id=2049, caller_gid=2049})",
"initiated_at": "2022-08-31T09:36:30.454570+0800",
"age": 17697.012491966001,
"duration": 17697.012805568,
"type_data": {
"flag_point": "dispatched",
"reqid": "client. 100807215:2856632",
"op_type": "client_request",
"client_info":
"client": "client.100807215",
"tid": 2856632
"events":
"time": "2022-08-31T09:36:30.454570+0800",
"event": "initiated"
"time": "2022-08-31T09:36:30.454572+0800",
"event": "throttled"
"time": "2022-08-31T09:36:30.454570+0800",
"event": "header read"
"time": "2022-08-31T09:36:30.454580+0800",
'event": "all_read"
"time": "2022-08-31T09:36:30.454604+0800",
"event": "dispatched"
}
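For completeness, a dump like the one above can be pulled from the running MDS without restarting it (mds.a is a placeholder for the daemon name):

    # list the requests currently stuck in the MDS, with their age and flag_point
    ceph tell mds.a dump_ops_in_flight
    # recently completed slow requests are kept as well
    ceph tell mds.a dump_historic_ops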
AFAIK there is no easy way to do this. At least we need to know why and
where it gets stuck. From the output above and the previous mail thread, it
appears to be stuck in a getattr request and sounds like a similar issue to
https://tracker.ceph.com/issues/50840.
If it's not, it should be a new bug; could you create a tracker and provide
the MDS-side debug logs?
Maybe you can try to flush the mds journal to see what happens?
- Xiubo
Thanks,
Xiong
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx