On Tue, Dec 5, 2023 at 6:34 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
> On 12/4/23 16:25, zxcs wrote:
> > Thanks a lot, Xiubo!
> >
> > We already set 'mds_bal_interval' to 0, and the slow MDS requests seem to have decreased.
> >
> > But somehow we still see the MDS complain about slow requests, and in the MDS log we can see:
> >
> > "slow request *** seconds old, received at 2023-12-04T…: internal op exportdir:mds.* currently acquired locks"
> >
> > So our question is: why do we still see "internal op exportdir"? Does any other config also need to be set to 0? Could you please shed some light on which config we need to set?
>
> IMO, this should be enough.
>
> Venky,
>
> Did I miss something here?

You missed nothing. Setting `mds_bal_interval = 0` disables the balancer. I guess there are in-progress exports that would take some time to back off, and the slow ops should eventually get cleaned up. I'd say wait a bit and see if the slow requests resolve by themselves.

FWIW, there was a feature request a while back to cancel an ongoing export. We should prioritize having that.
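
For reference, a minimal sketch of the commands discussed in this thread (this assumes the centralized config database is in use, as in the `ceph config set` example quoted below):

    # Disable the MDS balancer entirely (0 means the balancer never runs):
    ceph config set mds mds_bal_interval 0

    # Confirm the value actually took effect:
    ceph config get mds mds_bal_interval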
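
For the directory pinning approach suggested further down in the thread, export pins are set via an extended attribute on a mounted filesystem; the mount path below is illustrative:

    # Pin a directory tree to MDS rank 0 so the balancer never migrates it:
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/shared

    # A value of -1 removes the pin (the directory follows its parent again):
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/shared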
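
And to watch whether the in-flight export ops are actually draining, they can be dumped through the MDS admin socket on the node hosting the daemon; the daemon name here is illustrative:

    # List the current slow/in-flight requests on one MDS:
    ceph daemon mds.ceph-node1 dump_ops_in_flight

    # The cluster-wide slow request warnings are summarized in:
    ceph health detail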
> Thanks
>
> - Xiubo
>
> > Thanks,
> > xz
> >
> >> On Nov 27, 2023, at 13:19, Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> >>
> >> On 11/27/23 13:12, zxcs wrote:
> >>> Currently we are using `ceph config set mds mds_bal_interval 3600` to set a fixed interval (1 hour).
> >>>
> >>> We also have a question about how to disable balancing for multiple active MDS daemons.
> >>>
> >>> That is, we want to enable multiple active MDS daemons (to improve throughput) with no balancing between them.
> >>>
> >>> If we set mds_bal_interval to a big number, can we avoid this issue?
> >>
> >> You can just set 'mds_bal_interval' to 0.
> >>
> >>> Thanks,
> >>> xz
> >>>
> >>>> On Nov 27, 2023, at 10:56, Ben <ruidong.gao@xxxxxxxxx> wrote:
> >>>>
> >>>> With the same MDS configuration, we see exactly the same (problem, logs, and solution) with 17.2.5, constantly happening again and again at intervals of a couple of days. The MDS servers get stuck somewhere, yet ceph status reports no issue. We need to restart some of the MDS daemons (if not all of them) to restore them. Hopefully this can be fixed soon, or the docs updated with a warning about the balancer's use in production environments.
> >>>>
> >>>> thanks and regards
> >>>>
> >>>> Xiubo Li <xiubli@xxxxxxxxxx> wrote on Thu, Nov 23, 2023 at 15:47:
> >>>>
> >>>>> On 11/23/23 11:25, zxcs wrote:
> >>>>>> Thanks a ton, Xiubo!
> >>>>>>
> >>>>>> It does not disappear, even after we unmounted the ceph directory on these two old-OS nodes.
> >>>>>>
> >>>>>> After dumping the ops in flight, we can see some requests, and the earliest complain "failed to authpin, subtree is being exported".
> >>>>>>
> >>>>>> How can we avoid this? Would you please shed some light here?
> >>>>>
> >>>>> Okay, as Frank mentioned, you can try to disable the balancer by pinning the directories. As I remember, the balancer is buggy.
> >>>>>
> >>>>> You can also raise a ceph tracker issue and provide the debug logs if you have them.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> - Xiubo
> >>>>>
> >>>>>> Thanks,
> >>>>>> xz
> >>>>>>
> >>>>>>> On Nov 22, 2023, at 19:44, Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On 11/22/23 16:02, zxcs wrote:
> >>>>>>>> Hi, experts,
> >>>>>>>>
> >>>>>>>> We are using CephFS 16.2.* with multiple active MDS daemons, and recently we have had two nodes mounted with ceph-fuse due to their old OS.
> >>>>>>>>
> >>>>>>>> One node runs a Python script with `glob.glob(path)`, and another client is doing a `cp` operation on the same path. Then we see some logs about `mds slow request`, and the logs complain "failed to authpin, subtree is being exported"; then we need to restart the MDS.
> >>>>>>>>
> >>>>>>>> Our question is: is there any deadlock? How can we avoid this, and how can we fix it without restarting the MDS (it will influence other users)?
> >>>>>>>
> >>>>>>> BTW, won't the slow requests disappear by themselves later?
> >>>>>>>
> >>>>>>> It looks like the exporting is slow or there are too many exports going on.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>> - Xiubo
> >>>>>>>
> >>>>>>>> Thanks a ton!
> >>>>>>>>
> >>>>>>>> xz

--
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx