> From: ukernel@xxxxxxxxx
> Date: Tue, 5 Jul 2016 21:14:12 +0800
> To: kenneth.waegeman@xxxxxxxx
> CC: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: [ceph-users] mds0: Behind on trimming (58621/30)
>
> On Tue, Jul 5, 2016 at 7:56 PM, Kenneth Waegeman
> <kenneth.waegeman@xxxxxxxx> wrote:
> >
> > On 04/07/16 11:22, Kenneth Waegeman wrote:
> >>
> >> On 01/07/16 16:01, Yan, Zheng wrote:
> >>>
> >>> On Fri, Jul 1, 2016 at 6:59 PM, John Spray <jspray@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Fri, Jul 1, 2016 at 11:35 AM, Kenneth Waegeman
> >>>> <kenneth.waegeman@xxxxxxxx> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> While syncing a lot of files to cephfs, our MDS cluster went haywire:
> >>>>> the MDSes have a lot of segments behind on trimming: (58621/30)
> >>>>> Because of this the MDS cluster gets degraded. RAM usage is about 50GB.
> >>>>> The MDSes were respawning and replaying continuously, and I had to stop
> >>>>> all syncs, unmount all clients and increase the beacon_grace to keep
> >>>>> the cluster up.
> >>>>>
> >>>>> [root@mds03 ~]# ceph status
> >>>>>     cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
> >>>>>      health HEALTH_WARN
> >>>>>             mds0: Behind on trimming (58621/30)
> >>>>>      monmap e1: 3 mons at
> >>>>> {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
> >>>>>             election epoch 170, quorum 0,1,2 mds01,mds02,mds03
> >>>>>       fsmap e78658: 1/1/1 up {0=mds03=up:active}, 2 up:standby
> >>>>>      osdmap e19966: 156 osds: 156 up, 156 in
> >>>>>             flags sortbitwise
> >>>>>       pgmap v10213164: 4160 pgs, 4 pools, 253 TB data, 203 Mobjects
> >>>>>             357 TB used, 516 TB / 874 TB avail
> >>>>>                 4151 active+clean
> >>>>>                    5 active+clean+scrubbing
> >>>>>                    4 active+clean+scrubbing+deep
> >>>>>   client io 0 B/s rd, 0 B/s wr, 63 op/s rd, 844 op/s wr
> >>>>>    cache io 68 op/s promote
> >>>>>
> >>>>> Now that it is finally up again, it is trimming very slowly
> >>>>> (+-120 segments / min).
> >>>>
> >>>> Hmm, so it sounds like something was wrong that got cleared by either
> >>>> the MDS restart or the client unmount, and now it's trimming at a
> >>>> healthier rate.
> >>>>
> >>>> What client (kernel or fuse, and version)?
> >>>>
> >>>> Can you confirm that the RADOS cluster itself was handling operations
> >>>> reasonably quickly? Is your metadata pool using the same drives as
> >>>> your data? Were the OSDs saturated with IO?
> >>>>
> >>>> While the cluster was accumulating untrimmed segments, did you also
> >>>> have a "client xyz failing to advance oldest_tid" warning?
> >>>
> >>> That warning does not prevent the MDS from trimming log segments.
> >>>
> >>>> It would be good to clarify whether the MDS was trimming slowly, or
> >>>> not at all. If you can reproduce this situation, get it to a "behind
> >>>> on trimming" state, and then stop the client IO (but leave it
> >>>> mounted). See if the (x/30) number stays the same. Then, does it
> >>>> start to decrease when you unmount the client? That would indicate a
> >>>> misbehaving client.
> >>>
> >>> Behind on trimming on a single-MDS cluster should be caused either by
> >>> slow RADOS operations or by the MDS trimming too few log segments on
> >>> each tick.
> >>>
> >>> Kenneth, could you try setting mds_log_max_expiring to a large value
> >>> (such as 200)?
> >>
> >> I've set mds_log_max_expiring to 200 right now. Should I see something
> >> instantly?
> >
> > The trimming finished rather quickly, although I don't have any accurate
> > time measurements. The cluster looks to be running fine right now, but is
> > only running an incremental sync.
> > We will try with the same data again to see if it is OK now.
> >
> > Is this mds_log_max_expiring option production ready? (I don't seem to
> > find it in the documentation.)
>
> It should be safe. Setting mds_log_max_expiring to 200 does not change
> the code path.
>
> Yan, Zheng

Zheng,

Bumping this conf from 20 -> 200 seems to increase the (concurrent)
flushing load? Would you prefer to make this the default?

Xiaoxi

> > Thank you!!
> >
> > K
> >
> >> This weekend, the trimming did not continue and something happened to
> >> the cluster:
> >>
> >> mds.0.cache.dir(1000da74e85) commit error -2 v 2466977
> >> log_channel(cluster) log [ERR] : failed to commit dir 1000da74e85 object,
> >> errno -2
> >> mds.0.78429 unhandled write error (2) No such file or directory, force
> >> readonly...
> >> mds.0.cache force file system read-only
> >> log_channel(cluster) log [WRN] : force file system read-only
> >>
> >> and ceph health reported:
> >> mds0: MDS in read-only mode
> >>
> >> I restarted it and it is trimming again.
> >>
> >> Thanks again!
> >> Kenneth
> >>>
> >>> Regards
> >>> Yan, Zheng
> >>>
> >>>> John
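Applying the mds_log_max_expiring bump discussed above can be done at runtime
or persistently; a minimal sketch, assuming the admin socket is available on
the host running the active MDS and using the mds.mds03 daemon name from the
status output earlier in the thread (adjust to your own daemon name):

    # at runtime, on the node running the active MDS (admin socket)
    ceph daemon mds.mds03 config set mds_log_max_expiring 200

    # or injected from any admin node
    ceph tell mds.mds03 injectargs '--mds_log_max_expiring 200'

    # and, to survive daemon restarts, in ceph.conf:
    [mds]
    mds log max expiring = 200

Injected values are lost when the daemon restarts, so anything you intend to
keep should also go into ceph.conf.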
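To watch the trimming backlog while reproducing the problem, the journal
segment counters can be polled alongside ceph health; a rough sketch, again
assuming mds.mds03, with the mds_log counter names (seg, segtrm) taken from
the MDS perf schema rather than from this thread:

    # shows the "mds0: Behind on trimming (x/30)" detail while it persists
    ceph health detail

    # journal counters over the admin socket: 'seg' is the current number
    # of log segments, 'segtrm' the number of segments trimmed so far
    ceph daemon mds.mds03 perf dump mds_log

    # temporary safety net so the monitors do not fail over a busy MDS
    # while it replays and trims (revert once the backlog is gone)
    ceph tell mon.\* injectargs '--mds_beacon_grace 300'

The 30 in the health message is the mds_log_max_segments limit (the default
at the time), so trimming has caught up once 'seg' drops back to that order
of magnitude.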
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com