Re: cephfs health warn

Yes, I am. 8 active + 2 standby, no subtree pinning. What if I restart the
MDS daemons with trimming issues? I'm trying to figure out what would happen
after a restart.
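For the record, the restart I have in mind would be done with cephadm, roughly
like this (assuming the file system is named code-store, as the daemon names in
the health output below suggest; daemon and rank names are taken from that
output). My understanding is that a standby would take over the rank and would
first have to replay this rather large journal, so it could take a while:

    # see which daemon currently holds the affected rank
    ceph fs status code-store
    # either restart that daemon's container ...
    ceph orch daemon restart mds.code-store.host20w.bfoftp
    # ... or fail the rank explicitly so a standby takes it over
    ceph mds fail code-store:4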

On Tue, Oct 3, 2023 at 12:39, Venky Shankar <vshankar@xxxxxxxxxx> wrote:

> Hi Ben,
>
> Are you using multimds without subtree pinning?
>
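No pinning so far. If pinning is the way to go, I assume it would be applied
from a client mount with something like the following (the mount point and
directory name here are just placeholders):

    # pin a top-level directory to MDS rank 1; -v -1 would remove the pin again
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/some-top-level-dir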
> On Tue, Oct 3, 2023 at 10:00 AM Ben <ruidong.gao@xxxxxxxxx> wrote:
> >
> > Dear cephers:
> > More log captures (see below) show the full segment list: more than 30,000
> > segments are stuck waiting to be trimmed, and the count keeps growing over
> > time. Any ideas on how to get out of this?
> >
> > Thanks,
> > Ben
> >
> >
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195341004/893374309813, 180 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195341184/893386318445, 145 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195341329/893386757388, 1024 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195342353/893388361174, 1024 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195343377/893389870480, 790 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195344167/893390955408, 1024 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195345191/893392321470, 1024 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195346215/893393752928, 1024 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195347239/893395131457, 2 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195347241/893395212055, 1024 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195348265/893396582755, 1024 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195349289/893398132191, 860 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expiring segment 195350149/893399338619, 42 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195350192/893408004655, 33 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195350226/893412331017, 23 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195350249/893416563419, 20 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195350269/893420702085, 244 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195350513/893424694857, 74 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195350587/893428947395, 843 events
> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
> > expired segment 195351430/893432893900, 1019 events
> > .
> > . (all expired items abbreviated)
> > .
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
> > expired segment 216605661/827226016068, 100 events
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
> > expired segment 216605761/827230263164, 153 events
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
> > expired segment 216605914/827234408294, 35 events
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
> > expired segment 216605949/827238527911, 1024 events
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
> > expired segment 216606973/827241813316, 344 events
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
> > expired segment 216607317/827242580233, 1024 events
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 6 mds.3.journal
> > LogSegment(216608341/827244781542).try_to_expire
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 4 mds.3.sessionmap
> > save_if_dirty: writing 0
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 6 mds.3.journal
> > LogSegment(216608341/827244781542).try_to_expire success
> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log _expired
> > segment 216608341/827244781542, 717 events
> >
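A quick way to watch whether those stuck segments ever move (without waiting
for the health counter to refresh) seems to be the mds_log section of the perf
counters; a rough sketch, run on the host where the daemon lives (daemon name
taken from the health output below):

    # enter the MDS container, then dump the journal counters:
    # "seg" should match num_segments, "ev" is the number of journaled events
    cephadm enter --name mds.code-store.host20w.bfoftp
    ceph daemon mds.code-store.host20w.bfoftp perf dump mds_log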
> > On Wed, Sep 27, 2023 at 23:53, Ben <ruidong.gao@xxxxxxxxx> wrote:
> >
> > > Some further investigation into the three MDSs that are behind on trimming:
> > > logs captured over two days show that some log segments are stuck in the
> > > trimming process. It looks like a bug in log segment trimming? Any
> > > thoughts?
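By the way, the level 5/10 "mds.*.log" trim messages quoted in this thread
typically only show up once debug_mds is raised (the default is 1/5); a minimal
sketch of toggling that at runtime on one daemon, name again taken from the
health output further down:

    # raise MDS debug logging, capture a few trim cycles, then drop it back down
    ceph tell mds.code-store.host20w.bfoftp config set debug_mds 10
    # ... wait for trim attempts ...
    ceph tell mds.code-store.host20w.bfoftp config set debug_mds 1/5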
> > > ==========log capture============
> > >
> > > 9/26:
> > >
> > > debug 2023-09-26T16:50:59.004+0000 7fc74d95e700 10 mds.3.log
> > > _trim_expired_segments waiting for 197465903/720757956586 to expire
> > >
> > >
> > > debug 2023-09-26T16:45:22.788+0000 7f9c253a8700 10 mds.4.log
> > > _trim_expired_segments waiting for 195341004/893374309813 to expire
> > >
> > >
> > > debug 2023-09-26T14:11:37.681+0000 7f8f047cf700 10 mds.5.log
> > > _trim_expired_segments waiting for 189942575/642272326541 to expire
> > >
> > >
> > > 9/27:
> > >
> > > debug 2023-09-27T15:35:04.841+0000 7fc74d95e700 10 mds.3.log
> > > _trim_expired_segments waiting for 197465903/720757956586 to expire
> > >
> > >
> > > debug 2023-09-27T15:18:51.442+0000 7f9c29bb1700 10 mds.4.log
> > > _trim_expired_segments waiting for 195341004/893374309813 to expire
> > >
> > >
> > > debug 2023-09-27T15:27:33.024+0000 7f8f047cf700 10 mds.5.log
> > > _trim_expired_segments waiting for 189942575/642272326541 to expire
> > >
> > >
> > > Regards,
> > >
> > > Ben
> > >
> > > On Tue, Sep 26, 2023 at 20:31, Ben <ruidong.gao@xxxxxxxxx> wrote:
> > >
> > >> Hi,
> > >> See below for details of the warnings.
> > >> The cluster is running 17.2.5 and the warnings have been around for a while.
> > >> One concern of mine is that num_segments keeps growing over time, and the
> > >> number of clients flagged by MDS_CLIENT_OLDEST_TID has increased from 18 to
> > >> 25 as well. The nodes run kernel 4.19.0-91.82.42.uelc20.x86_64, so it looks
> > >> like a bug in the client library. Rebooting an affected node seems to fix
> > >> it, but only for a short period of time. Any suggestions from the community
> > >> for fixing this?
> > >>
> > >> Thanks,
> > >> Ben
> > >>
> > >>
> > >> [root@8cd2c0657c77 /]# ceph health detail
> > >>
> > >> HEALTH_WARN 6 hosts fail cephadm check; 2 clients failing to respond to
> > >> capability release; 25 clients failing to advance oldest client/flush tid;
> > >> 3 MDSs report slow requests; 3 MDSs behind on trimming
> > >>
> > >> [WRN] CEPHADM_HOST_CHECK_FAILED: 6 hosts fail cephadm check
> > >>
> > >>     host host15w (192.168.31.33) failed check: Unable to reach remote
> > >> host host15w. Process exited with non-zero exit status 1
> > >>
> > >>     host host20w (192.168.31.38) failed check: Unable to reach remote
> > >> host host20w. Process exited with non-zero exit status 1
> > >>
> > >>     host host19w (192.168.31.37) failed check: Unable to reach remote
> > >> host host19w. Process exited with non-zero exit status 1
> > >>
> > >>     host host17w (192.168.31.35) failed check: Unable to reach remote
> > >> host host17w. Process exited with non-zero exit status 1
> > >>
> > >>     host host18w (192.168.31.36) failed check: Unable to reach remote
> > >> host host18w. Process exited with non-zero exit status 1
> > >>
> > >>     host host16w (192.168.31.34) failed check: Unable to reach remote
> > >> host host16w. Process exited with non-zero exit status 1
> > >>
> > >> [WRN] MDS_CLIENT_LATE_RELEASE: 2 clients failing to respond to capability
> > >> release
> > >>
> > >>     mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to
> > >> respond to capability release client_id: 460983
> > >>
> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to respond to
> > >> capability release client_id: 460983
> > >>
> > >> [WRN] MDS_CLIENT_OLDEST_TID: 25 clients failing to advance oldest
> > >> client/flush tid
> > >>
> > >>     mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to
> > >> advance its oldest client/flush tid.  client_id: 460983
> > >>
> > >>     mds.code-store.host18w.fdsqff(mds.1): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 460226
> > >>
> > >>     mds.code-store.host18w.fdsqff(mds.1): Client k8s-node32 failing to
> > >> advance its oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host15w.reolpx(mds.5): Client k8s-node34 failing to
> > >> advance its oldest client/flush tid.  client_id: 460226
> > >>
> > >>     mds.code-store.host15w.reolpx(mds.5): Client k8s-node32 failing to
> > >> advance its oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host15w.reolpx(mds.5): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 460983
> > >>
> > >>     mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node34 failing to
> > >> advance its oldest client/flush tid.  client_id: 460226
> > >>
> > >>     mds.code-store.host18w.rtyvdy(mds.7): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node36 failing to
> > >> advance its oldest client/flush tid.  client_id: 460983
> > >>
> > >>     mds.code-store.host17w.kcdopb(mds.2): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host17w.kcdopb(mds.2): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 460983
> > >>
> > >>     mds.code-store.host17w.kcdopb(mds.2): Client k8s-node34 failing to
> > >> advance its oldest client/flush tid.  client_id: 460226
> > >>
> > >>     mds.code-store.host17w.kcdopb(mds.2): Client k8s-node24 failing to
> > >> advance its oldest client/flush tid.  client_id: 12072730
> > >>
> > >>     mds.code-store.host20w.bfoftp(mds.4): Client k8s-node32 failing to
> > >> advance its oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host20w.bfoftp(mds.4): Client k8s-node36 failing to
> > >> advance its oldest client/flush tid.  client_id: 460983
> > >>
> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node24 failing to
> > >> advance its oldest client/flush tid.  client_id: 12072730
> > >>
> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node34 failing to
> > >> advance its oldest client/flush tid.  client_id: 460226
> > >>
> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 460983
> > >>
> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 460983
> > >>
> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 460226
> > >>
> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to advance its
> > >> oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host17w.pdziet(mds.0): Client k8s-node32 failing to
> > >> advance its oldest client/flush tid.  client_id: 239797
> > >>
> > >>     mds.code-store.host17w.pdziet(mds.0): Client k8s-node34 failing to
> > >> advance its oldest client/flush tid.  client_id: 460226
> > >>
> > >>     mds.code-store.host17w.pdziet(mds.0): Client k8s-node36 failing to
> > >> advance its oldest client/flush tid.  client_id: 460983
> > >>
> > >> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
> > >>
> > >>     mds.code-store.host15w.reolpx(mds.5): 4 slow requests are blocked > 5 secs
> > >>
> > >>     mds.code-store.host20w.bfoftp(mds.4): 6 slow requests are blocked > 5 secs
> > >>
> > >>     mds.code-store.host16w.vucirx(mds.3): 97 slow requests are blocked > 5 secs
> > >>
> > >> [WRN] MDS_TRIM: 3 MDSs behind on trimming
> > >>
> > >>     mds.code-store.host15w.reolpx(mds.5): Behind on trimming (25831/128)
> > >> max_segments: 128, num_segments: 25831
> > >>
> > >>     mds.code-store.host20w.bfoftp(mds.4): Behind on trimming (27605/128)
> > >> max_segments: 128, num_segments: 27605
> > >>
> > >>     mds.code-store.host16w.vucirx(mds.3): Behind on trimming (28676/128)
> > >> max_segments: 128, num_segments: 28676
> > >>
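On the stuck client/flush tids: a rough way to inspect (and, as a last resort,
evict) the sessions behind the client IDs listed above (460983, 460226, 239797,
12072730) would be something like the following; note that eviction blocklists
the client by default, so the CephFS mount on that node would have to be
remounted afterwards:

    # list sessions on one of the affected ranks (JSON; look for the ids above)
    ceph tell mds.code-store.host18w.fdsqff session ls
    # last resort: evict one session; this blocklists the client by default
    ceph tell mds.code-store.host18w.fdsqff client evict id=460983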
> > >>
>
>
>
> --
> Cheers,
> Venky
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



