Re: cephfs health warn

Hi Ben,

On Tue, Oct 3, 2023 at 8:56 PM Ben <ruidong.gao@xxxxxxxxx> wrote:
>
> Yes, I am: 8 active + 2 standby, no subtree pinning. What happens if I restart the mds with trimming issues? I'm trying to figure out what restarting would actually do.

We have come across instances in the past where multimds without
subtree pinning can lead to accumulation of log segments which then
leads to trim warnings. This happens due to the default mds balancer
misbehaving. We have a change that's pending merge (and backport)
which switches off the default balancer for this very reason.

        https://github.com/ceph/ceph/pull/52196

I'd suggest using a single active mds, or multimds with subtree pinning.
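
For reference, a rough sketch of both options (the fs name "cephfs", the
mount path and directory names below are placeholders for your setup):

        # Option 1: drop back to a single active MDS (standbys remain for failover)
        ceph fs set cephfs max_mds 1

        # Option 2: keep multiple active MDS, but pin top-level directories to
        # specific ranks from a client mount so the balancer no longer migrates them
        setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/project-a
        setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/project-b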

>
> Venky Shankar <vshankar@xxxxxxxxxx> 于2023年10月3日周二 12:39写道:
>>
>> Hi Ben,
>>
>> Are you using multimds without subtree pinning?
>>
>> On Tue, Oct 3, 2023 at 10:00 AM Ben <ruidong.gao@xxxxxxxxx> wrote:
>> >
>> > Dear cephers:
>> > More log captures (see below) show the full segment list: more than 30000
>> > segments stuck waiting to be trimmed, and growing over time. Any ideas how to get out of this?
>> >
>> > Thanks,
>> > Ben
>> >
>> >
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195341004/893374309813, 180 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195341184/893386318445, 145 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195341329/893386757388, 1024 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195342353/893388361174, 1024 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195343377/893389870480, 790 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195344167/893390955408, 1024 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195345191/893392321470, 1024 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195346215/893393752928, 1024 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195347239/893395131457, 2 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195347241/893395212055, 1024 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195348265/893396582755, 1024 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195349289/893398132191, 860 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expiring segment 195350149/893399338619, 42 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195350192/893408004655, 33 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195350226/893412331017, 23 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195350249/893416563419, 20 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195350269/893420702085, 244 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195350513/893424694857, 74 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195350587/893428947395, 843 events
>> > debug 2023-09-30T14:34:14.557+0000 7f9c29bb1700 5 mds.4.log trim already
>> > expired segment 195351430/893432893900, 1019 events
>> > .
>> > . (all expired items abbreviated)
>> > .
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
>> > expired segment 216605661/827226016068, 100 events
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
>> > expired segment 216605761/827230263164, 153 events
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
>> > expired segment 216605914/827234408294, 35 events
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
>> > expired segment 216605949/827238527911, 1024 events
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
>> > expired segment 216606973/827241813316, 344 events
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log trim already
>> > expired segment 216607317/827242580233, 1024 events
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 6 mds.3.journal LogSegment(
>> > 216608341/827244781542).try_to_expire
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 4 mds.3.sessionmap
>> > save_if_dirty: writing 0
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 6 mds.3.journal LogSegment(
>> > 216608341/827244781542).try_to_expire success
>> > debug 2023-09-30T15:10:56.521+0000 7fc752167700 5 mds.3.log _expired
>> > segment 216608341/827244781542, 717 events
>> >
>> > Ben <ruidong.gao@xxxxxxxxx> 于2023年9月27日周三 23:53写道:
>> >
>> > > Some further investigation into the three MDSs that are behind on trimming:
>> > > logs captured over two days show that some log segments are stuck in the
>> > > trimming process. It looks like a bug in log segment trimming? Any
>> > > thoughts?
>> > > ==========log capture============
>> > >
>> > > 9/26:
>> > >
>> > > debug 2023-09-26T16:50:59.004+0000 7fc74d95e700 10 mds.3.log
>> > > _trim_expired_segments waiting for 197465903/720757956586 to expire
>> > >
>> > >
>> > > debug 2023-09-26T16:45:22.788+0000 7f9c253a8700 10 mds.4.log
>> > > _trim_expired_segments waiting for 195341004/893374309813 to expire
>> > >
>> > >
>> > > debug 2023-09-26T14:11:37.681+0000 7f8f047cf700 10 mds.5.log
>> > > _trim_expired_segments waiting for 189942575/642272326541 to expire
>> > >
>> > >
>> > > 9/27:
>> > >
>> > > debug 2023-09-27T15:35:04.841+0000 7fc74d95e700 10 mds.3.log
>> > > _trim_expired_segments waiting for 197465903/720757956586 to expire
>> > >
>> > >
>> > > debug 2023-09-27T15:18:51.442+0000 7f9c29bb1700 10 mds.4.log
>> > > _trim_expired_segments waiting for 195341004/893374309813 to expire
>> > >
>> > >
>> > > debug 2023-09-27T15:27:33.024+0000 7f8f047cf700 10 mds.5.log
>> > > _trim_expired_segments waiting for 189942575/642272326541 to expire
>> > >
>> > >
>> > > Regards,
>> > >
>> > > Ben
>> > >
>> > > Ben <ruidong.gao@xxxxxxxxx> 于2023年9月26日周二 20:31写道:
>> > >
>> > >> Hi,
>> > >> See below for details of the warnings.
>> > >> The cluster is running 17.2.5 and the warnings have been around for a while.
>> > >> One concern of mine is num_segments growing over time. The number of clients
>> > >> flagged by MDS_CLIENT_OLDEST_TID has increased from 18 to 25 as well. The
>> > >> nodes run kernel 4.19.0-91.82.42.uelc20.x86_64. It looks like a bug in the
>> > >> client library, and rebooting the affected nodes only fixes it for a short
>> > >> period of time. Any suggestions from the community for fixing this?
>> > >>
>> > >> Thanks,
>> > >> Ben
>> > >>
>> > >>
>> > >> [root@8cd2c0657c77 /]# ceph health detail
>> > >>
>> > >> HEALTH_WARN 6 hosts fail cephadm check; 2 clients failing to respond to
>> > >> capability release; 25 clients failing to advance oldest client/flush tid;
>> > >> 3 MDSs report slow requests; 3 MDSs behind on trimming
>> > >>
>> > >> [WRN] CEPHADM_HOST_CHECK_FAILED: 6 hosts fail cephadm check
>> > >>
>> > >>     host host15w (192.168.31.33) failed check: Unable to reach remote
>> > >> host host15w. Process exited with non-zero exit status 1
>> > >>
>> > >>     host host20w (192.168.31.38) failed check: Unable to reach remote
>> > >> host host20w. Process exited with non-zero exit status 1
>> > >>
>> > >>     host host19w (192.168.31.37) failed check: Unable to reach remote
>> > >> host host19w. Process exited with non-zero exit status 1
>> > >>
>> > >>     host host17w (192.168.31.35) failed check: Unable to reach remote
>> > >> host host17w. Process exited with non-zero exit status 1
>> > >>
>> > >>     host host18w (192.168.31.36) failed check: Unable to reach remote
>> > >> host host18w. Process exited with non-zero exit status 1
>> > >>
>> > >>     host host16w (192.168.31.34) failed check: Unable to reach remote
>> > >> host host16w. Process exited with non-zero exit status 1
>> > >>
>> > >> [WRN] MDS_CLIENT_LATE_RELEASE: 2 clients failing to respond to capability
>> > >> release
>> > >>
>> > >>     mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to
>> > >> respond to capability release client_id: 460983
>> > >>
>> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to respond to
>> > >> capability release client_id: 460983
>> > >>
>> > >> [WRN] MDS_CLIENT_OLDEST_TID: 25 clients failing to advance oldest
>> > >> client/flush tid
>> > >>
>> > >>     mds.code-store.host18w.fdsqff(mds.1): Client k8s-node36 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460983
>> > >>
>> > >>     mds.code-store.host18w.fdsqff(mds.1): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 460226
>> > >>
>> > >>     mds.code-store.host18w.fdsqff(mds.1): Client k8s-node32 failing to
>> > >> advance its oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host15w.reolpx(mds.5): Client k8s-node34 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460226
>> > >>
>> > >>     mds.code-store.host15w.reolpx(mds.5): Client k8s-node32 failing to
>> > >> advance its oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host15w.reolpx(mds.5): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 460983
>> > >>
>> > >>     mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node34 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460226
>> > >>
>> > >>     mds.code-store.host18w.rtyvdy(mds.7): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host18w.rtyvdy(mds.7): Client k8s-node36 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460983
>> > >>
>> > >>     mds.code-store.host17w.kcdopb(mds.2): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host17w.kcdopb(mds.2): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 460983
>> > >>
>> > >>     mds.code-store.host17w.kcdopb(mds.2): Client k8s-node34 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460226
>> > >>
>> > >>     mds.code-store.host17w.kcdopb(mds.2): Client k8s-node24 failing to
>> > >> advance its oldest client/flush tid.  client_id: 12072730
>> > >>
>> > >>     mds.code-store.host20w.bfoftp(mds.4): Client k8s-node32 failing to
>> > >> advance its oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host20w.bfoftp(mds.4): Client k8s-node36 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460983
>> > >>
>> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node24 failing to
>> > >> advance its oldest client/flush tid.  client_id: 12072730
>> > >>
>> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client k8s-node34 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460226
>> > >>
>> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host19w.ywrmiz(mds.6): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 460983
>> > >>
>> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 460983
>> > >>
>> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 460226
>> > >>
>> > >>     mds.code-store.host16w.vucirx(mds.3): Client  failing to advance its
>> > >> oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host17w.pdziet(mds.0): Client k8s-node32 failing to
>> > >> advance its oldest client/flush tid.  client_id: 239797
>> > >>
>> > >>     mds.code-store.host17w.pdziet(mds.0): Client k8s-node34 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460226
>> > >>
>> > >>     mds.code-store.host17w.pdziet(mds.0): Client k8s-node36 failing to
>> > >> advance its oldest client/flush tid.  client_id: 460983
>> > >>
>> > >> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
>> > >>
>> > >>     mds.code-store.host15w.reolpx(mds.5): 4 slow requests are blocked >
>> > >> 5 secs
>> > >>
>> > >>     mds.code-store.host20w.bfoftp(mds.4): 6 slow requests are blocked >
>> > >> 5 secs
>> > >>
>> > >>     mds.code-store.host16w.vucirx(mds.3): 97 slow requests are blocked >
>> > >> 5 secs
>> > >>
>> > >> [WRN] MDS_TRIM: 3 MDSs behind on trimming
>> > >>
>> > >>     mds.code-store.host15w.reolpx(mds.5): Behind on trimming (25831/128)
>> > >> max_segments: 128, num_segments: 25831
>> > >>
>> > >>     mds.code-store.host20w.bfoftp(mds.4): Behind on trimming (27605/128)
>> > >> max_segments: 128, num_segments: 27605
>> > >>
>> > >>     mds.code-store.host16w.vucirx(mds.3): Behind on trimming (28676/128)
>> > >> max_segments: 128, num_segments: 28676
>> > >>
>> > >>
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>>
>>
>> --
>> Cheers,
>> Venky
>>


-- 
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



