Re: Mds daemon damaged - assert failed

I am running 18.2.2, which apparently is the latest version available for Proxmox at this time (9/2024).

I’d rather not mess around with backporting and testing fixes at this point, since this is our “production” cluster.  If it were not a production cluster, I could possibly play around with this, given some free time. :-)
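
For anyone who does want to try the cherry-pick route Frédéric describes below, a rough sketch of what it might involve follows. This is untested; the v18.2.2 tag, the pull/55265/head fetch ref, and the generic upstream build steps are assumptions about a typical from-source Ceph build, not a verified Proxmox packaging recipe.

    git clone https://github.com/ceph/ceph.git && cd ceph
    git checkout v18.2.2                        # match the running release
    git submodule update --init --recursive
    git fetch origin pull/55265/head            # the fix referenced as [2]
    git cherry-pick FETCH_HEAD                  # picks only the PR's tip commit
    ./install-deps.sh
    ./do_cmake.sh && cd build && ninja ceph-mds

If the PR carries more than one commit, each would need to be cherry-picked, and the resulting ceph-mds binary would still have to be packaged or swapped in carefully, which is exactly the kind of work I’d rather avoid on a production cluster.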

Thank you for looking it up!

George

> On Sep 27, 2024, at 3:44 AM, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
> 
> Hi,
> 
> [2] is the fix for [1] and should be backported, right? Currently the backport fields in the tracker are not filled in, so no one knows that backports are needed.
> 
> 
> k
> 
>> On 27 Sep 2024, at 11:01, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>> 
>> Hi George,
>> 
>> Looks like you hit this one [1]. I can't find the fix [2] in the Reef release notes [3]. You'll have to cherry-pick it and build from source, or wait for it to land in the next release.
>> 
>> Regards,
>> Frédéric.
>> 
>> [1] https://tracker.ceph.com/issues/58878
>> [2] https://github.com/ceph/ceph/pull/55265
>> [3] https://docs.ceph.com/en/latest/releases/reef/#v18-2-4-reef
>> 
>> ----- Le 24 Sep 24, à 0:32, Kyriazis, George george.kyriazis@xxxxxxxxx a écrit :
>> 
>>> Hello ceph users,
>>> 
>>> I am in the unfortunate situation of having a status of “1 mds daemon damaged”.
>>> Looking at the logs, I see that the daemon died with an assert as follows:
>>> 
>>> ./src/osdc/Journaler.cc: 1368: FAILED ceph_assert(trim_to > trimming_pos)
>>> 
>>> ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12a)
>>> [0x73a83189d7d9]
>>> 2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
>>> 3: (Journaler::_trim()+0x671) [0x57235caa70b1]
>>> 4: (Journaler::_finish_write_head(int, Journaler::Header&, C_OnFinisher*)+0x171)
>>> [0x57235caaa8f1]
>>> 5: (Context::complete(int)+0x9) [0x57235c716849]
>>> 6: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
>>> 7: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
>>> 8: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
>>> 
>>>   0> 2024-09-23T14:10:26.490-0500 73a822c006c0 -1 *** Caught signal (Aborted) **
>>> in thread 73a822c006c0 thread_name:MR_Finisher
>>> 
>>> ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
>>> 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x73a83105b050]
>>> 2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x73a8310a9e2c]
>>> 3: gsignal()
>>> 4: abort()
>>> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x185)
>>> [0x73a83189d834]
>>> 6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
>>> 7: (Journaler::_trim()+0x671) [0x57235caa70b1]
>>> 8: (Journaler::_finish_write_head(int, Journaler::Header&, C_OnFinisher*)+0x171)
>>> [0x57235caaa8f1]
>>> 9: (Context::complete(int)+0x9) [0x57235c716849]
>>> 10: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
>>> 11: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
>>> 12: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>>> interpret this.
>>> 
>>> 
>>> As listed above, I am running 18.2.2 on a Proxmox cluster with a hybrid hdd/ssd
>>> setup and two CephFS filesystems.  The mds responsible for the hdd filesystem is
>>> the one that died.
>>> 
>>> Output of ceph -s follows:
>>> 
>>> root@vis-mgmt:~/bin# ceph -s
>>>   cluster:
>>>     id:     ec2c9542-dc1b-4af6-9f21-0adbcabb9452
>>>     health: HEALTH_ERR
>>>             1 filesystem is degraded
>>>             1 filesystem is offline
>>>             1 mds daemon damaged
>>>             5 pgs not scrubbed in time
>>>             1 daemons have recently crashed
>>>
>>>   services:
>>>     mon: 5 daemons, quorum vis-hsw-01,vis-skx-01,vis-clx-15,vis-clx-04,vis-icx-00 (age 6m)
>>>     mgr: vis-hsw-02(active, since 13d), standbys: vis-skx-02, vis-hsw-04, vis-clx-08, vis-clx-02
>>>     mds: 1/2 daemons up, 5 standby
>>>     osd: 97 osds: 97 up (since 3h), 97 in (since 4d)
>>>
>>>   data:
>>>     volumes: 1/2 healthy, 1 recovering; 1 damaged
>>>     pools:   14 pools, 1961 pgs
>>>     objects: 223.70M objects, 304 TiB
>>>     usage:   805 TiB used, 383 TiB / 1.2 PiB avail
>>>     pgs:     1948 active+clean
>>>              9    active+clean+scrubbing+deep
>>>              4    active+clean+scrubbing
>>>
>>>   io:
>>>     client:   86 KiB/s rd, 5.5 MiB/s wr, 64 op/s rd, 26 op/s wr
>>> 
>>> 
>>> 
>>> I tried restarting all the mds daemons, but they are all marked as “standby”.  I
>>> also tried restarting all the mons and then the mds daemons again, but that
>>> didn’t help.
>>> 
>>> Much help is appreciated!
>>> 
>>> Thank you!
>>> 
>>> George
>>> 
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 

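(A side note on the “NOTE” at the end of the quoted backtrace: with the matching binary and debug symbols installed, the library-relative offsets can usually be resolved to source lines with binutils. A minimal sketch, assuming the Debian paths shown in the trace and an installed debug-symbol package for ceph:

    addr2line -Cfe /usr/lib/ceph/libceph-common.so.2 0x29d974   # the libceph-common frame
    objdump -rdS /usr/bin/ceph-mds > ceph-mds.asm               # the dump the log note asks for

The in-binary ceph-mds addresses are runtime addresses and would additionally need the process load base to resolve.)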
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



