OK, we will create the ticket.

Eugen Block - the ceph tell command needs to communicate with a running MDS daemon, but it has crashed. So the only information I have is that it is impossible to get anything from the daemon:

ceph tell mds.0 damage ls
Error ENOENT: problem getting command descriptions from mds.0

---
Best regards,
Alexey Gerasimov
System Manager
www.opencascade.com
www.capgemini.com

-----Original Message-----
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: Monday, April 22, 2024 2:21 AM
To: Alexey GERASIMOV <alexey.gerasimov@xxxxxxxxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re: MDS crash

Hi Alexey,

This looks like a new issue to me. Please create a tracker for it and provide the detailed call trace there.

Thanks
- Xiubo

On 4/19/24 05:42, alexey.gerasimov@xxxxxxxxxxxxxxx wrote:
> Dear colleagues, I hope that somebody can help us.
>
> The starting point: a Ceph cluster v15.2 (installed and controlled by Proxmox) with 3 nodes based on physical servers rented from a cloud provider. CephFS is installed as well.
>
> Yesterday we discovered that some of our applications had stopped working. During the investigation we found that we have a problem with Ceph, more precisely with CephFS - the MDS daemons suddenly crashed. We tried to restart them and found that they crash again immediately after starting. The crash information:
>
> 2024-04-17T17:47:42.841+0000 7f959ced9700  1 mds.0.29134 recovery_done -- successful recovery!
> 2024-04-17T17:47:42.853+0000 7f959ced9700  1 mds.0.29134 active_start
> 2024-04-17T17:47:42.881+0000 7f959ced9700  1 mds.0.29134 cluster recovered.
> 2024-04-17T17:47:43.825+0000 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700 time 2024-04-17T17:47:43.831243+0000
> ./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
>
> Over the next hours we read tons of articles, studied the documentation, and checked the overall state of the Ceph cluster with various diagnostic commands - but didn't find anything wrong. In the evening we decided to upgrade to v16, and finally to v17.2.7. Unfortunately, that didn't solve the problem: the MDS daemons continue to crash with the same error. The only difference we found is "1 MDSs report damaged metadata" in the output of ceph -s - see it below.
>
> I supposed it might be a well-known bug, but couldn't find a matching one on https://tracker.ceph.com - there are several bugs associated with the file OpenFileTable.cc, but none related to ceph_assert(count > 0).
>
> We also checked the source code of OpenFileTable.cc. Here is a fragment of it, from the function OpenFileTable::_journal_finish:
>
>   int omap_idx = anchor.omap_idx;
>   unsigned& count = omap_num_items.at(omap_idx);
>   ceph_assert(count > 0);
>
> So we guess that the object map is empty for some object in Ceph, and that is unexpected behavior. But again, we found nothing wrong in our cluster...
>
> Next, we started following the https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article - we tried to reset the journal (despite it having been OK the whole time) and wiped the sessions using the cephfs-table-tool all reset session command. No result... Now I have decided to continue following this article and run the cephfs-data-scan scan_extents command; it is running right now. But I doubt it will solve the issue, since there is no problem with our objects in Ceph.
>
> Is this a new bug? Or something else? Any idea is welcome!
>
> The important outputs:
>
> ----- ceph -s
>   cluster:
>     id:     4cd1c477-c8d0-4855-a1f1-cb71d89427ed
>     health: HEALTH_ERR
>             1 MDSs report damaged metadata
>             insufficient standby MDS daemons available
>             83 daemons have recently crashed
>             3 mgr modules have recently crashed
>
>   services:
>     mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
>     mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
>     mds: 1/1 daemons up
>     osd: 18 osds: 18 up (since 22h), 18 in (since 29h)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   5 pools, 289 pgs
>     objects: 29.72M objects, 5.6 TiB
>     usage:   21 TiB used, 47 TiB / 68 TiB avail
>     pgs:     287 active+clean
>              2   active+clean+scrubbing+deep
>
>   io:
>     client: 2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr
>
> ----- ceph fs dump
> e29480
> enable_multiple, ever_enabled_multiple: 0,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 1
>
> Filesystem 'cephfs' (1)
> fs_name cephfs
> epoch   29480
> flags   12 joinable allow_snaps allow_multimds_snaps
> created 2022-11-25T15:56:08.507407+0000
> modified        2024-04-18T16:52:29.970504+0000
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> required_client_features        {}
> last_failure    0
> last_failure_osd_epoch  14728
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in      0
> up      {0=156636152}
> failed
> damaged
> stopped
> data_pools      [5]
> metadata_pool   6
> inline_data     disabled
> balancer
> standby_count_wanted    1
>
> [mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since 2024-04-18T16:52:29.970479+0000 addr [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat {c=[1],r=[1],i=[7ff]}]
>
> ----- cephfs-journal-tool --rank=cephfs:0 journal inspect
> Overall journal integrity: OK
>
> ----- ceph pg dump summary
> version 41137
> stamp 2024-04-18T21:17:59.133536+0000
> last_osdmap_epoch 0
> last_pg_scan 0
> PG_STAT  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG      DISK_LOG
> sum      29717605  0                   0         0          0        6112544251872  13374192956  28493480    1806575  1806575
> OSD_STAT  USED    AVAIL   USED_RAW  TOTAL
> sum       21 TiB  47 TiB  21 TiB    68 TiB
>
> ----- ceph pg dump pools
> POOLID  OBJECTS   MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES          OMAP_BYTES*  OMAP_KEYS*  LOG     DISK_LOG
> 8       31771     0                   0         0          0        131337887503   2482         140         401246  401246
> 7       839707    0                   0         0          0        3519034650971  736          61          399328  399328
> 6       1319576   0                   0         0          0        421044421      13374189738  28493279    206749  206749
> 5       27526539  0                   0         0          0        2461702171417  0            0           792165  792165
> 2       12        0                   0         0          0        48497560       0            0           6991    6991
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx