Hi Eugen, thanks for your input. I can't query the hung MDS, but the others report this:

ceph tell mds.ceph-14 perf dump throttle-write_buf_throttle
{
    "throttle-write_buf_throttle": {
        "val": 0,
        "max": 3758096384,
        "get_started": 0,
        "get": 5199,
        "get_sum": 566691,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 5199,
        "take": 0,
        "take_sum": 0,
        "put": 719,
        "put_sum": 566691,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    }
}

You might be on to something; we are also trying to find out where this limit comes from. Please keep us posted.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, January 20, 2025 11:12 AM
To: ceph-users@xxxxxxx
Subject: Re: MDS hung in purge_stale_snap_data after populating cache

Hi Frank,

are you able to query the daemon while it's trying to purge the snaps?

pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
...
    "max": 3758096384,

I don't know yet where that "max" setting comes from, but I'll keep looking.

Zitat von Frank Schilder <frans@xxxxxx>:

> Hi all,
>
> we tracked the deadlock down to line
> https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
> in Journaler::append_entry(bufferlist& bl):
>
>   // append
>   size_t delta = bl.length() + journal_stream.get_envelope_size();
>   // write_buf space is nearly full
>   if (!write_buf_throttle.get_or_fail(delta)) {
>     l.unlock();
>     ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
>     write_buf_throttle.get(delta);   // <<<<<<<<< The MDS is stuck here <<<<<<<<<
>     l.lock();
>   }
>   ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
>
> This is indicated by the last message in the log before the lock-up, which reads
>
>   mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
>
> and is generated by the line just above the call to write_buf_throttle.get(delta).
> All log messages before that start with "write_buf_throttle get, delta", which
> means those calls did not enter the if-statement.
>
> The obvious question is: which parameter influences the maximum size of the
> variable Journaler::write_buffer
> (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306)
> in the Journaler class definition? Increasing this limit should get us past the
> deadlock.
>
> Thanks for your help and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
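
To make the mechanism concrete, here is a minimal standalone sketch of the blocking
behaviour in the quoted Journaler::append_entry excerpt above. It is not Ceph's
actual Throttle implementation; the ByteThrottle class, the tiny 256-byte max and
the flusher thread are invented purely for illustration:

// Simplified model of a byte-count throttle: get_or_fail() refuses without
// blocking, get() parks the caller until put() frees enough space.
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>

class ByteThrottle {
  std::mutex m;
  std::condition_variable cv;
  uint64_t val = 0;        // bytes currently held (cf. perf counter "val")
  const uint64_t max;      // ceiling (cf. perf counter "max")
public:
  explicit ByteThrottle(uint64_t max_bytes) : max(max_bytes) {}

  // Non-blocking attempt, analogous to write_buf_throttle.get_or_fail(delta).
  bool get_or_fail(uint64_t delta) {
    std::lock_guard<std::mutex> l(m);
    if (val + delta > max) return false;
    val += delta;
    return true;
  }

  // Blocking acquire, analogous to write_buf_throttle.get(delta).
  void get(uint64_t delta) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return val + delta <= max; });
    val += delta;
  }

  // Release bytes once they have been flushed, analogous to put(delta).
  void put(uint64_t delta) {
    std::lock_guard<std::mutex> l(m);
    val -= delta;
    cv.notify_all();
  }
};

int main() {
  ByteThrottle throttle(256);          // tiny max so the demo blocks quickly

  throttle.get(200);                   // fits immediately, val = 200

  // "Flusher" thread: frees the first 200 bytes after a short delay, the way
  // a journal flusher would after writing them out.
  std::thread flusher([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    throttle.put(200);
  });

  if (!throttle.get_or_fail(100)) {    // 200 + 100 > 256 -> fail fast
    std::puts("write_buf_throttle wait, delta 100");
    throttle.get(100);                 // blocks until the flusher calls put()
  }
  std::puts("write_buf_throttle get, delta 100");

  flusher.join();
  return 0;
}

If nothing ever calls put() back into the throttle, get() waits forever, which
matches the hang described above.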
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Friday, January 17, 2025 3:02 PM
> To: Bailey Allison; ceph-users@xxxxxxx
> Subject: Re: MDS hung in purge_stale_snap_data after populating cache
>
> Hi Bailey.
>
> ceph-14 (rank=0): num_stray=205532
> ceph-13 (rank=1): num_stray=4446
> ceph-21-mds (rank=2): num_stray=99446249
> ceph-23 (rank=3): num_stray=3412
> ceph-08 (rank=4): num_stray=1238
> ceph-15 (rank=5): num_stray=1486
> ceph-16 (rank=6): num_stray=5545
> ceph-11 (rank=7): num_stray=2995
>
> The stats for rank 2 are almost certainly out of date, though. The config dump
> is large, but since you asked, here it is. Only three of the settings are present
> for maintenance and workaround reasons: mds_beacon_grace, auth_service_ticket_ttl
> and mon_osd_report_timeout. The last one is for a different issue, though.
>
> WHO     MASK            LEVEL     OPTION  VALUE  RO
> global                  advanced  auth_service_ticket_ttl  129600.000000
> global                  advanced  mds_beacon_grace  1209600.000000
> global                  advanced  mon_pool_quota_crit_threshold  90
> global                  advanced  mon_pool_quota_warn_threshold  70
> global                  dev       mon_warn_on_pool_pg_num_not_power_of_two  false
> global                  advanced  osd_map_message_max_bytes  16384
> global                  advanced  osd_op_queue  wpq  *
> global                  advanced  osd_op_queue_cut_off  high  *
> global                  advanced  osd_pool_default_pg_autoscale_mode  off
> mon                     advanced  mon_allow_pool_delete  false
> mon                     advanced  mon_osd_down_out_subtree_limit  host
> mon                     advanced  mon_osd_min_down_reporters  3
> mon                     advanced  mon_osd_report_timeout  86400
> mon                     advanced  mon_osd_reporter_subtree_level  host
> mon                     advanced  mon_pool_quota_warn_threshold  70
> mon                     advanced  mon_sync_max_payload_size  4096
> mon                     advanced  mon_warn_on_insecure_global_id_reclaim  false
> mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
> mgr                     advanced  mgr/balancer/active  false
> mgr                     advanced  mgr/dashboard/ceph-01/server_addr  10.40.88.65  *
> mgr                     advanced  mgr/dashboard/ceph-02/server_addr  10.40.88.66  *
> mgr                     advanced  mgr/dashboard/ceph-03/server_addr  10.40.88.67  *
> mgr                     advanced  mgr/dashboard/server_port  8443  *
> mgr                     advanced  mon_pg_warn_max_object_skew  10.000000
> mgr                     basic     target_max_misplaced_ratio  1.000000
> osd                     advanced  bluefs_buffered_io  true
> osd                     advanced  bluestore_compression_min_blob_size_hdd  262144
> osd                     advanced  bluestore_compression_min_blob_size_ssd  65536
> osd                     advanced  bluestore_compression_mode  aggressive
> osd     class:rbd_perf  advanced  bluestore_compression_mode  none
> osd                     dev       bluestore_fsck_quick_fix_on_mount  false
> osd                     advanced  osd_deep_scrub_randomize_ratio  0.000000
> osd     class:hdd       advanced  osd_delete_sleep  300.000000
> osd                     advanced  osd_fast_shutdown  false
> osd     class:fs_meta   advanced  osd_max_backfills  12
> osd     class:hdd       advanced  osd_max_backfills  3
> osd     class:rbd_data  advanced  osd_max_backfills  6
> osd     class:rbd_meta  advanced  osd_max_backfills  12
> osd     class:rbd_perf  advanced  osd_max_backfills  12
> osd     class:ssd       advanced  osd_max_backfills  12
> osd                     advanced  osd_max_backfills  3
> osd     class:fs_meta   dev       osd_memory_cache_min  2147483648
> osd     class:hdd       dev       osd_memory_cache_min  1073741824
> osd     class:rbd_data  dev       osd_memory_cache_min  2147483648
> osd     class:rbd_meta  dev       osd_memory_cache_min  1073741824
> osd     class:rbd_perf  dev       osd_memory_cache_min  2147483648
> osd     class:ssd       dev       osd_memory_cache_min  2147483648
> osd                     dev       osd_memory_cache_min  805306368
> osd     class:fs_meta   basic     osd_memory_target  6442450944
> osd     class:hdd       basic     osd_memory_target  3221225472
> osd     class:rbd_data  basic     osd_memory_target  4294967296
> osd     class:rbd_meta  basic     osd_memory_target  2147483648
> osd     class:rbd_perf  basic     osd_memory_target  6442450944
> osd     class:ssd       basic     osd_memory_target  4294967296
> osd                     basic     osd_memory_target  2147483648
> osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard  4  *
> osd     class:hdd       advanced  osd_recovery_delay_start  600.000000
> osd     class:rbd_data  advanced  osd_recovery_delay_start  300.000000
> osd     class:rbd_perf  advanced  osd_recovery_delay_start  300.000000
> osd     class:fs_meta   advanced  osd_recovery_max_active  32
> osd     class:hdd       advanced  osd_recovery_max_active  8
> osd     class:rbd_data  advanced  osd_recovery_max_active  16
> osd     class:rbd_meta  advanced  osd_recovery_max_active  32
> osd     class:rbd_perf  advanced  osd_recovery_max_active  16
> osd     class:ssd       advanced  osd_recovery_max_active  32
> osd                     advanced  osd_recovery_max_active  8
> osd     class:fs_meta   advanced  osd_recovery_sleep  0.002500
> osd     class:hdd       advanced  osd_recovery_sleep  0.050000
> osd     class:rbd_data  advanced  osd_recovery_sleep  0.025000
> osd     class:rbd_meta  advanced  osd_recovery_sleep  0.002500
> osd     class:rbd_perf  advanced  osd_recovery_sleep  0.010000
> osd     class:ssd       advanced  osd_recovery_sleep  0.002500
> osd                     advanced  osd_recovery_sleep  0.050000
> osd     class:hdd       dev       osd_scrub_backoff_ratio  0.330000
> osd     class:hdd       advanced  osd_scrub_during_recovery  true
> osd                     advanced  osd_scrub_load_threshold  0.750000
> osd     class:fs_meta   advanced  osd_snap_trim_sleep  0.050000
> osd     class:hdd       advanced  osd_snap_trim_sleep  2.000000
> osd     class:rbd_data  advanced  osd_snap_trim_sleep  0.100000
> mds                     basic     client_cache_size  8192
> mds                     advanced  defer_client_eviction_on_laggy_osds  false
> mds                     advanced  mds_bal_fragment_size_max  100000
> mds                     basic     mds_cache_memory_limit  25769803776
> mds                     advanced  mds_cache_reservation  0.500000
> mds                     advanced  mds_max_caps_per_client  65536
> mds                     advanced  mds_min_caps_per_client  4096
> mds                     advanced  mds_recall_max_caps  32768
> mds                     advanced  mds_session_blocklist_on_timeout  false
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Bailey Allison <ballison@xxxxxxxxxxxx>
> Sent: Thursday, January 16, 2025 10:08 PM
> To: ceph-users@xxxxxxx
> Subject: Re: MDS hung in purge_stale_snap_data after populating cache
>
> Frank,
>
> Are you able to share an up-to-date ceph config dump and ceph daemon
> mds.X perf dump | grep strays from the cluster?
>
> We're just getting through our comically long ceph outage, so I'd like
> to be able to share the love here hahahaha
>
> Regards,
>
> Bailey Allison
> Service Team Lead
> 45Drives, Ltd.
> 866-594-7199 x868
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx