> which is 3758096384. I'm not even sure what the unit is, probably bytes?

As far as I understand, the unit is "list items"; they can have variable length. On our system, about 400G are allocated while filling up the bufferlist.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, January 20, 2025 1:38 PM
To: Eugen Block
Cc: ceph-users@xxxxxxx
Subject: Re: Re: MDS hung in purge_stale_snap_data after populating cache

Hi Eugen,

I think the default is just a "reasonably large number" that's not too large. Looking at the code line you found:

    write_buf_throttle(cct, "write_buf_throttle", UINT_MAX - (UINT_MAX >> 3)),

my gut feeling is that rebuilding it with this change (factor 4):

    write_buf_throttle(cct, "write_buf_throttle", 4*( UINT_MAX - (UINT_MAX >> 3)) ),

will do the trick for us. The arguments are all int64, so there should be no overflow issues down the line. The factor 4 could also be an ENV variable, so the MDS can be restarted with different scalings if required.

The class Throttle does have a reset_max method, but I'm not sure if it is called anywhere, or whether it is possible to call it and change the max at runtime via things like "ceph daemon" or "ceph tell" in some way.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, January 20, 2025 1:25 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: MDS hung in purge_stale_snap_data after populating cache

Hi,

right, I haven't found a parameter for this to tune. Some throttling parameters are tunable, though; for example, when I created https://tracker.ceph.com/issues/66310 I assumed that the default for mgr_mon_messages is too low (it shows up as throttle-mgr_mon_messages in the perf dump). But you can't make everything configurable, I guess.

I have no idea if skipping is possible. I've also been looking for all kinds of mds-related config parameters, but it's not always clear what they are for. So fingers are crossed that you get out of that quickly.

Zitat von Frank Schilder <frans@xxxxxx>:

> Hi Eugen,
>
> yeah, I think you found it. That would also mean there is no
> parameter to scale that. I wonder if it is possible to skip the
> initial run of purge_stale_snap_data, have a lot of trash in the
> cache and use the forward scrub to deal with the stray items.
>
> Well, we got in touch with some companies offering emergency support
> and hope this can be fixed with reasonable effort and time.
>
> Thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Monday, January 20, 2025 12:40 PM
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re: Re: MDS hung in purge_stale_snap_data after populating cache
>
> It looks like a hard-coded max for the throttle:
>
>     write_buf_throttle(cct, "write_buf_throttle", UINT_MAX - (UINT_MAX >> 3)),
>
> which is 3758096384. I'm not even sure what the unit is, probably bytes?
>
> https://github.com/ceph/ceph/blob/v16.2.15/src/osdc/Journaler.h#L410
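For reference, the hard-coded value quoted above works out to

    UINT_MAX - (UINT_MAX >> 3) = 4294967295 - 536870911 = 3758096384

One caveat about the factor-4 rebuild proposed earlier in the thread: the literal expression 4*(UINT_MAX - (UINT_MAX >> 3)) is evaluated in unsigned int arithmetic (32 bits on the relevant platforms, which is also why the constructor argument comes out as 3758096384) and wraps before it ever reaches the int64 max parameter of the Throttle constructor, so a cast is needed to actually get a larger limit. Below is a minimal sketch, assuming the max parameter is int64_t as noted above; the MDS_WRITE_BUF_THROTTLE_SCALE environment variable and the helper function are hypothetical, not existing Ceph code:

    #include <climits>
    #include <cstdint>
    #include <cstdlib>

    // Default as hard-coded in Journaler.h: 3758096384.
    constexpr uint64_t default_max = UINT_MAX - (UINT_MAX >> 3);

    // Naive scaling wraps in unsigned int: 4 * 3758096384 mod 2^32 == 2147483648,
    // which is *smaller* than the default. Casting first keeps the multiplication
    // in 64-bit arithmetic:
    constexpr int64_t scaled_max = 4 * static_cast<int64_t>(UINT_MAX - (UINT_MAX >> 3));
    // scaled_max == 15032385536

    // Hypothetical ENV-based scaling (variable and helper names are assumptions),
    // defaulting to factor 1 when the variable is unset:
    inline int64_t write_buf_throttle_max() {
      const char *s = std::getenv("MDS_WRITE_BUF_THROTTLE_SCALE");
      int64_t factor = s ? std::atoll(s) : 1;
      if (factor < 1) factor = 1;
      return factor * static_cast<int64_t>(UINT_MAX - (UINT_MAX >> 3));
    }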
> Zitat von Frank Schilder <frans@xxxxxx>:
>
>> Hi Eugen,
>>
>> thanks for your input. I can't query the hung MDS, but the others
>> say this here:
>>
>> ceph tell mds.ceph-14 perf dump throttle-write_buf_throttle
>> {
>>     "throttle-write_buf_throttle": {
>>         "val": 0,
>>         "max": 3758096384,
>>         "get_started": 0,
>>         "get": 5199,
>>         "get_sum": 566691,
>>         "get_or_fail_fail": 0,
>>         "get_or_fail_success": 5199,
>>         "take": 0,
>>         "take_sum": 0,
>>         "put": 719,
>>         "put_sum": 566691,
>>         "wait": {
>>             "avgcount": 0,
>>             "sum": 0.000000000,
>>             "avgtime": 0.000000000
>>         }
>>     }
>> }
>>
>> You might be on to something; we are also trying to find where this
>> limit comes from.
>>
>> Please keep us posted.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Eugen Block <eblock@xxxxxx>
>> Sent: Monday, January 20, 2025 11:12 AM
>> To: ceph-users@xxxxxxx
>> Subject: Re: MDS hung in purge_stale_snap_data after populating cache
>>
>> Hi Frank,
>>
>> are you able to query the daemon while it's trying to purge the snaps?
>>
>> pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
>> ...
>>         "max": 3758096384,
>>
>> I don't know yet where that "max" setting comes from, but I'll keep looking.
>>
>> Zitat von Frank Schilder <frans@xxxxxx>:
>>
>>> Hi all,
>>>
>>> we tracked the deadlock down to line
>>> https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
>>> in Journaler::append_entry(bufferlist& bl):
>>>
>>>   // append
>>>   size_t delta = bl.length() + journal_stream.get_envelope_size();
>>>   // write_buf space is nearly full
>>>   if (!write_buf_throttle.get_or_fail(delta)) {
>>>     l.unlock();
>>>     ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
>>>     write_buf_throttle.get(delta);  // <<<<<<<<< The MDS is stuck here <<<<<<<<<
>>>     l.lock();
>>>   }
>>>   ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
>>>
>>> This is indicated by the last message in the log before the lock-up,
>>> which reads
>>>
>>>   mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
>>>
>>> and is generated by the line above the call to
>>> write_buf_throttle.get(delta). All log messages before this one
>>> start with "write_buf_throttle get, delta", which means they did
>>> not go into the if-statement.
>>>
>>> The obvious question is which parameter influences the maximum size of
>>> the variable Journaler::write_buffer
>>> (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306)
>>> in the class definition of class Journaler. Increasing this limit should
>>> get us past the deadlock.
>>>
>>> Thanks for your help and best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
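To spell out the blocking behaviour in the snippet above: get_or_fail() takes delta only if it still fits under max and otherwise returns false without blocking; the subsequent get() then sleeps until enough put() calls free up space (presumably issued once buffered journal data has been written out). If the flusher never makes progress, get() never returns, which matches the hang described here. The following is a minimal, self-contained sketch of those semantics, a simplified model rather than the actual Ceph Throttle implementation:

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Simplified model of the throttle used in Journaler::append_entry.
    class SimpleThrottle {
      std::mutex m;
      std::condition_variable cv;
      int64_t max;
      int64_t count = 0;   // corresponds to "val" in the perf dump

    public:
      explicit SimpleThrottle(int64_t m_) : max(m_) {}

      // Non-blocking: take delta only if it fits, otherwise report failure.
      // Failure triggers the "write_buf_throttle wait, delta ..." log line.
      bool get_or_fail(int64_t delta) {
        std::lock_guard<std::mutex> l(m);
        if (count + delta > max)
          return false;
        count += delta;
        return true;
      }

      // Blocking: wait until delta fits. If nothing ever calls put(),
      // this waits forever -- the state the MDS is stuck in.
      void get(int64_t delta) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return count + delta <= max; });
        count += delta;
      }

      // Called when buffered data has been drained, freeing throttle space.
      void put(int64_t delta) {
        std::lock_guard<std::mutex> l(m);
        count -= delta;
        cv.notify_all();
      }
    };

In the perf dump shown earlier, "val" and "max" correspond to count and max in this model; on the hung rank one would presumably see val sitting at or near the 3758096384 limit.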
>>> ________________________________________
>>> From: Frank Schilder <frans@xxxxxx>
>>> Sent: Friday, January 17, 2025 3:02 PM
>>> To: Bailey Allison; ceph-users@xxxxxxx
>>> Subject: Re: MDS hung in purge_stale_snap_data after populating cache
>>>
>>> Hi Bailey.
>>>
>>> ceph-14 (rank=0): num_stray=205532
>>> ceph-13 (rank=1): num_stray=4446
>>> ceph-21-mds (rank=2): num_stray=99446249
>>> ceph-23 (rank=3): num_stray=3412
>>> ceph-08 (rank=4): num_stray=1238
>>> ceph-15 (rank=5): num_stray=1486
>>> ceph-16 (rank=6): num_stray=5545
>>> ceph-11 (rank=7): num_stray=2995
>>>
>>> The stats for rank 2 are almost certainly out of date, though. The
>>> config dump is large, but since you asked, here it is. Only 3 settings
>>> are present for maintenance and workaround reasons: mds_beacon_grace,
>>> auth_service_ticket_ttl and mon_osd_report_timeout. The last is for a
>>> different issue, though.
>>>
>>> WHO     MASK            LEVEL     OPTION  VALUE  RO
>>> global                  advanced  auth_service_ticket_ttl  129600.000000
>>> global                  advanced  mds_beacon_grace  1209600.000000
>>> global                  advanced  mon_pool_quota_crit_threshold  90
>>> global                  advanced  mon_pool_quota_warn_threshold  70
>>> global                  dev       mon_warn_on_pool_pg_num_not_power_of_two  false
>>> global                  advanced  osd_map_message_max_bytes  16384
>>> global                  advanced  osd_op_queue  wpq  *
>>> global                  advanced  osd_op_queue_cut_off  high  *
>>> global                  advanced  osd_pool_default_pg_autoscale_mode  off
>>> mon                     advanced  mon_allow_pool_delete  false
>>> mon                     advanced  mon_osd_down_out_subtree_limit  host
>>> mon                     advanced  mon_osd_min_down_reporters  3
>>> mon                     advanced  mon_osd_report_timeout  86400
>>> mon                     advanced  mon_osd_reporter_subtree_level  host
>>> mon                     advanced  mon_pool_quota_warn_threshold  70
>>> mon                     advanced  mon_sync_max_payload_size  4096
>>> mon                     advanced  mon_warn_on_insecure_global_id_reclaim  false
>>> mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
>>> mgr                     advanced  mgr/balancer/active  false
>>> mgr                     advanced  mgr/dashboard/ceph-01/server_addr  10.40.88.65  *
>>> mgr                     advanced  mgr/dashboard/ceph-02/server_addr  10.40.88.66  *
>>> mgr                     advanced  mgr/dashboard/ceph-03/server_addr  10.40.88.67  *
>>> mgr                     advanced  mgr/dashboard/server_port  8443  *
>>> mgr                     advanced  mon_pg_warn_max_object_skew  10.000000
>>> mgr                     basic     target_max_misplaced_ratio  1.000000
>>> osd                     advanced  bluefs_buffered_io  true
>>> osd                     advanced  bluestore_compression_min_blob_size_hdd  262144
>>> osd                     advanced  bluestore_compression_min_blob_size_ssd  65536
>>> osd                     advanced  bluestore_compression_mode  aggressive
>>> osd     class:rbd_perf  advanced  bluestore_compression_mode  none
>>> osd                     dev       bluestore_fsck_quick_fix_on_mount  false
>>> osd                     advanced  osd_deep_scrub_randomize_ratio  0.000000
>>> osd     class:hdd       advanced  osd_delete_sleep  300.000000
>>> osd                     advanced  osd_fast_shutdown  false
>>> osd     class:fs_meta   advanced  osd_max_backfills  12
>>> osd     class:hdd       advanced  osd_max_backfills  3
>>> osd     class:rbd_data  advanced  osd_max_backfills  6
>>> osd     class:rbd_meta  advanced  osd_max_backfills  12
>>> osd     class:rbd_perf  advanced  osd_max_backfills  12
>>> osd     class:ssd       advanced  osd_max_backfills  12
>>> osd                     advanced  osd_max_backfills  3
>>> osd     class:fs_meta   dev       osd_memory_cache_min  2147483648
>>> osd     class:hdd       dev       osd_memory_cache_min  1073741824
>>> osd     class:rbd_data  dev       osd_memory_cache_min  2147483648
>>> osd     class:rbd_meta  dev       osd_memory_cache_min  1073741824
>>> osd     class:rbd_perf  dev       osd_memory_cache_min  2147483648
>>> osd     class:ssd       dev       osd_memory_cache_min  2147483648
>>> osd                     dev       osd_memory_cache_min  805306368
>>> osd     class:fs_meta   basic     osd_memory_target  6442450944
>>> osd     class:hdd       basic     osd_memory_target  3221225472
>>> osd     class:rbd_data  basic     osd_memory_target  4294967296
>>> osd     class:rbd_meta  basic     osd_memory_target  2147483648
>>> osd     class:rbd_perf  basic     osd_memory_target  6442450944
>>> osd     class:ssd       basic     osd_memory_target  4294967296
>>> osd                     basic     osd_memory_target  2147483648
>>> osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard  4  *
>>> osd     class:hdd       advanced  osd_recovery_delay_start  600.000000
>>> osd     class:rbd_data  advanced  osd_recovery_delay_start  300.000000
>>> osd     class:rbd_perf  advanced  osd_recovery_delay_start  300.000000
>>> osd     class:fs_meta   advanced  osd_recovery_max_active  32
>>> osd     class:hdd       advanced  osd_recovery_max_active  8
>>> osd     class:rbd_data  advanced  osd_recovery_max_active  16
>>> osd     class:rbd_meta  advanced  osd_recovery_max_active  32
>>> osd     class:rbd_perf  advanced  osd_recovery_max_active  16
>>> osd     class:ssd       advanced  osd_recovery_max_active  32
>>> osd                     advanced  osd_recovery_max_active  8
>>> osd     class:fs_meta   advanced  osd_recovery_sleep  0.002500
>>> osd     class:hdd       advanced  osd_recovery_sleep  0.050000
>>> osd     class:rbd_data  advanced  osd_recovery_sleep  0.025000
>>> osd     class:rbd_meta  advanced  osd_recovery_sleep  0.002500
>>> osd     class:rbd_perf  advanced  osd_recovery_sleep  0.010000
>>> osd     class:ssd       advanced  osd_recovery_sleep  0.002500
>>> osd                     advanced  osd_recovery_sleep  0.050000
>>> osd     class:hdd       dev       osd_scrub_backoff_ratio  0.330000
>>> osd     class:hdd       advanced  osd_scrub_during_recovery  true
>>> osd                     advanced  osd_scrub_load_threshold  0.750000
>>> osd     class:fs_meta   advanced  osd_snap_trim_sleep  0.050000
>>> osd     class:hdd       advanced  osd_snap_trim_sleep  2.000000
>>> osd     class:rbd_data  advanced  osd_snap_trim_sleep  0.100000
>>> mds                     basic     client_cache_size  8192
>>> mds                     advanced  defer_client_eviction_on_laggy_osds  false
>>> mds                     advanced  mds_bal_fragment_size_max  100000
>>> mds                     basic     mds_cache_memory_limit  25769803776
>>> mds                     advanced  mds_cache_reservation  0.500000
>>> mds                     advanced  mds_max_caps_per_client  65536
>>> mds                     advanced  mds_min_caps_per_client  4096
>>> mds                     advanced  mds_recall_max_caps  32768
>>> mds                     advanced  mds_session_blocklist_on_timeout  false
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Bailey Allison <ballison@xxxxxxxxxxxx>
>>> Sent: Thursday, January 16, 2025 10:08 PM
>>> To: ceph-users@xxxxxxx
>>> Subject: Re: MDS hung in purge_stale_snap_data after populating cache
>>>
>>> Frank,
>>>
>>> Are you able to share an up-to-date "ceph config dump" and "ceph daemon
>>> mds.X perf dump | grep strays" from the cluster?
>>>
>>> We're just getting through our comically long ceph outage, so I'd like
>>> to be able to share the love here hahahaha
>>>
>>> Regards,
>>>
>>> Bailey Allison
>>> Service Team Lead
>>> 45Drives, Ltd.
>>> 866-594-7199 x868
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx