Hi Frank,
are you able to query the daemon while it's trying to purge the snaps?
pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
...
"max": 3758096384,
I don't know yet where that "max" setting comes from, but I'll keep looking.
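(For scale, that "max" is exactly 3.5 GiB.) In case it helps, here is a rough sketch for watching the throttle fill up while the purge runs, built around the same command; the daemon name ceph-21-mds is only a placeholder for whatever your rank-2 MDS is called:

# sample the write_buf_throttle counters while the purge queue works
# (placeholder daemon name; adjust to your cluster)
while true; do
  date
  ceph tell mds.ceph-21-mds perf dump throttle-write_buf_throttle
  sleep 10
done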
Quoting Frank Schilder <frans@xxxxxx>:
Hi all,
we tracked the deadlock down to line
https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
in Journaler::append_entry(bufferlist& bl):
  // append
  size_t delta = bl.length() + journal_stream.get_envelope_size();
  // write_buf space is nearly full
  if (!write_buf_throttle.get_or_fail(delta)) {
    l.unlock();
    ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
    write_buf_throttle.get(delta);  // <<<<<<<<< The MDS is stuck here <<<<<<<<<
    l.lock();
  }
  ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
This is indicated by the last message in the log before the lock-up,
which reads
mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
and is generated by the ldout line just above the call to
write_buf_throttle.get(delta). All log messages before that start with
"write_buf_throttle get, delta", which means those calls did not enter
the if-statement.
The obvious question is: which parameter influences the maximum size of
the variable Journaler::write_buffer
(https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306) in the
class definition of Journaler? Increasing this limit should get us past
the deadlock.
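For anyone following along, the blocking pattern looks roughly like the simplified stand-in below. This is not the actual ceph Throttle implementation, just a minimal sketch of the behaviour the log shows: get_or_fail() refuses once the budget ("max") is used up, get() then blocks until someone returns budget via put(), and if the purge queue never drains its buffer, nothing ever does.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Simplified stand-in for the write_buf_throttle behaviour (illustration only).
class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  uint64_t max;
  uint64_t cur = 0;
public:
  explicit SimpleThrottle(uint64_t max) : max(max) {}

  // non-blocking: take 'delta' of budget if it fits, otherwise report failure
  bool get_or_fail(uint64_t delta) {
    std::lock_guard<std::mutex> l(m);
    if (cur + delta > max)
      return false;
    cur += delta;
    return true;
  }

  // blocking: wait until 'delta' fits -- the analogue of where the MDS is
  // stuck, because the flusher that should call put() never frees any budget
  void get(uint64_t delta) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return cur + delta <= max; });
    cur += delta;
  }

  // called after buffered data has been written out, returning budget
  void put(uint64_t delta) {
    {
      std::lock_guard<std::mutex> l(m);
      cur -= delta;
    }
    cv.notify_all();
  }
};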
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Friday, January 17, 2025 3:02 PM
To: Bailey Allison; ceph-users@xxxxxxx
Subject: Re: MDS hung in purge_stale_snap_data after populating cache
Hi Bailey.
ceph-14 (rank=0): num_stray=205532
ceph-13 (rank=1): num_stray=4446
ceph-21-mds (rank=2): num_stray=99446249
ceph-23 (rank=3): num_stray=3412
ceph-08 (rank=4): num_stray=1238
ceph-15 (rank=5): num_stray=1486
ceph-16 (rank=6): num_stray=5545
ceph-11 (rank=7): num_stray=2995
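Something like the loop below reproduces this list (a sketch only; daemon names taken from above, and the counter name num_strays is what the perf dump reports in the mds_cache section):

# per-rank stray counts via the MDS perf counters
for name in ceph-14 ceph-13 ceph-21-mds ceph-23 ceph-08 ceph-15 ceph-16 ceph-11; do
  echo -n "$name: "
  ceph tell mds.$name perf dump | grep '"num_strays"'
done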
The stats for rank 2 are almost certainly out of date, though. The
config dump is large, but since you asked, here it is. Only three of
the settings are in there for maintenance and workaround reasons:
mds_beacon_grace, auth_service_ticket_ttl and mon_osd_report_timeout;
the last one is for a different issue, though.
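To pull just those three out of a dump later, a simple grep does it (sketch):

ceph config dump | grep -E 'mds_beacon_grace|auth_service_ticket_ttl|mon_osd_report_timeout'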
WHO     MASK            LEVEL     OPTION                                          VALUE           RO
global                  advanced  auth_service_ticket_ttl                         129600.000000
global                  advanced  mds_beacon_grace                                1209600.000000
global                  advanced  mon_pool_quota_crit_threshold                   90
global                  advanced  mon_pool_quota_warn_threshold                   70
global                  dev       mon_warn_on_pool_pg_num_not_power_of_two        false
global                  advanced  osd_map_message_max_bytes                       16384
global                  advanced  osd_op_queue                                    wpq             *
global                  advanced  osd_op_queue_cut_off                            high            *
global                  advanced  osd_pool_default_pg_autoscale_mode              off
mon                     advanced  mon_allow_pool_delete                           false
mon                     advanced  mon_osd_down_out_subtree_limit                  host
mon                     advanced  mon_osd_min_down_reporters                      3
mon                     advanced  mon_osd_report_timeout                          86400
mon                     advanced  mon_osd_reporter_subtree_level                  host
mon                     advanced  mon_pool_quota_warn_threshold                   70
mon                     advanced  mon_sync_max_payload_size                       4096
mon                     advanced  mon_warn_on_insecure_global_id_reclaim          false
mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
mgr                     advanced  mgr/balancer/active                             false
mgr                     advanced  mgr/dashboard/ceph-01/server_addr               10.40.88.65     *
mgr                     advanced  mgr/dashboard/ceph-02/server_addr               10.40.88.66     *
mgr                     advanced  mgr/dashboard/ceph-03/server_addr               10.40.88.67     *
mgr                     advanced  mgr/dashboard/server_port                       8443            *
mgr                     advanced  mon_pg_warn_max_object_skew                     10.000000
mgr                     basic     target_max_misplaced_ratio                      1.000000
osd                     advanced  bluefs_buffered_io                              true
osd                     advanced  bluestore_compression_min_blob_size_hdd         262144
osd                     advanced  bluestore_compression_min_blob_size_ssd         65536
osd                     advanced  bluestore_compression_mode                      aggressive
osd     class:rbd_perf  advanced  bluestore_compression_mode                      none
osd                     dev       bluestore_fsck_quick_fix_on_mount               false
osd                     advanced  osd_deep_scrub_randomize_ratio                  0.000000
osd     class:hdd       advanced  osd_delete_sleep                                300.000000
osd                     advanced  osd_fast_shutdown                               false
osd     class:fs_meta   advanced  osd_max_backfills                               12
osd     class:hdd       advanced  osd_max_backfills                               3
osd     class:rbd_data  advanced  osd_max_backfills                               6
osd     class:rbd_meta  advanced  osd_max_backfills                               12
osd     class:rbd_perf  advanced  osd_max_backfills                               12
osd     class:ssd       advanced  osd_max_backfills                               12
osd                     advanced  osd_max_backfills                               3
osd     class:fs_meta   dev       osd_memory_cache_min                            2147483648
osd     class:hdd       dev       osd_memory_cache_min                            1073741824
osd     class:rbd_data  dev       osd_memory_cache_min                            2147483648
osd     class:rbd_meta  dev       osd_memory_cache_min                            1073741824
osd     class:rbd_perf  dev       osd_memory_cache_min                            2147483648
osd     class:ssd       dev       osd_memory_cache_min                            2147483648
osd                     dev       osd_memory_cache_min                            805306368
osd     class:fs_meta   basic     osd_memory_target                               6442450944
osd     class:hdd       basic     osd_memory_target                               3221225472
osd     class:rbd_data  basic     osd_memory_target                               4294967296
osd     class:rbd_meta  basic     osd_memory_target                               2147483648
osd     class:rbd_perf  basic     osd_memory_target                               6442450944
osd     class:ssd       basic     osd_memory_target                               4294967296
osd                     basic     osd_memory_target                               2147483648
osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard                    4               *
osd     class:hdd       advanced  osd_recovery_delay_start                        600.000000
osd     class:rbd_data  advanced  osd_recovery_delay_start                        300.000000
osd     class:rbd_perf  advanced  osd_recovery_delay_start                        300.000000
osd     class:fs_meta   advanced  osd_recovery_max_active                         32
osd     class:hdd       advanced  osd_recovery_max_active                         8
osd     class:rbd_data  advanced  osd_recovery_max_active                         16
osd     class:rbd_meta  advanced  osd_recovery_max_active                         32
osd     class:rbd_perf  advanced  osd_recovery_max_active                         16
osd     class:ssd       advanced  osd_recovery_max_active                         32
osd                     advanced  osd_recovery_max_active                         8
osd     class:fs_meta   advanced  osd_recovery_sleep                              0.002500
osd     class:hdd       advanced  osd_recovery_sleep                              0.050000
osd     class:rbd_data  advanced  osd_recovery_sleep                              0.025000
osd     class:rbd_meta  advanced  osd_recovery_sleep                              0.002500
osd     class:rbd_perf  advanced  osd_recovery_sleep                              0.010000
osd     class:ssd       advanced  osd_recovery_sleep                              0.002500
osd                     advanced  osd_recovery_sleep                              0.050000
osd     class:hdd       dev       osd_scrub_backoff_ratio                         0.330000
osd     class:hdd       advanced  osd_scrub_during_recovery                       true
osd                     advanced  osd_scrub_load_threshold                        0.750000
osd     class:fs_meta   advanced  osd_snap_trim_sleep                             0.050000
osd     class:hdd       advanced  osd_snap_trim_sleep                             2.000000
osd     class:rbd_data  advanced  osd_snap_trim_sleep                             0.100000
mds                     basic     client_cache_size                               8192
mds                     advanced  defer_client_eviction_on_laggy_osds             false
mds                     advanced  mds_bal_fragment_size_max                       100000
mds                     basic     mds_cache_memory_limit                          25769803776
mds                     advanced  mds_cache_reservation                           0.500000
mds                     advanced  mds_max_caps_per_client                         65536
mds                     advanced  mds_min_caps_per_client                         4096
mds                     advanced  mds_recall_max_caps                             32768
mds                     advanced  mds_session_blocklist_on_timeout                false
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Bailey Allison <ballison@xxxxxxxxxxxx>
Sent: Thursday, January 16, 2025 10:08 PM
To: ceph-users@xxxxxxx
Subject: Re: MDS hung in purge_stale_snap_data after populating cache
Frank,
Are you able to share an up-to-date ceph config dump and a
ceph daemon mds.X perf dump | grep strays from the cluster?
We're just getting through our comically long ceph outage, so I'd like
to be able to share the love here hahahaha
Regards,
Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx