Re: MDS hung in purge_stale_snap_data after populating cache

Hi Frank,

are you able to query the daemon while it's trying to purge the snaps?

pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
...
        "max": 3758096384,

I don't know yet where that "max" setting comes from, but I'll keep looking.
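One possibly useful observation (just arithmetic, not a confirmed answer): 3758096384 is exactly UINT_MAX - (UINT_MAX >> 3), i.e. 7/8 of 4 GiB, or 3.5 GiB, which looks more like a compile-time constant than a config default. A quick sanity check of the numbers, assuming a platform with 32-bit unsigned int:

// Illustrative check only: the throttle "max" reported by perf dump
// equals 7/8 of the 32-bit unsigned range, i.e. 3.5 GiB.
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
  const uint64_t uint_max  = std::numeric_limits<unsigned int>::max(); // 4294967295 for 32-bit unsigned int
  const uint64_t candidate = uint_max - (uint_max >> 3);               // 7/8 of UINT_MAX, rounded up
  std::cout << candidate << "\n";                                      // 3758096384, matches the perf dump "max"
  std::cout << (candidate == 3758096384ULL ? "match" : "no match") << "\n";
  std::cout << candidate / double(1ULL << 30) << " GiB\n";             // 3.5
  return 0;
}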

Quoting Frank Schilder <frans@xxxxxx>:

Hi all,

we tracked the deadlock down to line https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583 in Journaler::append_entry(bufferlist& bl):

  // append
  size_t delta = bl.length() + journal_stream.get_envelope_size();
  // write_buf space is nearly full
  if (!write_buf_throttle.get_or_fail(delta)) {
    l.unlock();
    ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
    write_buf_throttle.get(delta); // <<<<<<<<< The MDS is stuck here <<<<<<<<<
    l.lock();
  }
  ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;

This is indicated by the last message in the log before the lock-up, which reads

  mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101

and is generated by the ldout line just above the call to write_buf_throttle.get(delta). All log messages before that start with "write_buf_throttle get, delta", which means those calls did not enter the if-statement.
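For readers not familiar with the throttle semantics, here is a minimal sketch (a simplified illustration of my own, not Ceph's actual Throttle class) of the get_or_fail()/get()/put() behaviour the code above relies on: get_or_fail() takes the delta only if it fits into the byte budget, the fallback get() blocks until enough budget is returned via put(), and put() is presumably issued once buffered journal data has been flushed out. If nothing ever returns budget, the thread stays parked in get(), which is exactly where the MDS is stuck.

// Simplified sketch of the throttle semantics (illustration only, not Ceph's Throttle):
// a byte budget with a non-blocking get_or_fail(), a blocking get(), and put() to
// return budget.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class ByteThrottle {  // hypothetical name, for illustration
public:
  explicit ByteThrottle(uint64_t max) : max_(max) {}

  // Non-blocking: take 'delta' bytes if they fit into the budget, else return false.
  bool get_or_fail(uint64_t delta) {
    std::lock_guard<std::mutex> lk(m_);
    if (count_ + delta > max_) return false;
    count_ += delta;
    return true;
  }

  // Blocking: wait until 'delta' bytes fit, then take them.  If no other thread
  // ever calls put(), this waits forever -- the situation the stuck MDS is in.
  void get(uint64_t delta) {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return count_ + delta <= max_; });
    count_ += delta;
  }

  // Return 'delta' bytes to the budget, waking any blocked get().
  void put(uint64_t delta) {
    std::lock_guard<std::mutex> lk(m_);
    count_ -= delta;
    cv_.notify_all();
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  uint64_t max_;
  uint64_t count_ = 0;
};

int main() {
  ByteThrottle t(100);
  bool first  = t.get_or_fail(60);  // true: 60 of 100 bytes taken
  bool second = t.get_or_fail(60);  // false: only 40 bytes left
  // t.get(60) would now block until another thread returns enough budget via t.put().
  return (first && !second) ? 0 : 1;
}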

The obvious question is: which parameter influences the maximum size of the variable Journaler::write_buffer (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306) in the class definition of Journaler? Increasing this limit should get us past the deadlock.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Friday, January 17, 2025 3:02 PM
To: Bailey Allison; ceph-users@xxxxxxx
Subject: Re: MDS hung in purge_stale_snap_data after populating cache

Hi Bailey.

ceph-14 (rank=0): num_stray=205532
ceph-13 (rank=1): num_stray=4446
ceph-21-mds (rank=2): num_stray=99446249
ceph-23 (rank=3): num_stray=3412
ceph-08 (rank=4): num_stray=1238
ceph-15 (rank=5): num_stray=1486
ceph-16 (rank=6): num_stray=5545
ceph-11 (rank=7): num_stray=2995

The stats for rank 2 are almost certainly out of date, though. The config dump is large, but since you asked, here it is. Only 3 settings are present for maintenance and workaround reasons: mds_beacon_grace, auth_service_ticket_ttl and mon_osd_report_timeout. The last one is for a different issue, though.

WHO     MASK            LEVEL     OPTION                                          VALUE           RO
global                  advanced  auth_service_ticket_ttl                         129600.000000
global                  advanced  mds_beacon_grace                                1209600.000000
global                  advanced  mon_pool_quota_crit_threshold                   90
global                  advanced  mon_pool_quota_warn_threshold                   70
global                  dev       mon_warn_on_pool_pg_num_not_power_of_two        false
global                  advanced  osd_map_message_max_bytes                       16384
global                  advanced  osd_op_queue                                    wpq             *
global                  advanced  osd_op_queue_cut_off                            high            *
global                  advanced  osd_pool_default_pg_autoscale_mode              off
mon                     advanced  mon_allow_pool_delete                           false
mon                     advanced  mon_osd_down_out_subtree_limit                  host
mon                     advanced  mon_osd_min_down_reporters                      3
mon                     advanced  mon_osd_report_timeout                          86400
mon                     advanced  mon_osd_reporter_subtree_level                  host
mon                     advanced  mon_pool_quota_warn_threshold                   70
mon                     advanced  mon_sync_max_payload_size                       4096
mon                     advanced  mon_warn_on_insecure_global_id_reclaim          false
mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
mgr                     advanced  mgr/balancer/active                             false
mgr                     advanced  mgr/dashboard/ceph-01/server_addr               10.40.88.65     *
mgr                     advanced  mgr/dashboard/ceph-02/server_addr               10.40.88.66     *
mgr                     advanced  mgr/dashboard/ceph-03/server_addr               10.40.88.67     *
mgr                     advanced  mgr/dashboard/server_port                       8443            *
mgr                     advanced  mon_pg_warn_max_object_skew                     10.000000
mgr                     basic     target_max_misplaced_ratio                      1.000000
osd                     advanced  bluefs_buffered_io                              true
osd                     advanced  bluestore_compression_min_blob_size_hdd         262144
osd                     advanced  bluestore_compression_min_blob_size_ssd         65536
osd                     advanced  bluestore_compression_mode                      aggressive
osd     class:rbd_perf  advanced  bluestore_compression_mode                      none
osd                     dev       bluestore_fsck_quick_fix_on_mount               false
osd                     advanced  osd_deep_scrub_randomize_ratio                  0.000000
osd     class:hdd       advanced  osd_delete_sleep                                300.000000
osd                     advanced  osd_fast_shutdown                               false
osd     class:fs_meta   advanced  osd_max_backfills                               12
osd     class:hdd       advanced  osd_max_backfills                               3
osd     class:rbd_data  advanced  osd_max_backfills                               6
osd     class:rbd_meta  advanced  osd_max_backfills                               12
osd     class:rbd_perf  advanced  osd_max_backfills                               12
osd     class:ssd       advanced  osd_max_backfills                               12
osd                     advanced  osd_max_backfills                               3
osd     class:fs_meta   dev       osd_memory_cache_min                            2147483648
osd     class:hdd       dev       osd_memory_cache_min                            1073741824
osd     class:rbd_data  dev       osd_memory_cache_min                            2147483648
osd     class:rbd_meta  dev       osd_memory_cache_min                            1073741824
osd     class:rbd_perf  dev       osd_memory_cache_min                            2147483648
osd     class:ssd       dev       osd_memory_cache_min                            2147483648
osd                     dev       osd_memory_cache_min                            805306368
osd     class:fs_meta   basic     osd_memory_target                               6442450944
osd     class:hdd       basic     osd_memory_target                               3221225472
osd     class:rbd_data  basic     osd_memory_target                               4294967296
osd     class:rbd_meta  basic     osd_memory_target                               2147483648
osd     class:rbd_perf  basic     osd_memory_target                               6442450944
osd     class:ssd       basic     osd_memory_target                               4294967296
osd                     basic     osd_memory_target                               2147483648
osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard                    4               *
osd     class:hdd       advanced  osd_recovery_delay_start                        600.000000
osd     class:rbd_data  advanced  osd_recovery_delay_start                        300.000000
osd     class:rbd_perf  advanced  osd_recovery_delay_start                        300.000000
osd     class:fs_meta   advanced  osd_recovery_max_active                         32
osd     class:hdd       advanced  osd_recovery_max_active                         8
osd     class:rbd_data  advanced  osd_recovery_max_active                         16
osd     class:rbd_meta  advanced  osd_recovery_max_active                         32
osd     class:rbd_perf  advanced  osd_recovery_max_active                         16
osd     class:ssd       advanced  osd_recovery_max_active                         32
osd                     advanced  osd_recovery_max_active                         8
osd     class:fs_meta   advanced  osd_recovery_sleep                              0.002500
osd     class:hdd       advanced  osd_recovery_sleep                              0.050000
osd     class:rbd_data  advanced  osd_recovery_sleep                              0.025000
osd     class:rbd_meta  advanced  osd_recovery_sleep                              0.002500
osd     class:rbd_perf  advanced  osd_recovery_sleep                              0.010000
osd     class:ssd       advanced  osd_recovery_sleep                              0.002500
osd                     advanced  osd_recovery_sleep                              0.050000
osd     class:hdd       dev       osd_scrub_backoff_ratio                         0.330000
osd     class:hdd       advanced  osd_scrub_during_recovery                       true
osd                     advanced  osd_scrub_load_threshold                        0.750000
osd     class:fs_meta   advanced  osd_snap_trim_sleep                             0.050000
osd     class:hdd       advanced  osd_snap_trim_sleep                             2.000000
osd     class:rbd_data  advanced  osd_snap_trim_sleep                             0.100000
mds                     basic     client_cache_size                               8192
mds                     advanced  defer_client_eviction_on_laggy_osds             false
mds                     advanced  mds_bal_fragment_size_max                       100000
mds                     basic     mds_cache_memory_limit                          25769803776
mds                     advanced  mds_cache_reservation                           0.500000
mds                     advanced  mds_max_caps_per_client                         65536
mds                     advanced  mds_min_caps_per_client                         4096
mds                     advanced  mds_recall_max_caps                             32768
mds                     advanced  mds_session_blocklist_on_timeout                false

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Bailey Allison <ballison@xxxxxxxxxxxx>
Sent: Thursday, January 16, 2025 10:08 PM
To: ceph-users@xxxxxxx
Subject: Re: MDS hung in purge_stale_snap_data after populating cache

Frank,

Are you able to share an up to date ceph config dump and ceph daemon
mds.X perf dump | grep strays from the cluster?

We're just getting through our comically long Ceph outage, so I'd like
to be able to share the love here hahahaha

Regards,

Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



