Re: MDS hung in purge_stale_snap_data after populating cache

Hi all,

We tracked the deadlock down to the following line (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583) in Journaler::append_entry(bufferlist& bl):

  // append
  size_t delta = bl.length() + journal_stream.get_envelope_size();
  // write_buf space is nearly full
  if (!write_buf_throttle.get_or_fail(delta)) {
    l.unlock();
    ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
    write_buf_throttle.get(delta);  //<<<<<<<<< The MDS is stuck here <<<<<<<<<
    l.lock();
  }
  ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;

This is indicated by the last message in the log before the lock-up, which reads

  mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101

and is generated by the ldout statement just above the call to write_buf_throttle.get(delta). All log messages before that start with "write_buf_throttle get, delta", which means those earlier calls never entered the if-statement.
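
In case it helps to see the mechanism, below is a minimal, self-contained sketch of the throttle pattern append_entry uses. This is not the actual ceph Throttle class; the limit, the delta and the helper "flusher" thread are made up purely for illustration. The point is that get_or_fail() only takes budget if it is still available, while the blocking get() sleeps until someone returns budget via put(), which in the Journaler presumably only happens once buffered journal data has actually been flushed out. If that never happens, get() never returns, which matches the hang we see.

  // Simplified stand-in for a byte-count throttle (NOT ceph::Throttle).
  #include <algorithm>
  #include <condition_variable>
  #include <cstdint>
  #include <iostream>
  #include <mutex>
  #include <thread>

  class ByteThrottle {
  public:
    explicit ByteThrottle(uint64_t max) : max_(max) {}

    // Non-blocking: take 'c' bytes of budget if available, otherwise fail.
    bool get_or_fail(uint64_t c) {
      std::lock_guard<std::mutex> lk(m_);
      if (used_ + c > max_) return false;
      used_ += c;
      return true;
    }

    // Blocking: wait until 'c' bytes of budget are free.  This is the call
    // the MDS is sitting in -- it only returns after someone calls put().
    void get(uint64_t c) {
      std::unique_lock<std::mutex> lk(m_);
      cv_.wait(lk, [&] { return used_ + c <= max_; });
      used_ += c;
    }

    // Release budget; in a journaler-like setup this would happen when the
    // buffered data is actually written out.
    void put(uint64_t c) {
      std::lock_guard<std::mutex> lk(m_);
      used_ -= std::min(c, used_);
      cv_.notify_all();
    }

  private:
    std::mutex m_;
    std::condition_variable cv_;
    uint64_t max_;
    uint64_t used_ = 0;
  };

  int main() {
    ByteThrottle throttle(256);   // tiny made-up limit to force the slow path
    const uint64_t delta = 101;   // same delta as in the log line above

    for (int i = 0; i < 4; ++i) {
      if (!throttle.get_or_fail(delta)) {
        std::cout << "write_buf_throttle wait, delta " << delta << "\n";
        // Without anyone calling put(), the next line blocks forever -- the
        // situation in our MDS.  Here a helper thread releases budget so
        // the example terminates.
        std::thread flusher([&] { throttle.put(delta); });
        throttle.get(delta);
        flusher.join();
      }
      std::cout << "write_buf_throttle get, delta " << delta << "\n";
    }
    return 0;
  }

In this sketch a larger throttle limit simply postpones the point where the code falls into the blocking branch, which is what makes us look at the limit below.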

The obvious question is: which parameter influences the maximum size of the variable Journaler::write_buffer (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306) in the class definition of Journaler? Increasing this limit should get us past the deadlock.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Friday, January 17, 2025 3:02 PM
To: Bailey Allison; ceph-users@xxxxxxx
Subject:  Re: MDS hung in purge_stale_snap_data after populating cache

Hi Bailey.

ceph-14 (rank=0): num_stray=205532
ceph-13 (rank=1): num_stray=4446
ceph-21-mds (rank=2): num_stray=99446249
ceph-23 (rank=3): num_stray=3412
ceph-08 (rank=4): num_stray=1238
ceph-15 (rank=5): num_stray=1486
ceph-16 (rank=6): num_stray=5545
ceph-11 (rank=7): num_stray=2995

The stats for rank 2 are almost certainly out of date, though. The config dump is large, but since you asked, here it is. Only 3 of these settings are present for maintenance and workaround reasons: mds_beacon_grace, auth_service_ticket_ttl and mon_osd_report_timeout. The last one is for a different issue, though.

WHO     MASK            LEVEL                                           OPTION                        VALUE       RO
global  advanced        auth_service_ticket_ttl                         129600.000000
global  advanced        mds_beacon_grace                                1209600.000000
global  advanced        mon_pool_quota_crit_threshold                   90
global  advanced        mon_pool_quota_warn_threshold                   70
global  dev             mon_warn_on_pool_pg_num_not_power_of_two        false
global  advanced        osd_map_message_max_bytes                       16384
global  advanced        osd_op_queue                                    wpq                           *
global  advanced        osd_op_queue_cut_off                            high                          *
global  advanced        osd_pool_default_pg_autoscale_mode              off
mon     advanced        mon_allow_pool_delete                           false
mon     advanced        mon_osd_down_out_subtree_limit                  host
mon     advanced        mon_osd_min_down_reporters                      3
mon     advanced        mon_osd_report_timeout                          86400
mon     advanced        mon_osd_reporter_subtree_level                  host
mon     advanced        mon_pool_quota_warn_threshold                   70
mon     advanced        mon_sync_max_payload_size                       4096
mon     advanced        mon_warn_on_insecure_global_id_reclaim          false
mon     advanced        mon_warn_on_insecure_global_id_reclaim_allowed  false
mgr     advanced        mgr/balancer/active                             false
mgr     advanced        mgr/dashboard/ceph-01/server_addr               10.40.88.65                   *
mgr     advanced        mgr/dashboard/ceph-02/server_addr               10.40.88.66                   *
mgr     advanced        mgr/dashboard/ceph-03/server_addr               10.40.88.67                   *
mgr     advanced        mgr/dashboard/server_port                       8443                          *
mgr     advanced        mon_pg_warn_max_object_skew                     10.000000
mgr     basic           target_max_misplaced_ratio                      1.000000
osd     advanced        bluefs_buffered_io                              true
osd     advanced        bluestore_compression_min_blob_size_hdd         262144
osd     advanced        bluestore_compression_min_blob_size_ssd         65536
osd     advanced        bluestore_compression_mode                      aggressive
osd     class:rbd_perf  advanced                                        bluestore_compression_mode    none
osd     dev             bluestore_fsck_quick_fix_on_mount               false
osd     advanced        osd_deep_scrub_randomize_ratio                  0.000000
osd     class:hdd       advanced                                        osd_delete_sleep              300.000000
osd     advanced        osd_fast_shutdown                               false
osd     class:fs_meta   advanced                                        osd_max_backfills             12
osd     class:hdd       advanced                                        osd_max_backfills             3
osd     class:rbd_data  advanced                                        osd_max_backfills             6
osd     class:rbd_meta  advanced                                        osd_max_backfills             12
osd     class:rbd_perf  advanced                                        osd_max_backfills             12
osd     class:ssd       advanced                                        osd_max_backfills             12
osd     advanced        osd_max_backfills                               3
osd     class:fs_meta   dev                                             osd_memory_cache_min          2147483648
osd     class:hdd       dev                                             osd_memory_cache_min          1073741824
osd     class:rbd_data  dev                                             osd_memory_cache_min          2147483648
osd     class:rbd_meta  dev                                             osd_memory_cache_min          1073741824
osd     class:rbd_perf  dev                                             osd_memory_cache_min          2147483648
osd     class:ssd       dev                                             osd_memory_cache_min          2147483648
osd     dev             osd_memory_cache_min                            805306368
osd     class:fs_meta   basic                                           osd_memory_target             6442450944
osd     class:hdd       basic                                           osd_memory_target             3221225472
osd     class:rbd_data  basic                                           osd_memory_target             4294967296
osd     class:rbd_meta  basic                                           osd_memory_target             2147483648
osd     class:rbd_perf  basic                                           osd_memory_target             6442450944
osd     class:ssd       basic                                           osd_memory_target             4294967296
osd     basic           osd_memory_target                               2147483648
osd     class:rbd_perf  advanced                                        osd_op_num_threads_per_shard  4           *
osd     class:hdd       advanced                                        osd_recovery_delay_start      600.000000
osd     class:rbd_data  advanced                                        osd_recovery_delay_start      300.000000
osd     class:rbd_perf  advanced                                        osd_recovery_delay_start      300.000000
osd     class:fs_meta   advanced                                        osd_recovery_max_active       32
osd     class:hdd       advanced                                        osd_recovery_max_active       8
osd     class:rbd_data  advanced                                        osd_recovery_max_active       16
osd     class:rbd_meta  advanced                                        osd_recovery_max_active       32
osd     class:rbd_perf  advanced                                        osd_recovery_max_active       16
osd     class:ssd       advanced                                        osd_recovery_max_active       32
osd     advanced        osd_recovery_max_active                         8
osd     class:fs_meta   advanced                                        osd_recovery_sleep            0.002500
osd     class:hdd       advanced                                        osd_recovery_sleep            0.050000
osd     class:rbd_data  advanced                                        osd_recovery_sleep            0.025000
osd     class:rbd_meta  advanced                                        osd_recovery_sleep            0.002500
osd     class:rbd_perf  advanced                                        osd_recovery_sleep            0.010000
osd     class:ssd       advanced                                        osd_recovery_sleep            0.002500
osd     advanced        osd_recovery_sleep                              0.050000
osd     class:hdd       dev                                             osd_scrub_backoff_ratio       0.330000
osd     class:hdd       advanced                                        osd_scrub_during_recovery     true
osd     advanced        osd_scrub_load_threshold                        0.750000
osd     class:fs_meta   advanced                                        osd_snap_trim_sleep           0.050000
osd     class:hdd       advanced                                        osd_snap_trim_sleep           2.000000
osd     class:rbd_data  advanced                                        osd_snap_trim_sleep           0.100000
mds     basic           client_cache_size                               8192
mds     advanced        defer_client_eviction_on_laggy_osds             false
mds     advanced        mds_bal_fragment_size_max                       100000
mds     basic           mds_cache_memory_limit                          25769803776
mds     advanced        mds_cache_reservation                           0.500000
mds     advanced        mds_max_caps_per_client                         65536
mds     advanced        mds_min_caps_per_client                         4096
mds     advanced        mds_recall_max_caps                             32768
mds     advanced        mds_session_blocklist_on_timeout                false

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Bailey Allison <ballison@xxxxxxxxxxxx>
Sent: Thursday, January 16, 2025 10:08 PM
To: ceph-users@xxxxxxx
Subject:  Re: MDS hung in purge_stale_snap_data after populating cache

Frank,

Are you able to share an up-to-date ceph config dump and ceph daemon
mds.X perf dump | grep strays from the cluster?

We're just getting through our comically long Ceph outage, so I'd like
to be able to share the love here hahahaha

Regards,

Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



