Hi Frank,
are you able to query the daemon while it's trying to purge the snaps?
pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
...
"max": 3758096384,
I don't know yet where that "max" setting comes from, but I'll keep looking.
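(For scale, that "max" is exactly 3.5 GiB.) In case it helps, here is a rough sketch for watching the throttle fill up while the purge runs, built around the same command; the daemon name ceph-21-mds is only a placeholder for whatever your rank-2 MDS is called:

# sample the write_buf_throttle counters while the purge queue works
# (placeholder daemon name; adjust to your cluster)
while true; do
  date
  ceph tell mds.ceph-21-mds perf dump throttle-write_buf_throttle
  sleep 10
done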
Quoting Frank Schilder <frans@xxxxxx>:
Hi all,
we tracked the deadlock down to line
https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
in Journaler::append_entry(bufferlist& bl):
  // append
  size_t delta = bl.length() + journal_stream.get_envelope_size();
  // write_buf space is nearly full
  if (!write_buf_throttle.get_or_fail(delta)) {
    l.unlock();
    ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
    write_buf_throttle.get(delta);  // <<<<<<<<< The MDS is stuck here <<<<<<<<<
    l.lock();
  }
  ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
This is indicated by the last message in the log before the lock-up,
which reads
mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
and is generated by the ldout line just above the call to
write_buf_throttle.get(delta). All log messages before that start with
"write_buf_throttle get, delta", which means those calls did not enter
the if-statement.
The obvious question is: which parameter influences the maximum size of
the variable Journaler::write_buffer
(https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306) in the
class definition of Journaler? Increasing this limit should get us past
the deadlock.
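For anyone following along, the blocking pattern looks roughly like the simplified stand-in below. This is not the actual ceph Throttle implementation, just a minimal sketch of the behaviour the log shows: get_or_fail() refuses once the budget ("max") is used up, get() then blocks until someone returns budget via put(), and if the purge queue never drains its buffer, nothing ever does.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Simplified stand-in for the write_buf_throttle behaviour (illustration only).
class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  uint64_t max;
  uint64_t cur = 0;
public:
  explicit SimpleThrottle(uint64_t max) : max(max) {}

  // non-blocking: take 'delta' of budget if it fits, otherwise report failure
  bool get_or_fail(uint64_t delta) {
    std::lock_guard<std::mutex> l(m);
    if (cur + delta > max)
      return false;
    cur += delta;
    return true;
  }

  // blocking: wait until 'delta' fits -- the analogue of where the MDS is
  // stuck, because the flusher that should call put() never frees any budget
  void get(uint64_t delta) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return cur + delta <= max; });
    cur += delta;
  }

  // called after buffered data has been written out, returning budget
  void put(uint64_t delta) {
    {
      std::lock_guard<std::mutex> l(m);
      cur -= delta;
    }
    cv.notify_all();
  }
};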
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Friday, January 17, 2025 3:02 PM
To: Bailey Allison; ceph-users@xxxxxxx
Subject: Re: MDS hung in purge_stale_snap_data after populating cache
Hi Bailey.
ceph-14 (rank=0): num_stray=205532
ceph-13 (rank=1): num_stray=4446
ceph-21-mds (rank=2): num_stray=99446249
ceph-23 (rank=3): num_stray=3412
ceph-08 (rank=4): num_stray=1238
ceph-15 (rank=5): num_stray=1486
ceph-16 (rank=6): num_stray=5545
ceph-11 (rank=7): num_stray=2995
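Something like the loop below reproduces this list (a sketch only; daemon names taken from above, and the counter name num_strays is what the perf dump reports in the mds_cache section):

# per-rank stray counts via the MDS perf counters
for name in ceph-14 ceph-13 ceph-21-mds ceph-23 ceph-08 ceph-15 ceph-16 ceph-11; do
  echo -n "$name: "
  ceph tell mds.$name perf dump | grep '"num_strays"'
done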
The stats for rank 2 are almost certainly out of date, though. The
config dump is large, but since you asked, here it is. Only three of
the settings are in there for maintenance and workaround reasons:
mds_beacon_grace, auth_service_ticket_ttl and mon_osd_report_timeout;
the last one is for a different issue, though.
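To pull just those three out of a dump later, a simple grep does it (sketch):

ceph config dump | grep -E 'mds_beacon_grace|auth_service_ticket_ttl|mon_osd_report_timeout'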
WHO     MASK            LEVEL     OPTION                                          VALUE           RO
global                  advanced  auth_service_ticket_ttl                         129600.000000
global                  advanced  mds_beacon_grace                                1209600.000000
global                  advanced  mon_pool_quota_crit_threshold                   90
global                  advanced  mon_pool_quota_warn_threshold                   70
global                  dev       mon_warn_on_pool_pg_num_not_power_of_two        false
global                  advanced  osd_map_message_max_bytes                       16384
global                  advanced  osd_op_queue                                    wpq             *
global                  advanced  osd_op_queue_cut_off                            high            *
global                  advanced  osd_pool_default_pg_autoscale_mode              off
mon                     advanced  mon_allow_pool_delete                           false
mon                     advanced  mon_osd_down_out_subtree_limit                  host
mon                     advanced  mon_osd_min_down_reporters                      3
mon                     advanced  mon_osd_report_timeout                          86400
mon                     advanced  mon_osd_reporter_subtree_level                  host
mon                     advanced  mon_pool_quota_warn_threshold                   70
mon                     advanced  mon_sync_max_payload_size                       4096
mon                     advanced  mon_warn_on_insecure_global_id_reclaim          false
mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
mgr                     advanced  mgr/balancer/active                             false
mgr                     advanced  mgr/dashboard/ceph-01/server_addr               10.40.88.65     *
mgr                     advanced  mgr/dashboard/ceph-02/server_addr               10.40.88.66     *
mgr                     advanced  mgr/dashboard/ceph-03/server_addr               10.40.88.67     *
mgr                     advanced  mgr/dashboard/server_port                       8443            *
mgr                     advanced  mon_pg_warn_max_object_skew                     10.000000
mgr                     basic     target_max_misplaced_ratio                      1.000000
osd                     advanced  bluefs_buffered_io                              true
osd                     advanced  bluestore_compression_min_blob_size_hdd         262144
osd                     advanced  bluestore_compression_min_blob_size_ssd         65536
osd                     advanced  bluestore_compression_mode                      aggressive
osd     class:rbd_perf  advanced  bluestore_compression_mode                      none
osd                     dev       bluestore_fsck_quick_fix_on_mount               false
osd                     advanced  osd_deep_scrub_randomize_ratio                  0.000000
osd     class:hdd       advanced  osd_delete_sleep                                300.000000
osd                     advanced  osd_fast_shutdown                               false
osd     class:fs_meta   advanced  osd_max_backfills                               12
osd     class:hdd       advanced  osd_max_backfills                               3
osd     class:rbd_data  advanced  osd_max_backfills                               6
osd     class:rbd_meta  advanced  osd_max_backfills                               12
osd     class:rbd_perf  advanced  osd_max_backfills                               12
osd     class:ssd       advanced  osd_max_backfills                               12
osd                     advanced  osd_max_backfills                               3
osd     class:fs_meta   dev       osd_memory_cache_min                            2147483648
osd     class:hdd       dev       osd_memory_cache_min                            1073741824
osd     class:rbd_data  dev       osd_memory_cache_min                            2147483648
osd     class:rbd_meta  dev       osd_memory_cache_min                            1073741824
osd     class:rbd_perf  dev       osd_memory_cache_min                            2147483648
osd     class:ssd       dev       osd_memory_cache_min                            2147483648
osd                     dev       osd_memory_cache_min                            805306368
osd     class:fs_meta   basic     osd_memory_target                               6442450944
osd     class:hdd       basic     osd_memory_target                               3221225472
osd     class:rbd_data  basic     osd_memory_target                               4294967296
osd     class:rbd_meta  basic     osd_memory_target                               2147483648
osd     class:rbd_perf  basic     osd_memory_target                               6442450944
osd     class:ssd       basic     osd_memory_target                               4294967296
osd                     basic     osd_memory_target                               2147483648
osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard                    4               *
osd     class:hdd       advanced  osd_recovery_delay_start                        600.000000
osd     class:rbd_data  advanced  osd_recovery_delay_start                        300.000000
osd     class:rbd_perf  advanced  osd_recovery_delay_start                        300.000000
osd     class:fs_meta   advanced  osd_recovery_max_active                         32
osd     class:hdd       advanced  osd_recovery_max_active                         8
osd     class:rbd_data  advanced  osd_recovery_max_active                         16
osd     class:rbd_meta  advanced  osd_recovery_max_active                         32
osd     class:rbd_perf  advanced  osd_recovery_max_active                         16
osd     class:ssd       advanced  osd_recovery_max_active                         32
osd                     advanced  osd_recovery_max_active                         8
osd     class:fs_meta   advanced  osd_recovery_sleep                              0.002500
osd     class:hdd       advanced  osd_recovery_sleep                              0.050000
osd     class:rbd_data  advanced  osd_recovery_sleep                              0.025000
osd     class:rbd_meta  advanced  osd_recovery_sleep                              0.002500
osd     class:rbd_perf  advanced  osd_recovery_sleep                              0.010000
osd     class:ssd       advanced  osd_recovery_sleep                              0.002500
osd                     advanced  osd_recovery_sleep                              0.050000
osd     class:hdd       dev       osd_scrub_backoff_ratio                         0.330000
osd     class:hdd       advanced  osd_scrub_during_recovery                       true
osd                     advanced  osd_scrub_load_threshold                        0.750000
osd     class:fs_meta   advanced  osd_snap_trim_sleep                             0.050000
osd     class:hdd       advanced  osd_snap_trim_sleep                             2.000000
osd     class:rbd_data  advanced  osd_snap_trim_sleep                             0.100000
mds                     basic     client_cache_size                               8192
mds                     advanced  defer_client_eviction_on_laggy_osds             false
mds                     advanced  mds_bal_fragment_size_max                       100000
mds                     basic     mds_cache_memory_limit                          25769803776
mds                     advanced  mds_cache_reservation                           0.500000
mds                     advanced  mds_max_caps_per_client                         65536
mds                     advanced  mds_min_caps_per_client                         4096
mds                     advanced  mds_recall_max_caps                             32768
mds                     advanced  mds_session_blocklist_on_timeout                false
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Bailey Allison <ballison@xxxxxxxxxxxx>
Sent: Thursday, January 16, 2025 10:08 PM
To: ceph-users@xxxxxxx
Subject: Re: MDS hung in purge_stale_snap_data after populating cache
Frank,
Are you able to share an up-to-date ceph config dump and a
ceph daemon mds.X perf dump | grep strays from the cluster?
We're just getting through our comically long ceph outage, so I'd like
to be able to share the love here hahahaha
Regards,
Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx