Re: MDS hung in purge_stale_snap_data after populating cache

Hi,

right, I haven't found a parameter to tune for this either. Some throttling parameters are tunable, though. For example, when I created https://tracker.ceph.com/issues/66310 I assumed that the default for mgr_mon_messages is too low (it shows up as throttle-mgr_mon_messages in the perf dump). But you can't make everything configurable, I guess. I have no idea whether skipping is possible; I've also been looking through all kinds of MDS-related config parameters, but it's not always clear what they are for. So fingers crossed that you get out of this quickly.

Quoting Frank Schilder <frans@xxxxxx>:

Hi Eugen,

yeah, I think you found it. That would also mean there is no parameter to scale it. I wonder if it is possible to skip the initial run of purge_stale_snap_data, accept a lot of trash in the cache, and use a forward scrub to deal with the stray items.

Well, we got in touch with some companies offering emergency support and hope this can be fixed with reasonable effort and time.

Thanks for your help!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, January 20, 2025 12:40 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: MDS hung in purge_stale_snap_data after populating cache

It looks like a hard-coded max for the throttle:

write_buf_throttle(cct, "write_buf_throttle", UINT_MAX - (UINT_MAX >> 3)),

which is 3758096384. I'm not even sure what the unit is, probably bytes?

https://github.com/ceph/ceph/blob/v16.2.15/src/osdc/Journaler.h#L410
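
Just to spell out the arithmetic behind that constant (a standalone check, not Ceph code; the GiB interpretation assumes the throttle counts bytes, which the delta calculation in Journaler::append_entry suggests but I haven't verified):

  #include <climits>
  #include <cstdio>

  int main() {
    // 7/8 of UINT_MAX, i.e. the hard-coded throttle maximum quoted above
    unsigned int max = UINT_MAX - (UINT_MAX >> 3);
    std::printf("%u\n", max);                                     // prints 3758096384
    std::printf("%.1f GiB\n", max / (1024.0 * 1024.0 * 1024.0));  // prints 3.5 GiB
    return 0;
  }

So the limit works out to exactly 3.5 GiB, if the unit really is bytes.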

Quoting Frank Schilder <frans@xxxxxx>:

Hi Eugen,

thanks for your input. I can't query the hung MDS, but the others report this:

ceph tell mds.ceph-14 perf dump throttle-write_buf_throttle
{
    "throttle-write_buf_throttle": {
        "val": 0,
        "max": 3758096384,
        "get_started": 0,
        "get": 5199,
        "get_sum": 566691,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 5199,
        "take": 0,
        "take_sum": 0,
        "put": 719,
        "put_sum": 566691,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    }
}

You might be on to something; we are also trying to find where this limit comes from.

Please keep us posted.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, January 20, 2025 11:12 AM
To: ceph-users@xxxxxxx
Subject:  Re: MDS hung in purge_stale_snap_data after
populating cache

Hi Frank,

are you able to query the daemon while it's trying to purge the snaps?

pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
...
         "max": 3758096384,

I don't know yet where that "max" setting comes from, but I'll keep looking.

Quoting Frank Schilder <frans@xxxxxx>:

Hi all,

we tracked the deadlock down to line
https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
in Journaler::append_entry(bufferlist& bl):

  // append
  size_t delta = bl.length() + journal_stream.get_envelope_size();
  // write_buf space is nearly full
  if (!write_buf_throttle.get_or_fail(delta)) {
    l.unlock();
    ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
    write_buf_throttle.get(delta);  // <<<<<<<<< The MDS is stuck here <<<<<<<<<
    l.lock();
  }
  ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;

This is indicated by the last message in the log before the lock-up, which reads

  mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101

and is generated by the line just above the call to write_buf_throttle.get(delta). All log messages before that start with "write_buf_throttle get, delta", which means those calls did not enter the if-statement.
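
For anyone following along, the pattern here is: get_or_fail() refuses to take delta once the budget is exhausted, and the fallback get() blocks until enough put() calls release space again. Below is a minimal toy model of that behaviour (my own sketch for illustration, not Ceph's actual Throttle class); if nothing ever frees space with put(), the blocking get() never returns, which matches the hang we see.

  #include <condition_variable>
  #include <cstdint>
  #include <cstdio>
  #include <mutex>

  // Toy byte-budget throttle (illustration only, not Ceph's Throttle).
  class ToyThrottle {
    std::mutex m;
    std::condition_variable cv;
    uint64_t max_, cur_ = 0;
  public:
    explicit ToyThrottle(uint64_t max) : max_(max) {}

    // Non-blocking: take 'delta' only if it still fits under the budget.
    bool get_or_fail(uint64_t delta) {
      std::lock_guard<std::mutex> l(m);
      if (cur_ + delta > max_) return false;
      cur_ += delta;
      return true;
    }

    // Blocking: wait until 'delta' fits, then take it. With no put(),
    // this waits forever; that is the situation the stuck MDS appears to be in.
    void get(uint64_t delta) {
      std::unique_lock<std::mutex> l(m);
      cv.wait(l, [&] { return cur_ + delta <= max_; });
      cur_ += delta;
    }

    // Release previously taken budget and wake any waiters.
    void put(uint64_t delta) {
      std::lock_guard<std::mutex> l(m);
      cur_ -= delta;
      cv.notify_all();
    }
  };

  int main() {
    ToyThrottle t(100);                      // pretend the budget is 100 units
    std::printf("%d\n", t.get_or_fail(60));  // 1: fits
    std::printf("%d\n", t.get_or_fail(60));  // 0: would exceed the budget
    // t.get(60);                            // would block here until someone calls t.put()
    t.put(60);
    std::printf("%d\n", t.get_or_fail(60));  // 1: fits again after the put()
    return 0;
  }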

The obvious question is: which parameter influences the maximum size of the member Journaler::write_buffer (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306) in the Journaler class definition? Increasing this limit should get us past the deadlock.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Friday, January 17, 2025 3:02 PM
To: Bailey Allison; ceph-users@xxxxxxx
Subject:  Re: MDS hung in purge_stale_snap_data after
populating cache

Hi Bailey.

ceph-14 (rank=0): num_stray=205532
ceph-13 (rank=1): num_stray=4446
ceph-21-mds (rank=2): num_stray=99446249
ceph-23 (rank=3): num_stray=3412
ceph-08 (rank=4): num_stray=1238
ceph-15 (rank=5): num_stray=1486
ceph-16 (rank=6): num_stray=5545
ceph-11 (rank=7): num_stray=2995

The stats for rank 2 are almost certainly out of date, though. The config dump is large, but since you asked, here it is. Only three of the settings are present for maintenance and workaround reasons: mds_beacon_grace, auth_service_ticket_ttl and mon_osd_report_timeout. The last one is for a different issue, though.

WHO     MASK            LEVEL     OPTION                                          VALUE           RO
global                  advanced  auth_service_ticket_ttl                         129600.000000
global                  advanced  mds_beacon_grace                                1209600.000000
global                  advanced  mon_pool_quota_crit_threshold                   90
global                  advanced  mon_pool_quota_warn_threshold                   70
global                  dev       mon_warn_on_pool_pg_num_not_power_of_two        false
global                  advanced  osd_map_message_max_bytes                       16384
global                  advanced  osd_op_queue                                    wpq             *
global                  advanced  osd_op_queue_cut_off                            high            *
global                  advanced  osd_pool_default_pg_autoscale_mode              off
mon                     advanced  mon_allow_pool_delete                           false
mon                     advanced  mon_osd_down_out_subtree_limit                  host
mon                     advanced  mon_osd_min_down_reporters                      3
mon                     advanced  mon_osd_report_timeout                          86400
mon                     advanced  mon_osd_reporter_subtree_level                  host
mon                     advanced  mon_pool_quota_warn_threshold                   70
mon                     advanced  mon_sync_max_payload_size                       4096
mon                     advanced  mon_warn_on_insecure_global_id_reclaim          false
mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
mgr                     advanced  mgr/balancer/active                             false
mgr                     advanced  mgr/dashboard/ceph-01/server_addr               10.40.88.65     *
mgr                     advanced  mgr/dashboard/ceph-02/server_addr               10.40.88.66     *
mgr                     advanced  mgr/dashboard/ceph-03/server_addr               10.40.88.67     *
mgr                     advanced  mgr/dashboard/server_port                       8443            *
mgr                     advanced  mon_pg_warn_max_object_skew                     10.000000
mgr                     basic     target_max_misplaced_ratio                      1.000000
osd                     advanced  bluefs_buffered_io                              true
osd                     advanced  bluestore_compression_min_blob_size_hdd         262144
osd                     advanced  bluestore_compression_min_blob_size_ssd         65536
osd                     advanced  bluestore_compression_mode                      aggressive
osd     class:rbd_perf  advanced  bluestore_compression_mode                      none
osd                     dev       bluestore_fsck_quick_fix_on_mount               false
osd                     advanced  osd_deep_scrub_randomize_ratio                  0.000000
osd     class:hdd       advanced  osd_delete_sleep                                300.000000
osd                     advanced  osd_fast_shutdown                               false
osd     class:fs_meta   advanced  osd_max_backfills                               12
osd     class:hdd       advanced  osd_max_backfills                               3
osd     class:rbd_data  advanced  osd_max_backfills                               6
osd     class:rbd_meta  advanced  osd_max_backfills                               12
osd     class:rbd_perf  advanced  osd_max_backfills                               12
osd     class:ssd       advanced  osd_max_backfills                               12
osd                     advanced  osd_max_backfills                               3
osd     class:fs_meta   dev       osd_memory_cache_min                            2147483648
osd     class:hdd       dev       osd_memory_cache_min                            1073741824
osd     class:rbd_data  dev       osd_memory_cache_min                            2147483648
osd     class:rbd_meta  dev       osd_memory_cache_min                            1073741824
osd     class:rbd_perf  dev       osd_memory_cache_min                            2147483648
osd     class:ssd       dev       osd_memory_cache_min                            2147483648
osd                     dev       osd_memory_cache_min                            805306368
osd     class:fs_meta   basic     osd_memory_target                               6442450944
osd     class:hdd       basic     osd_memory_target                               3221225472
osd     class:rbd_data  basic     osd_memory_target                               4294967296
osd     class:rbd_meta  basic     osd_memory_target                               2147483648
osd     class:rbd_perf  basic     osd_memory_target                               6442450944
osd     class:ssd       basic     osd_memory_target                               4294967296
osd                     basic     osd_memory_target                               2147483648
osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard                    4               *
osd     class:hdd       advanced  osd_recovery_delay_start                        600.000000
osd     class:rbd_data  advanced  osd_recovery_delay_start                        300.000000
osd     class:rbd_perf  advanced  osd_recovery_delay_start                        300.000000
osd     class:fs_meta   advanced  osd_recovery_max_active                         32
osd     class:hdd       advanced  osd_recovery_max_active                         8
osd     class:rbd_data  advanced  osd_recovery_max_active                         16
osd     class:rbd_meta  advanced  osd_recovery_max_active                         32
osd     class:rbd_perf  advanced  osd_recovery_max_active                         16
osd     class:ssd       advanced  osd_recovery_max_active                         32
osd                     advanced  osd_recovery_max_active                         8
osd     class:fs_meta   advanced  osd_recovery_sleep                              0.002500
osd     class:hdd       advanced  osd_recovery_sleep                              0.050000
osd     class:rbd_data  advanced  osd_recovery_sleep                              0.025000
osd     class:rbd_meta  advanced  osd_recovery_sleep                              0.002500
osd     class:rbd_perf  advanced  osd_recovery_sleep                              0.010000
osd     class:ssd       advanced  osd_recovery_sleep                              0.002500
osd                     advanced  osd_recovery_sleep                              0.050000
osd     class:hdd       dev       osd_scrub_backoff_ratio                         0.330000
osd     class:hdd       advanced  osd_scrub_during_recovery                       true
osd                     advanced  osd_scrub_load_threshold                        0.750000
osd     class:fs_meta   advanced  osd_snap_trim_sleep                             0.050000
osd     class:hdd       advanced  osd_snap_trim_sleep                             2.000000
osd     class:rbd_data  advanced  osd_snap_trim_sleep                             0.100000
mds                     basic     client_cache_size                               8192
mds                     advanced  defer_client_eviction_on_laggy_osds             false
mds                     advanced  mds_bal_fragment_size_max                       100000
mds                     basic     mds_cache_memory_limit                          25769803776
mds                     advanced  mds_cache_reservation                           0.500000
mds                     advanced  mds_max_caps_per_client                         65536
mds                     advanced  mds_min_caps_per_client                         4096
mds                     advanced  mds_recall_max_caps                             32768
mds                     advanced  mds_session_blocklist_on_timeout                false

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Bailey Allison <ballison@xxxxxxxxxxxx>
Sent: Thursday, January 16, 2025 10:08 PM
To: ceph-users@xxxxxxx
Subject:  Re: MDS hung in purge_stale_snap_data after
populating cache

Frank,

Are you able to share an up-to-date ceph config dump and ceph daemon
mds.X perf dump | grep strays from the cluster?

We're just getting through our comically long Ceph outage, so I'd like
to be able to share the love here hahahaha

Regards,

Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



