Re: MDS hung in purge_stale_snap_data after populating cache

> which is 3758096384. I'm not even sure what the unit is, probably bytes?

As far as I understand, the unit is "list items", and they can have variable length. On our system, about 400G are allocated while the bufferlist fills up.
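
For reference, the hard-coded value works out to

  UINT_MAX - (UINT_MAX >> 3) = 2^32 - 2^29 = 7 * 2^29 = 3758096384,

which would be exactly 3.5 GiB if the unit were bytes.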

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, January 20, 2025 1:38 PM
To: Eugen Block
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: MDS hung in purge_stale_snap_data after populating cache

Hi Eugen,

I think the default is just a "reasonably large number" that's not too large. Looking at the code line you found:

  write_buf_throttle(cct, "write_buf_throttle", UINT_MAX - (UINT_MAX >> 3)),

my gut feeling is that rebuilding it with this change (factor 4):

  write_buf_throttle(cct, "write_buf_throttle", 4*( UINT_MAX - (UINT_MAX >> 3)) ),

will do the trick for us. The Throttle arguments are all int64, so there should be no overflow issues down the line, provided the multiplication itself is done in 64-bit arithmetic (as written, 4*( UINT_MAX - (UINT_MAX >> 3) ) is evaluated in 32-bit unsigned arithmetic and would wrap around, so the cast to int64 has to happen before the multiply). The factor 4 could also come from an ENV variable, so the MDS can be restarted with different scalings if required.
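
Purely as a sketch of what I have in mind (the helper function and the environment variable name MDS_WRITE_BUF_THROTTLE_SCALE are made up for illustration, this is not a tested patch):

  #include <climits>
  #include <cstdint>
  #include <cstdlib>
  #include <iostream>

  // Hypothetical helper: compute the write_buf_throttle limit, scaled by an
  // optional environment variable. Names are illustrative, not Ceph code.
  static int64_t scaled_write_buf_throttle_max() {
    // Base value as hard-coded in Journaler.h: UINT_MAX - (UINT_MAX >> 3).
    const int64_t base =
        static_cast<int64_t>(UINT_MAX) - static_cast<int64_t>(UINT_MAX >> 3);
    int64_t scale = 4;  // proposed default factor
    if (const char *env = std::getenv("MDS_WRITE_BUF_THROTTLE_SCALE")) {
      int64_t s = std::atoll(env);
      if (s > 0)
        scale = s;
    }
    // Multiply in int64_t: 4*(UINT_MAX - (UINT_MAX >> 3)) evaluated in
    // 32-bit unsigned arithmetic would wrap around to 2147483648.
    return scale * base;
  }

  int main() {
    // Prints 15032385536 by default, or base * $MDS_WRITE_BUF_THROTTLE_SCALE.
    std::cout << scaled_write_buf_throttle_max() << std::endl;
  }

The constructor initializer in Journaler.h would then read something like

  write_buf_throttle(cct, "write_buf_throttle", scaled_write_buf_throttle_max()),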

The class Throttle does have a reset_max method, but I'm not sure whether it is called anywhere, or whether it is possible to invoke it and change the max at runtime via something like "ceph daemon" or "ceph tell".
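
One way to check what is actually reachable at runtime is to list the admin socket commands of a running MDS and grep for anything throttle-related, for example:

  ceph daemon mds.<name> help | grep -i throttle

If nothing relevant shows up there, new plumbing for reset_max (or the rebuild above) is probably unavoidable.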

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, January 20, 2025 1:25 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: MDS hung in purge_stale_snap_data after populating cache

Hi,

right, I haven't found a parameter to tune this. Some throttling
parameters are tunable, though, for example the one I created
https://tracker.ceph.com/issues/66310 for, where I assume that the
default for mgr_mon_messages is too low (which shows up as
throttle-mgr_mon_messsages in the perf dump). But you can't make
everything configurable, I guess.
I have no idea if skipping is possible. I've also been looking at all
kinds of MDS-related config parameters, but it's not always clear what
they are for. So fingers crossed that you get out of this quickly.

Zitat von Frank Schilder <frans@xxxxxx>:

> Hi Eugen,
>
> yeah, I think you found it. That would also mean there is no
> parameter to scale it. I wonder if it is possible to skip the
> initial run of purge_stale_snap_data, accept a lot of trash in the
> cache, and use forward scrub to deal with the stray items.
>
> Well, we got in touch with some companies offering emergency support
> and hope this can be fixed with reasonable effort and time.
>
> Thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Monday, January 20, 2025 12:40 PM
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Re: MDS hung in purge_stale_snap_data
> after populating cache
>
> It looks like a hard-coded max for the throttle:
>
> write_buf_throttle(cct, "write_buf_throttle", UINT_MAX - (UINT_MAX >> 3)),
>
> which is 3758096384. I'm not even sure what the unit is, probably bytes?
>
> https://github.com/ceph/ceph/blob/v16.2.15/src/osdc/Journaler.h#L410
>
> Zitat von Frank Schilder <frans@xxxxxx>:
>
>> Hi Eugen,
>>
>> thanks for your input. I can't query the hung MDS, but the others
>> report the following:
>>
>> ceph tell mds.ceph-14 perf dump throttle-write_buf_throttle
>> {
>>     "throttle-write_buf_throttle": {
>>         "val": 0,
>>         "max": 3758096384,
>>         "get_started": 0,
>>         "get": 5199,
>>         "get_sum": 566691,
>>         "get_or_fail_fail": 0,
>>         "get_or_fail_success": 5199,
>>         "take": 0,
>>         "take_sum": 0,
>>         "put": 719,
>>         "put_sum": 566691,
>>         "wait": {
>>             "avgcount": 0,
>>             "sum": 0.000000000,
>>             "avgtime": 0.000000000
>>         }
>>     }
>> }
>>
>> You might be on to something; we are also trying to find out where
>> this limit comes from.
>>
>> Please keep us posted.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Eugen Block <eblock@xxxxxx>
>> Sent: Monday, January 20, 2025 11:12 AM
>> To: ceph-users@xxxxxxx
>> Subject:  Re: MDS hung in purge_stale_snap_data after
>> populating cache
>>
>> Hi Frank,
>>
>> are you able to query the daemon while it's trying to purge the snaps?
>>
>> pacific:~ # ceph tell mds.{your_daemon} perf dump
>> throttle-write_buf_throttle
>> ...
>>          "max": 3758096384,
>>
>> I don't know yet where that "max" setting comes from, but I'll keep looking.
>>
>> Zitat von Frank Schilder <frans@xxxxxx>:
>>
>>> Hi all,
>>>
>>> we tracked the deadlock down to line
>>> https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
>>> in Journaler::append_entry(bufferlist& bl):
>>>
>>>   // append
>>>   size_t delta = bl.length() + journal_stream.get_envelope_size();
>>>   // write_buf space is nearly full
>>>   if (!write_buf_throttle.get_or_fail(delta)) {
>>>     l.unlock();
>>>     ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
>>>     write_buf_throttle.get(delta);  // <<<<<<<<< The MDS is stuck here <<<<<<<<<
>>>     l.lock();
>>>   }
>>>   ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
>>>
>>> This is indicated by the last message in the log before the lock-up,
>>> which reads
>>>
>>>   mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
>>>
>>> and is generated by the line just above the call to
>>> write_buf_throttle.get(delta). All log messages before that one
>>> start with "write_buf_throttle get, delta", which means those calls
>>> did not enter the if-statement.
>>>
>>> The obvious question is: which parameter influences the maximum size
>>> of the member Journaler::write_buffer
>>> (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306)
>>> in the class definition of Journaler? Increasing this limit should
>>> get us past the deadlock.
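>>>
>>> To make the failure mode concrete, here is a minimal, self-contained
>>> sketch of the throttle semantics involved. This is a simplified
>>> stand-in, not the actual Ceph Throttle class: get_or_fail() refuses a
>>> request that would exceed the budget, while get() blocks until enough
>>> budget has been returned via put(); if nothing ever calls put(), the
>>> caller waits forever, which is what we see in append_entry.
>>>
>>>   // Simplified model of the blocking throttle -- illustration only.
>>>   #include <chrono>
>>>   #include <condition_variable>
>>>   #include <cstdint>
>>>   #include <iostream>
>>>   #include <mutex>
>>>   #include <thread>
>>>
>>>   class SimpleThrottle {
>>>     std::mutex m;
>>>     std::condition_variable cv;
>>>     int64_t max, cur = 0;
>>>   public:
>>>     explicit SimpleThrottle(int64_t max) : max(max) {}
>>>
>>>     // Non-blocking: take 'delta' only if it fits into the budget.
>>>     bool get_or_fail(int64_t delta) {
>>>       std::lock_guard<std::mutex> l(m);
>>>       if (cur + delta > max) return false;
>>>       cur += delta;
>>>       return true;
>>>     }
>>>
>>>     // Blocking: wait until 'delta' fits. This is where a caller hangs
>>>     // if the flusher never returns enough budget.
>>>     void get(int64_t delta) {
>>>       std::unique_lock<std::mutex> l(m);
>>>       cv.wait(l, [&] { return cur + delta <= max; });
>>>       cur += delta;
>>>     }
>>>
>>>     // Return budget, e.g. after journal data has been flushed out.
>>>     void put(int64_t delta) {
>>>       std::lock_guard<std::mutex> l(m);
>>>       cur -= delta;
>>>       cv.notify_all();
>>>     }
>>>   };
>>>
>>>   int main() {
>>>     SimpleThrottle t(200);                        // tiny budget for the demo
>>>     t.get_or_fail(150);                           // ok: 150/200 used
>>>     std::cout << std::boolalpha
>>>               << t.get_or_fail(100) << std::endl; // false: would exceed max
>>>     std::thread flusher([&] {                     // simulate the flusher
>>>       std::this_thread::sleep_for(std::chrono::milliseconds(100));
>>>       t.put(150);                                 // free budget again
>>>     });
>>>     t.get(100);                                   // blocks until put() above
>>>     std::cout << "unblocked after put()" << std::endl;
>>>     flusher.join();
>>>   }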
>>>
>>> Thanks for your help and best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Frank Schilder <frans@xxxxxx>
>>> Sent: Friday, January 17, 2025 3:02 PM
>>> To: Bailey Allison; ceph-users@xxxxxxx
>>> Subject:  Re: MDS hung in purge_stale_snap_data after
>>> populating cache
>>>
>>> Hi Bailey.
>>>
>>> ceph-14 (rank=0): num_stray=205532
>>> ceph-13 (rank=1): num_stray=4446
>>> ceph-21-mds (rank=2): num_stray=99446249
>>> ceph-23 (rank=3): num_stray=3412
>>> ceph-08 (rank=4): num_stray=1238
>>> ceph-15 (rank=5): num_stray=1486
>>> ceph-16 (rank=6): num_stray=5545
>>> ceph-11 (rank=7): num_stray=2995
>>>
>>> The stats for rank 2 are almost certainly out of date, though. The
>>> config dump is large, but since you asked, here it is. Only three
>>> settings are present for maintenance and workaround reasons:
>>> mds_beacon_grace, auth_service_ticket_ttl and
>>> mon_osd_report_timeout. The last one is for a different issue, though.
>>>
>>> WHO     MASK            LEVEL     OPTION                                           VALUE           RO
>>> global                  advanced  auth_service_ticket_ttl                          129600.000000
>>> global                  advanced  mds_beacon_grace                                 1209600.000000
>>> global                  advanced  mon_pool_quota_crit_threshold                    90
>>> global                  advanced  mon_pool_quota_warn_threshold                    70
>>> global                  dev       mon_warn_on_pool_pg_num_not_power_of_two         false
>>> global                  advanced  osd_map_message_max_bytes                        16384
>>> global                  advanced  osd_op_queue                                     wpq             *
>>> global                  advanced  osd_op_queue_cut_off                             high            *
>>> global                  advanced  osd_pool_default_pg_autoscale_mode               off
>>> mon                     advanced  mon_allow_pool_delete                            false
>>> mon                     advanced  mon_osd_down_out_subtree_limit                   host
>>> mon                     advanced  mon_osd_min_down_reporters                       3
>>> mon                     advanced  mon_osd_report_timeout                           86400
>>> mon                     advanced  mon_osd_reporter_subtree_level                   host
>>> mon                     advanced  mon_pool_quota_warn_threshold                    70
>>> mon                     advanced  mon_sync_max_payload_size                        4096
>>> mon                     advanced  mon_warn_on_insecure_global_id_reclaim           false
>>> mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed   false
>>> mgr                     advanced  mgr/balancer/active                              false
>>> mgr                     advanced  mgr/dashboard/ceph-01/server_addr                10.40.88.65     *
>>> mgr                     advanced  mgr/dashboard/ceph-02/server_addr                10.40.88.66     *
>>> mgr                     advanced  mgr/dashboard/ceph-03/server_addr                10.40.88.67     *
>>> mgr                     advanced  mgr/dashboard/server_port                        8443            *
>>> mgr                     advanced  mon_pg_warn_max_object_skew                      10.000000
>>> mgr                     basic     target_max_misplaced_ratio                       1.000000
>>> osd                     advanced  bluefs_buffered_io                               true
>>> osd                     advanced  bluestore_compression_min_blob_size_hdd          262144
>>> osd                     advanced  bluestore_compression_min_blob_size_ssd          65536
>>> osd                     advanced  bluestore_compression_mode                       aggressive
>>> osd     class:rbd_perf  advanced  bluestore_compression_mode                       none
>>> osd                     dev       bluestore_fsck_quick_fix_on_mount                false
>>> osd                     advanced  osd_deep_scrub_randomize_ratio                   0.000000
>>> osd     class:hdd       advanced  osd_delete_sleep                                 300.000000
>>> osd                     advanced  osd_fast_shutdown                                false
>>> osd     class:fs_meta   advanced  osd_max_backfills                                12
>>> osd     class:hdd       advanced  osd_max_backfills                                3
>>> osd     class:rbd_data  advanced  osd_max_backfills                                6
>>> osd     class:rbd_meta  advanced  osd_max_backfills                                12
>>> osd     class:rbd_perf  advanced  osd_max_backfills                                12
>>> osd     class:ssd       advanced  osd_max_backfills                                12
>>> osd                     advanced  osd_max_backfills                                3
>>> osd     class:fs_meta   dev       osd_memory_cache_min                             2147483648
>>> osd     class:hdd       dev       osd_memory_cache_min                             1073741824
>>> osd     class:rbd_data  dev       osd_memory_cache_min                             2147483648
>>> osd     class:rbd_meta  dev       osd_memory_cache_min                             1073741824
>>> osd     class:rbd_perf  dev       osd_memory_cache_min                             2147483648
>>> osd     class:ssd       dev       osd_memory_cache_min                             2147483648
>>> osd                     dev       osd_memory_cache_min                             805306368
>>> osd     class:fs_meta   basic     osd_memory_target                                6442450944
>>> osd     class:hdd       basic     osd_memory_target                                3221225472
>>> osd     class:rbd_data  basic     osd_memory_target                                4294967296
>>> osd     class:rbd_meta  basic     osd_memory_target                                2147483648
>>> osd     class:rbd_perf  basic     osd_memory_target                                6442450944
>>> osd     class:ssd       basic     osd_memory_target                                4294967296
>>> osd                     basic     osd_memory_target                                2147483648
>>> osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard                     4               *
>>> osd     class:hdd       advanced  osd_recovery_delay_start                         600.000000
>>> osd     class:rbd_data  advanced  osd_recovery_delay_start                         300.000000
>>> osd     class:rbd_perf  advanced  osd_recovery_delay_start                         300.000000
>>> osd     class:fs_meta   advanced  osd_recovery_max_active                          32
>>> osd     class:hdd       advanced  osd_recovery_max_active                          8
>>> osd     class:rbd_data  advanced  osd_recovery_max_active                          16
>>> osd     class:rbd_meta  advanced  osd_recovery_max_active                          32
>>> osd     class:rbd_perf  advanced  osd_recovery_max_active                          16
>>> osd     class:ssd       advanced  osd_recovery_max_active                          32
>>> osd                     advanced  osd_recovery_max_active                          8
>>> osd     class:fs_meta   advanced  osd_recovery_sleep                               0.002500
>>> osd     class:hdd       advanced  osd_recovery_sleep                               0.050000
>>> osd     class:rbd_data  advanced  osd_recovery_sleep                               0.025000
>>> osd     class:rbd_meta  advanced  osd_recovery_sleep                               0.002500
>>> osd     class:rbd_perf  advanced  osd_recovery_sleep                               0.010000
>>> osd     class:ssd       advanced  osd_recovery_sleep                               0.002500
>>> osd                     advanced  osd_recovery_sleep                               0.050000
>>> osd     class:hdd       dev       osd_scrub_backoff_ratio                          0.330000
>>> osd     class:hdd       advanced  osd_scrub_during_recovery                        true
>>> osd                     advanced  osd_scrub_load_threshold                         0.750000
>>> osd     class:fs_meta   advanced  osd_snap_trim_sleep                              0.050000
>>> osd     class:hdd       advanced  osd_snap_trim_sleep                              2.000000
>>> osd     class:rbd_data  advanced  osd_snap_trim_sleep                              0.100000
>>> mds                     basic     client_cache_size                                8192
>>> mds                     advanced  defer_client_eviction_on_laggy_osds              false
>>> mds                     advanced  mds_bal_fragment_size_max                        100000
>>> mds                     basic     mds_cache_memory_limit                           25769803776
>>> mds                     advanced  mds_cache_reservation                            0.500000
>>> mds                     advanced  mds_max_caps_per_client                          65536
>>> mds                     advanced  mds_min_caps_per_client                          4096
>>> mds                     advanced  mds_recall_max_caps                              32768
>>> mds                     advanced  mds_session_blocklist_on_timeout                 false
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Bailey Allison <ballison@xxxxxxxxxxxx>
>>> Sent: Thursday, January 16, 2025 10:08 PM
>>> To: ceph-users@xxxxxxx
>>> Subject:  Re: MDS hung in purge_stale_snap_data after
>>> populating cache
>>>
>>> Frank,
>>>
Are you able to share an up-to-date ceph config dump and a ceph daemon
mds.X perf dump | grep strays from the cluster?
>>>
We're just getting through our comically long Ceph outage, so I'd like
to be able to share the love here hahahaha
>>>
>>> Regards,
>>>
>>> Bailey Allison
>>> Service Team Lead
>>> 45Drives, Ltd.
>>> 866-594-7199 x868
>>
>>



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



