Re: MDS hung in purge_stale_snap_data after populating cache

Hi Eugen,

thanks for your input. I can't query the hung MDS, but the other MDS daemons report the following:

ceph tell mds.ceph-14 perf dump throttle-write_buf_throttle
{
    "throttle-write_buf_throttle": {
        "val": 0,
        "max": 3758096384,
        "get_started": 0,
        "get": 5199,
        "get_sum": 566691,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 5199,
        "take": 0,
        "take_sum": 0,
        "put": 719,
        "put_sum": 566691,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    }
}

You might be on to something; we are also trying to find out where this limit comes from.
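
For what it's worth, the reported maximum of 3758096384 bytes decomposes exactly as UINT_MAX - (UINT_MAX >> 3). If the write_buf_throttle really is constructed with that expression in the Journaler constructor (an assumption on my side, I have not yet verified the exact line in the pacific sources), the limit would be hard-coded rather than driven by any config option. A quick sanity check of the arithmetic, nothing more:

    // Arithmetic check only: does the observed throttle maximum match
    // UINT_MAX - (UINT_MAX >> 3)?  Assumes 32-bit unsigned int, i.e.
    // UINT_MAX == 4294967295.  Whether the Journaler constructor really
    // uses this expression still needs to be confirmed in the sources.
    #include <cassert>
    #include <climits>
    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t observed_max = 3758096384ULL;  // from the perf dump above
      const uint64_t candidate =
          static_cast<uint64_t>(UINT_MAX) - (static_cast<uint64_t>(UINT_MAX) >> 3);
      assert(candidate == observed_max);            // 4294967295 - 536870911
      std::cout << "max = " << candidate << " bytes (~3.5 GiB)" << std::endl;
      return 0;
    }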

Please keep us posted.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, January 20, 2025 11:12 AM
To: ceph-users@xxxxxxx
Subject:  Re: MDS hung in purge_stale_snap_data after populating cache

Hi Frank,

are you able to query the daemon while it's trying to purge the snaps?

pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
...
         "max": 3758096384,

I don't know yet where that "max" setting comes from, but I'll keep looking.

Quoting Frank Schilder <frans@xxxxxx>:

> Hi all,
>
> we tracked the deadlock down to line
> https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
> in Journaler::append_entry(bufferlist& bl):
>
>   // append
>   size_t delta = bl.length() + journal_stream.get_envelope_size();
>   // write_buf space is nearly full
>   if (!write_buf_throttle.get_or_fail(delta)) {
>     l.unlock();
>     ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
>     write_buf_throttle.get(delta);  //<<<<<<<<< The MDS is stuck here <<<<<<<<<
>     l.lock();
>   }
>   ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
>
> This is indicated by the last message in the log before the lockup,
> which reads
>
>   mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
>
> and is generated by the line directly above the call to
> write_buf_throttle.get(delta). All log messages before this one
> start with "write_buf_throttle get, delta", which means those calls
> did not enter the if-statement.
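>
> To make the failure mode concrete, here is a much simplified model of
> the throttle semantics as I read them from the snippet above. This is
> an illustration only, not Ceph's actual Throttle class: get_or_fail()
> takes the bytes if they fit into the budget and fails otherwise, the
> blocking get() waits until enough bytes have been returned via put(),
> and put() is only called once buffered journal data has been flushed.
> If nothing ever flushes, the blocking get() never returns, which
> matches what we observe.
>
>   // Simplified stand-in for a byte-count throttle (illustration only).
>   #include <condition_variable>
>   #include <cstdint>
>   #include <mutex>
>
>   class SimpleThrottle {
>     std::mutex m;
>     std::condition_variable cv;
>     uint64_t used = 0;
>     const uint64_t max;
>   public:
>     explicit SimpleThrottle(uint64_t max_bytes) : max(max_bytes) {}
>
>     // Non-blocking: take 'delta' bytes if they fit, otherwise fail.
>     bool get_or_fail(uint64_t delta) {
>       std::lock_guard<std::mutex> l(m);
>       if (used + delta > max) return false;
>       used += delta;
>       return true;
>     }
>
>     // Blocking: wait until 'delta' bytes fit. If nobody ever calls
>     // put(), this waits forever -- the situation the MDS appears to be in.
>     void get(uint64_t delta) {
>       std::unique_lock<std::mutex> l(m);
>       cv.wait(l, [&] { return used + delta <= max; });
>       used += delta;
>     }
>
>     // Called when buffered data has been flushed, freeing budget.
>     void put(uint64_t delta) {
>       std::lock_guard<std::mutex> l(m);
>       used -= delta;
>       cv.notify_all();
>     }
>   };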
>
> The obvious question is: which parameter influences the maximum size
> of the variable Journaler::write_buffer
> (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306)
> in the Journaler class definition? Increasing this limit should get
> us past the deadlock.
>
> Thanks for your help and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Friday, January 17, 2025 3:02 PM
> To: Bailey Allison; ceph-users@xxxxxxx
> Subject:  Re: MDS hung in purge_stale_snap_data after populating cache
>
> Hi Bailey.
>
> ceph-14 (rank=0): num_stray=205532
> ceph-13 (rank=1): num_stray=4446
> ceph-21-mds (rank=2): num_stray=99446249
> ceph-23 (rank=3): num_stray=3412
> ceph-08 (rank=4): num_stray=1238
> ceph-15 (rank=5): num_stray=1486
> ceph-16 (rank=6): num_stray=5545
> ceph-11 (rank=7): num_stray=2995
>
> The stats for rank 2 are almost certainly out of date, though. The
> config dump is large, but since you asked, here it is. Only 3 of these
> settings are present for maintenance and workaround reasons:
> mds_beacon_grace, auth_service_ticket_ttl and mon_osd_report_timeout.
> The last one is for a different issue, though.
>
> WHO     MASK            LEVEL     OPTION                                          VALUE           RO
> global                  advanced  auth_service_ticket_ttl                         129600.000000
> global                  advanced  mds_beacon_grace                                1209600.000000
> global                  advanced  mon_pool_quota_crit_threshold                   90
> global                  advanced  mon_pool_quota_warn_threshold                   70
> global                  dev       mon_warn_on_pool_pg_num_not_power_of_two        false
> global                  advanced  osd_map_message_max_bytes                       16384
> global                  advanced  osd_op_queue                                    wpq             *
> global                  advanced  osd_op_queue_cut_off                            high            *
> global                  advanced  osd_pool_default_pg_autoscale_mode              off
> mon                     advanced  mon_allow_pool_delete                           false
> mon                     advanced  mon_osd_down_out_subtree_limit                  host
> mon                     advanced  mon_osd_min_down_reporters                      3
> mon                     advanced  mon_osd_report_timeout                          86400
> mon                     advanced  mon_osd_reporter_subtree_level                  host
> mon                     advanced  mon_pool_quota_warn_threshold                   70
> mon                     advanced  mon_sync_max_payload_size                       4096
> mon                     advanced  mon_warn_on_insecure_global_id_reclaim          false
> mon                     advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
> mgr                     advanced  mgr/balancer/active                             false
> mgr                     advanced  mgr/dashboard/ceph-01/server_addr               10.40.88.65     *
> mgr                     advanced  mgr/dashboard/ceph-02/server_addr               10.40.88.66     *
> mgr                     advanced  mgr/dashboard/ceph-03/server_addr               10.40.88.67     *
> mgr                     advanced  mgr/dashboard/server_port                       8443            *
> mgr                     advanced  mon_pg_warn_max_object_skew                     10.000000
> mgr                     basic     target_max_misplaced_ratio                      1.000000
> osd                     advanced  bluefs_buffered_io                              true
> osd                     advanced  bluestore_compression_min_blob_size_hdd         262144
> osd                     advanced  bluestore_compression_min_blob_size_ssd         65536
> osd                     advanced  bluestore_compression_mode                      aggressive
> osd     class:rbd_perf  advanced  bluestore_compression_mode                      none
> osd                     dev       bluestore_fsck_quick_fix_on_mount               false
> osd                     advanced  osd_deep_scrub_randomize_ratio                  0.000000
> osd     class:hdd       advanced  osd_delete_sleep                                300.000000
> osd                     advanced  osd_fast_shutdown                               false
> osd     class:fs_meta   advanced  osd_max_backfills                               12
> osd     class:hdd       advanced  osd_max_backfills                               3
> osd     class:rbd_data  advanced  osd_max_backfills                               6
> osd     class:rbd_meta  advanced  osd_max_backfills                               12
> osd     class:rbd_perf  advanced  osd_max_backfills                               12
> osd     class:ssd       advanced  osd_max_backfills                               12
> osd                     advanced  osd_max_backfills                               3
> osd     class:fs_meta   dev       osd_memory_cache_min                            2147483648
> osd     class:hdd       dev       osd_memory_cache_min                            1073741824
> osd     class:rbd_data  dev       osd_memory_cache_min                            2147483648
> osd     class:rbd_meta  dev       osd_memory_cache_min                            1073741824
> osd     class:rbd_perf  dev       osd_memory_cache_min                            2147483648
> osd     class:ssd       dev       osd_memory_cache_min                            2147483648
> osd                     dev       osd_memory_cache_min                            805306368
> osd     class:fs_meta   basic     osd_memory_target                               6442450944
> osd     class:hdd       basic     osd_memory_target                               3221225472
> osd     class:rbd_data  basic     osd_memory_target                               4294967296
> osd     class:rbd_meta  basic     osd_memory_target                               2147483648
> osd     class:rbd_perf  basic     osd_memory_target                               6442450944
> osd     class:ssd       basic     osd_memory_target                               4294967296
> osd                     basic     osd_memory_target                               2147483648
> osd     class:rbd_perf  advanced  osd_op_num_threads_per_shard                    4               *
> osd     class:hdd       advanced  osd_recovery_delay_start                        600.000000
> osd     class:rbd_data  advanced  osd_recovery_delay_start                        300.000000
> osd     class:rbd_perf  advanced  osd_recovery_delay_start                        300.000000
> osd     class:fs_meta   advanced  osd_recovery_max_active                         32
> osd     class:hdd       advanced  osd_recovery_max_active                         8
> osd     class:rbd_data  advanced  osd_recovery_max_active                         16
> osd     class:rbd_meta  advanced  osd_recovery_max_active                         32
> osd     class:rbd_perf  advanced  osd_recovery_max_active                         16
> osd     class:ssd       advanced  osd_recovery_max_active                         32
> osd                     advanced  osd_recovery_max_active                         8
> osd     class:fs_meta   advanced  osd_recovery_sleep                              0.002500
> osd     class:hdd       advanced  osd_recovery_sleep                              0.050000
> osd     class:rbd_data  advanced  osd_recovery_sleep                              0.025000
> osd     class:rbd_meta  advanced  osd_recovery_sleep                              0.002500
> osd     class:rbd_perf  advanced  osd_recovery_sleep                              0.010000
> osd     class:ssd       advanced  osd_recovery_sleep                              0.002500
> osd                     advanced  osd_recovery_sleep                              0.050000
> osd     class:hdd       dev       osd_scrub_backoff_ratio                         0.330000
> osd     class:hdd       advanced  osd_scrub_during_recovery                       true
> osd                     advanced  osd_scrub_load_threshold                        0.750000
> osd     class:fs_meta   advanced  osd_snap_trim_sleep                             0.050000
> osd     class:hdd       advanced  osd_snap_trim_sleep                             2.000000
> osd     class:rbd_data  advanced  osd_snap_trim_sleep                             0.100000
> mds                     basic     client_cache_size                               8192
> mds                     advanced  defer_client_eviction_on_laggy_osds             false
> mds                     advanced  mds_bal_fragment_size_max                       100000
> mds                     basic     mds_cache_memory_limit                          25769803776
> mds                     advanced  mds_cache_reservation                           0.500000
> mds                     advanced  mds_max_caps_per_client                         65536
> mds                     advanced  mds_min_caps_per_client                         4096
> mds                     advanced  mds_recall_max_caps                             32768
> mds                     advanced  mds_session_blocklist_on_timeout                false
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Bailey Allison <ballison@xxxxxxxxxxxx>
> Sent: Thursday, January 16, 2025 10:08 PM
> To: ceph-users@xxxxxxx
> Subject:  Re: MDS hung in purge_stale_snap_data after populating cache
>
> Frank,
>
> Are you able to share an up-to-date ceph config dump and the output of
> ceph daemon mds.X perf dump | grep strays from the cluster?
>
> We're just getting through our comically long ceph outage, so I'd like
> to be able to share the love here hahahaha
>
> Regards,
>
> Bailey Allison
> Service Team Lead
> 45Drives, Ltd.
> 866-594-7199 x868
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



