Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

Özkan Göksu <ozkangksu@xxxxxxxxx> · Sat, 27 Jan 2024 04:08:23 +0300

Wow, I noticed something!

To keep GPU training allocations from exhausting RAM, I'm using a 2TB
Samsung 870 EVO as swap.

As you can see below, swap usage was 18Gi even though the server was idle,
so the Ceph client may be seeing extra latency because of the swap usage.

root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        34Gi        27Gi       0.0Ki       639Mi        27Gi
Swap:          1.8Ti        18Gi       1.8Ti
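
One way to test the swap hypothesis is to check which processes actually own the swapped-out pages. The sketch below is only an illustration and was not part of the original troubleshooting: it reads the `VmSwap` field from `/proc/<pid>/status` (Linux only), and the function names are made up for this example.

```python
import glob


def parse_vmswap_kib(status_text: str) -> int:
    """Return the VmSwap value (in kB) from the text of /proc/<pid>/status.

    Kernel threads have no VmSwap line; they are reported as 0.
    """
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return 0


def parse_name(status_text: str) -> str:
    """Return the process name from the 'Name:' line, or '?' if missing."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return "?"


def top_swap_users(limit: int = 10):
    """List (swap_kib, pid, name) for the processes using the most swap."""
    rows = []
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as fh:
                text = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        pid = int(path.split("/")[2])
        rows.append((parse_vmswap_kib(text), pid, parse_name(text)))
    return sorted(rows, reverse=True)[:limit]


for swap_kib, pid, name in top_swap_users():
    print(f"{swap_kib / 1024:10.1f} MiB  pid={pid:<8} {name}")
```

If the top entries are leftover training processes rather than anything touching CephFS, swap is probably contributing less to the client latency than it first appears.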

I decided to play around with kernel parameters to keep the Ceph client out of swap:

kernel.shmmax = 60654764851     # Maximum size of a single shared memory segment, in bytes
kernel.shmall = 16453658        # Total shared memory allowed system-wide, in pages
vm.nr_hugepages = 4096          # Reserve 4096 static huge pages (explicit hugetlb pool; this does not tune THP defrag)
vm.swappiness = 0               # Minimize swapping
vm.min_free_kbytes = 1048576    # Keep at least 1 GiB free (about 1.6% of the 62Gi of RAM)
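
As a rough sanity check, the memory implied by these values can be worked out directly. The sketch below assumes 2 MiB huge pages and 4 KiB base pages (the usual x86-64 defaults; neither is stated in this thread) and takes the 62Gi RAM figure from the `free -h` output. The helper names are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def hugepage_reservation_bytes(nr_hugepages: int, hugepage_size: int = 2 * MIB) -> int:
    """Memory set aside by vm.nr_hugepages (assumed 2 MiB per huge page)."""
    return nr_hugepages * hugepage_size


def min_free_fraction(min_free_kbytes: int, ram_bytes: int) -> float:
    """vm.min_free_kbytes expressed as a fraction of physical RAM."""
    return min_free_kbytes * KIB / ram_bytes


def shmall_bytes(shmall_pages: int, page_size: int = 4 * KIB) -> int:
    """kernel.shmall is counted in pages; convert it to bytes (assumed 4 KiB pages)."""
    return shmall_pages * page_size


ram = 62 * GIB  # "Mem: 62Gi" from free -h

print(f"huge page pool : {hugepage_reservation_bytes(4096) / GIB:.1f} GiB")
print(f"min_free_kbytes: {100 * min_free_fraction(1048576, ram):.2f}% of RAM")
print(f"shmall         : {shmall_bytes(16453658) / GIB:.1f} GiB")
```

Under those assumptions the huge page pool alone pins 8 GiB, which is likely counted as used memory by `free` even on an idle machine.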

I rebooted the server, and after the reboot swap usage is 0, as expected.

To try it out, I ran iobench.sh
(https://github.com/ozkangoksu/benchmark/blob/main/iobench.sh).
This client only has a 1G NIC. As you can see below, at every block size
other than 4K the Ceph client can saturate the NIC.

root@bmw-m4:~# nicstat -MUz 1
    Time      Int   rMbps   wMbps   rPk/s   wPk/s    rAvs    wAvs %rUtil %wUtil
01:04:48   ens1f0   936.9   92.90 91196.8 60126.3  1346.6   202.5   98.2   9.74

root@bmw-m4:/mounts/ud-data/benchuser1/96f13211-c37f-42db-8d05-f3255a05129e/testdir# bash iobench.sh
Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27395msec); 0 zone resets
 BS=128K  write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27462msec); 0 zone resets
 BS=64K   write: IOPS=1758, BW=110MiB/s (115MB/s)(3072MiB/27948msec); 0 zone resets
 BS=32K   write: IOPS=3542, BW=111MiB/s (116MB/s)(3072MiB/27748msec); 0 zone resets
 BS=16K   write: IOPS=6839, BW=107MiB/s (112MB/s)(3072MiB/28747msec); 0 zone resets
 BS=4K    write: IOPS=8473, BW=33.1MiB/s (34.7MB/s)(3072MiB/92813msec); 0 zone resets
Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27386msec)
 BS=128K  read: IOPS=895, BW=112MiB/s (117MB/s)(3072MiB/27431msec)
 BS=64K   read: IOPS=1788, BW=112MiB/s (117MB/s)(3072MiB/27486msec)
 BS=32K   read: IOPS=3561, BW=111MiB/s (117MB/s)(3072MiB/27603msec)
 BS=16K   read: IOPS=6924, BW=108MiB/s (113MB/s)(3072MiB/28392msec)
 BS=4K    read: IOPS=21.3k, BW=83.3MiB/s (87.3MB/s)(3072MiB/36894msec)
Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27406msec); 0 zone resets
 BS=128K  write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27466msec); 0 zone resets
 BS=64K   write: IOPS=1781, BW=111MiB/s (117MB/s)(3072MiB/27591msec); 0 zone resets
 BS=32K   write: IOPS=3545, BW=111MiB/s (116MB/s)(3072MiB/27729msec); 0 zone resets
 BS=16K   write: IOPS=6823, BW=107MiB/s (112MB/s)(3072MiB/28814msec); 0 zone resets
 BS=4K    write: IOPS=12.7k, BW=49.8MiB/s (52.2MB/s)(3072MiB/61694msec); 0 zone resets
Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1M    read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27388msec)
 BS=128K  read: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27479msec)
 BS=64K   read: IOPS=1784, BW=112MiB/s (117MB/s)(3072MiB/27547msec)
 BS=32K   read: IOPS=3559, BW=111MiB/s (117MB/s)(3072MiB/27614msec)
 BS=16K   read: IOPS=7047, BW=110MiB/s (115MB/s)(3072MiB/27897msec)
 BS=4K    read: IOPS=26.9k, BW=105MiB/s (110MB/s)(3072MiB/29199msec)
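
To put these numbers next to the NIC limit: a 1 Gbit/s link carries at most 125 MB/s, or about 119 MiB/s, before any protocol overhead. The sketch below recomputes throughput from IOPS and block size for a few of the fio lines and expresses it as a share of that raw line rate. It is plain arithmetic on the figures above, with illustrative helper names.

```python
MIB = 1024 * 1024


def line_rate_mib_s(gbit_per_s: float) -> float:
    """Raw link speed in MiB/s, ignoring Ethernet/IP/TCP overhead."""
    return gbit_per_s * 1e9 / 8 / MIB


def bandwidth_mib_s(iops: float, block_size_bytes: int) -> float:
    """Throughput implied by an IOPS figure at a fixed block size."""
    return iops * block_size_bytes / MIB


LINK = line_rate_mib_s(1.0)  # 1G NIC

# (label, IOPS, block size in bytes) taken from the fio output above
SAMPLES = [
    ("seq write 1M", 112, 1024 * 1024),
    ("seq write 4K", 8473, 4096),
    ("rand read 4K", 26900, 4096),
]

for label, iops, bs in SAMPLES:
    bw = bandwidth_mib_s(iops, bs)
    print(f"{label:13s} {bw:6.1f} MiB/s = {100 * bw / LINK:5.1f}% of line rate")
```

At 112 MiB/s the large-block runs sit at roughly 94% of the raw link speed, so on this client the network, not the cluster, is the ceiling for everything except the 4K runs.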

root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702# cat metrics
item                               total
------------------------------------------
opened files  / total inodes       0 / 109
pinned i_caps / total inodes       109 / 109
opened inodes / total inodes       0 / 109

item          total       avg_lat(us)     min_lat(us)     max_lat(us)     stdev(us)
-----------------------------------------------------------------------------------
read          2316289     13904           221             8827984         760
write         2317824     21152           2975            9243821         2365
metadata      170         5944            225             202505          24314

item          total       avg_sz(bytes)   min_sz(bytes)   max_sz(bytes)   total_sz(bytes)
----------------------------------------------------------------------------------------
read          2316289     16688           4096            1048576         38654712361
write         2317824     19457           4096            4194304         45097156608

item          total           miss            hit
-------------------------------------------------
d_lease       112             3               858
caps          109             58              6963547
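
The size table can be cross-checked with simple arithmetic: the reported avg_sz should equal total_sz divided by the operation count. The sketch below does that with the numbers copied from the metrics output above (the dictionary and function names are illustrative).

```python
GIB = 2**30

# (operation count, total bytes, reported avg_sz) from the metrics output above
SIZE_METRICS = {
    "read": (2316289, 38654712361, 16688),
    "write": (2317824, 45097156608, 19457),
}


def average_size(total_bytes: int, ops: int) -> float:
    """Mean bytes per operation."""
    return total_bytes / ops


for name, (ops, total_bytes, reported_avg) in SIZE_METRICS.items():
    avg = average_size(total_bytes, ops)
    print(
        f"{name:5s} avg={avg:9.1f} B (reported {reported_avg}), "
        f"total={total_bytes / GIB:.1f} GiB"
    )
```

The computed averages round to the reported values, and the totals come to about 36 GiB read and 42 GiB written since this client session started.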

root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702# free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        11Gi        50Gi       3.0Mi       1.0Gi        49Gi
Swap:          1.8Ti          0B       1.8Ti

I'm starting to feel we are getting closer :)

On Sat, 27 Jan 2024 at 02:58, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:

> I started to investigate my clients.
>
> for example:
>
> root@ud-01:~# ceph health detail
> HEALTH_WARN 1 clients failing to respond to cache pressure
> [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
>     mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to respond to cache pressure client_id: 1275577
>
> root@ud-01:~# ceph fs status
> ud-data - 86 clients
> =======
> RANK  STATE           MDS              ACTIVITY     DNS    INOS   DIRS   CAPS
>  0    active  ud-data.ud-02.xcoojt  Reqs:   34 /s  2926k  2827k   155k   1157k
>
>
> ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | "clientid: \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases), request_load_avg: \(.request_load_avg), num_completed_requests: \(.num_completed_requests), num_completed_flushes: \(.num_completed_flushes)"' | sort -n -t: -k3
>
> clientid: *1275577*= num_caps: 12312, num_leases: 0, request_load_avg: 0, num_completed_requests: 0, num_completed_flushes: 1
> clientid: 1275571= num_caps: 16307, num_leases: 1, request_load_avg: 2101, num_completed_requests: 0, num_completed_flushes: 3
> clientid: 1282130= num_caps: 26337, num_leases: 3, request_load_avg: 116, num_completed_requests: 0, num_completed_flushes: 1
> clientid: 1191789= num_caps: 32784, num_leases: 0, request_load_avg: 1846, num_completed_requests: 0, num_completed_flushes: 0
> clientid: 1275535= num_caps: 79825, num_leases: 2, request_load_avg: 133, num_completed_requests: 8, num_completed_flushes: 8
> clientid: 1282142= num_caps: 80581, num_leases: 6, request_load_avg: 125, num_completed_requests: 2, num_completed_flushes: 6
> clientid: 1275532= num_caps: 87836, num_leases: 3, request_load_avg: 190, num_completed_requests: 2, num_completed_flushes: 6
> clientid: 1275547= num_caps: 94129, num_leases: 4, request_load_avg: 149, num_completed_requests: 2, num_completed_flushes: 4
> clientid: 1275553= num_caps: 96460, num_leases: 4, request_load_avg: 155, num_completed_requests: 2, num_completed_flushes: 8
> clientid: 1282139= num_caps: 108882, num_leases: 25, request_load_avg: 99, num_completed_requests: 2, num_completed_flushes: 4
> clientid: 1275538= num_caps: 437162, num_leases: 0, request_load_avg: 101, num_completed_requests: 2, num_completed_flushes: 0
>
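
For anyone who prefers a script to the jq pipeline, the same per-client summary can be sketched in Python. This is only an illustration: it assumes the JSON printed by `session ls` is a list of objects carrying the fields used in the jq filter above (`id`, `num_caps`, `num_leases`, `request_load_avg`), and the sample data below is trimmed from the output shown here rather than real command output.

```python
import json


def summarize_sessions(session_ls_json: str):
    """Return (id, num_caps, num_leases, request_load_avg) tuples, fewest caps first."""
    sessions = json.loads(session_ls_json)
    rows = [
        (
            s["id"],
            s["num_caps"],
            s.get("num_leases", 0),
            s.get("request_load_avg", 0),
        )
        for s in sessions
    ]
    return sorted(rows, key=lambda row: row[1])


# Hand-written sample with three of the clients listed above.
SAMPLE = json.dumps([
    {"id": 1275538, "num_caps": 437162, "num_leases": 0, "request_load_avg": 101},
    {"id": 1275577, "num_caps": 12312, "num_leases": 0, "request_load_avg": 0},
    {"id": 1275571, "num_caps": 16307, "num_leases": 1, "request_load_avg": 2101},
])

for client_id, caps, leases, load in summarize_sessions(SAMPLE):
    print(f"client {client_id}: caps={caps} leases={leases} load_avg={load}")
```

In practice the JSON string would come from the output of the `ceph tell ... session ls` command instead of the sample.
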
> --------------------------------------
>
> *MY CLIENT:*
>
> The client is actually idle, so there is no reason for it to fail at all.
>
> root@bmw-m4:~# apt list --installed |grep ceph
> ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed]
> libcephfs2/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed,automatic]
> python3-ceph-argparse/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed,automatic]
> python3-ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 all [installed,automatic]
> python3-cephfs/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed,automatic]
>
> Let's check metrics and stats:
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# cat metrics
> item                               total
> ------------------------------------------
> opened files  / total inodes       2 / 12312
> pinned i_caps / total inodes       12312 / 12312
> opened inodes / total inodes       1 / 12312
>
> item          total       avg_lat(us)     min_lat(us)     max_lat(us)     stdev(us)
> -----------------------------------------------------------------------------------
> read          22283       44409           430             1804853         15619
> write         112702      419725          3658            8879541         6008
> metadata      353322      5712            154             917903          5357
>
> item          total       avg_sz(bytes)   min_sz(bytes)   max_sz(bytes)   total_sz(bytes)
> ----------------------------------------------------------------------------------------
> read          22283       1701940         1               4194304         37924318602
> write         112702      246211          1               4194304         27748469309
>
> item          total           miss            hit
> -------------------------------------------------
> d_lease       62              63627           28564698
> caps          12312           36658           44568261
>
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
> cat bdi/stats
> BdiWriteback:                0 kB
> BdiReclaimable:            800 kB
> BdiDirtyThresh:              0 kB
> DirtyThresh:           5795340 kB
> BackgroundThresh:      2894132 kB
> BdiDirtied:           27316320 kB
> BdiWritten:           27316320 kB
> BdiWriteBandwidth:        1472 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
>
>
> Last 3 days dmesg output:
>
> [Wed Jan 24 16:45:13 2024] xfsettingsd[653036]: segfault at 18 ip 00007fbd12f5d337 sp 00007ffd254332a0 error 4 in libxklavier.so.16.4.0[7fbd12f4d000+19000]
> [Wed Jan 24 16:45:13 2024] Code: 4c 89 e7 e8 0b 56 ff ff 48 89 03 48 8b 5c 24 30 e9 d1 fd ff ff e8 b9 5b ff ff 66 0f 1f 84 00 00 00 00 00 41 54 55 48 89 f5 53 <48> 8b 42 18 48 89 d1 49 89 fc 48 89 d3 48 89 fa 48 89 ef 48 8b b0
> [Thu Jan 25 06:51:31 2024] NVRM: GPU at PCI:0000:81:00: GPU-02efbb18-c9e4-3a16-d615-598959520b99
> [Thu Jan 25 06:51:31 2024] NVRM: GPU Board Serial Number: 1321421015411
> [Thu Jan 25 06:51:31 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=683281, name=python, Ch 00000008
> [Thu Jan 25 06:56:49 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=683377, name=python, Ch 00000018
> [Thu Jan 25 20:14:13 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=696062, name=python, Ch 00000008
> [Fri Jan 26 04:05:40 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=700166, name=python, Ch 00000008
> [Fri Jan 26 05:05:12 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=700320, name=python, Ch 00000008
> [Fri Jan 26 05:44:50 2024] NVRM: GPU at PCI:0000:82:00: GPU-3af62a2c-e7eb-a7d5-c073-22f06dc7065f
> [Fri Jan 26 05:44:50 2024] NVRM: GPU Board Serial Number: 1321421010400
> [Fri Jan 26 05:44:50 2024] NVRM: Xid (PCI:0000:82:00): 43, pid=700757, name=python, Ch 00000018
> [Fri Jan 26 05:56:02 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=701096, name=python, Ch 00000028
> [Fri Jan 26 06:34:20 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=701226, name=python, Ch 00000038
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# free -h
>                total        used        free      shared  buff/cache   available
> Mem:            62Gi        34Gi        27Gi       0.0Ki       639Mi        27Gi
> Swap:          1.8Ti        18Gi       1.8Ti
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
> cat /proc/vmstat
> nr_free_pages 7231171
> nr_zone_inactive_anon 7924766
> nr_zone_active_anon 525190
> nr_zone_inactive_file 44029
> nr_zone_active_file 55966
> nr_zone_unevictable 13042
> nr_zone_write_pending 3
> nr_mlock 13042
> nr_bounce 0
> nr_zspages 0
> nr_free_cma 0
> numa_hit 6701928919
> numa_miss 312628341
> numa_foreign 312628341
> numa_interleave 31538
> numa_local 6701864751
> numa_other 312692567
> nr_inactive_anon 7924766
> nr_active_anon 525190
> nr_inactive_file 44029
> nr_active_file 55966
> nr_unevictable 13042
> nr_slab_reclaimable 61076
> nr_slab_unreclaimable 63509
> nr_isolated_anon 0
> nr_isolated_file 0
> workingset_nodes 3934
> workingset_refault_anon 30325493
> workingset_refault_file 14593094
> workingset_activate_anon 5376050
> workingset_activate_file 3250679
> workingset_restore_anon 292317
> workingset_restore_file 1166673
> workingset_nodereclaim 488665
> nr_anon_pages 8451968
> nr_mapped 35731
> nr_file_pages 138824
> nr_dirty 3
> nr_writeback 0
> nr_writeback_temp 0
> nr_shmem 242
> nr_shmem_hugepages 0
> nr_shmem_pmdmapped 0
> nr_file_hugepages 0
> nr_file_pmdmapped 0
> nr_anon_transparent_hugepages 3588
> nr_vmscan_write 33746573
> nr_vmscan_immediate_reclaim 160
> nr_dirtied 48165341
> nr_written 80207893
> nr_kernel_misc_reclaimable 0
> nr_foll_pin_acquired 174002
> nr_foll_pin_released 174002
> nr_kernel_stack 60032
> nr_page_table_pages 46041
> nr_swapcached 36166
> nr_dirty_threshold 1448010
> nr_dirty_background_threshold 723121
> pgpgin 129904699
> pgpgout 299261581
> pswpin 30325493
> pswpout 45158221
> pgalloc_dma 1024
> pgalloc_dma32 57788566
> pgalloc_normal 6956384725
> pgalloc_movable 0
> allocstall_dma 0
> allocstall_dma32 0
> allocstall_normal 188
> allocstall_movable 63024
> pgskip_dma 0
> pgskip_dma32 0
> pgskip_normal 0
> pgskip_movable 0
> pgfree 7222273815
> pgactivate 1371753960
> pgdeactivate 18329381
> pglazyfree 10
> pgfault 7795723861
> pgmajfault 4600007
> pglazyfreed 0
> pgrefill 18575528
> pgreuse 81910383
> pgsteal_kswapd 980532060
> pgsteal_direct 38942066
> pgdemote_kswapd 0
> pgdemote_direct 0
> pgscan_kswapd 1135293298
> pgscan_direct 58883653
> pgscan_direct_throttle 15
> pgscan_anon 220939938
> pgscan_file 973237013
> pgsteal_anon 46538607
> pgsteal_file 972935519
> zone_reclaim_failed 0
> pginodesteal 0
> slabs_scanned 25879882
> kswapd_inodesteal 2179831
> kswapd_low_wmark_hit_quickly 152797
> kswapd_high_wmark_hit_quickly 32025
> pageoutrun 204447
> pgrotated 44963935
> drop_pagecache 0
> drop_slab 0
> oom_kill 0
> numa_pte_updates 2724410955
> numa_huge_pte_updates 1695890
> numa_hint_faults 1739823254
> numa_hint_faults_local 1222358972
> numa_pages_migrated 312611639
> pgmigrate_success 510846802
> pgmigrate_fail 875493
> thp_migration_success 156413
> thp_migration_fail 2
> thp_migration_split 0
> compact_migrate_scanned 1274073243
> compact_free_scanned 8430842597
> compact_isolated 400278352
> compact_stall 145300
> compact_fail 128562
> compact_success 16738
> compact_daemon_wake 170247
> compact_daemon_migrate_scanned 35486283
> compact_daemon_free_scanned 369870412
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> unevictable_pgs_culled 2774290
> unevictable_pgs_scanned 0
> unevictable_pgs_rescued 2675031
> unevictable_pgs_mlocked 2813622
> unevictable_pgs_munlocked 2674972
> unevictable_pgs_cleared 84231
> unevictable_pgs_stranded 84225
> thp_fault_alloc 416468
> thp_fault_fallback 19181
> thp_fault_fallback_charge 0
> thp_collapse_alloc 17931
> thp_collapse_alloc_failed 76
> thp_file_alloc 0
> thp_file_fallback 0
> thp_file_fallback_charge 0
> thp_file_mapped 0
> thp_split_page 2
> thp_split_page_failed 0
> thp_deferred_split_page 66
> thp_split_pmd 22451
> thp_split_pud 0
> thp_zero_page_alloc 1
> thp_zero_page_alloc_failed 0
> thp_swpout 22332
> thp_swpout_fallback 0
> balloon_inflate 0
> balloon_deflate 0
> balloon_migrate 0
> swap_ra 25777929
> swap_ra_hit 25658825
> direct_map_level2_splits 1249
> direct_map_level3_splits 49
> nr_unstable 0
>
>
>
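
The `pswpin` and `pswpout` counters in the `/proc/vmstat` dump show how much swap traffic this host has seen. The sketch below converts them to GiB, assuming 4 KiB pages (the x86-64 default, not stated in the thread). These counters are cumulative since boot, so they describe history rather than the current rate. The function names are illustrative.

```python
PAGE_SIZE = 4096  # bytes; assumed base page size


def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines, tolerating leading '> ' quote markers."""
    counters = {}
    for line in text.splitlines():
        parts = line.lstrip("> ").split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def pages_to_gib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


# The two swap counters copied from the dump above.
SAMPLE = """\
> pswpin 30325493
> pswpout 45158221
"""

counters = parse_vmstat(SAMPLE)
print(f"swapped in : {pages_to_gib(counters['pswpin']):.1f} GiB")
print(f"swapped out: {pages_to_gib(counters['pswpout']):.1f} GiB")
```

Under the 4 KiB assumption that is roughly 116 GiB swapped in and 172 GiB swapped out since boot. On a live system the same parser can be fed the contents of `/proc/vmstat` twice, a few seconds apart, to get a rate.
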
> On Sat, 27 Jan 2024 at 02:36, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
>
>> Hello Frank.
>>
>> I have 84 clients (high-end servers) running Ubuntu 20.04.5 LTS with
>> kernel Linux 5.4.0-125-generic.
>>
>> My cluster runs 17.2.6 Quincy.
>> Some client nodes have "ceph-common/stable,now 17.2.7-1focal" installed. I
>> wonder whether using newer-version clients is the main problem?
>> Maybe I have a communication error. For example, I hit this problem and I
>> cannot collect client stats:
>> https://github.com/ceph/ceph/pull/52127/files
>>
>> Best regards.
>>
>>
>>
>> On Fri, 26 Jan 2024 at 14:53, Frank Schilder <frans@xxxxxx> wrote:
>>
>>> Hi, this message is one of those that are often spurious. I don't recall
>>> in which thread/PR/tracker I read it, but the story was something like this:
>>>
>>> If an MDS gets under memory pressure it will request dentry items back
>>> from *all* clients, not just the active ones or the ones holding many of
>>> them. If you have a client that's below the min-threshold for dentries (it's
>>> one of the client/mds tuning options), it will not respond. This client
>>> will be flagged as not responding, which is a false positive.
>>>
>>> I believe the devs are working on a fix to get rid of these spurious
>>> warnings. There is a "bug/feature" in the MDS that does not clear this
>>> warning flag for inactive clients. Hence, the message hangs and never
>>> disappears. I usually clear it with a "echo 3 > /proc/sys/vm/drop_caches"
>>> on the client. However, except for being annoying in the dashboard, it has
>>> no performance or otherwise negative impact.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Eugen Block <eblock@xxxxxx>
>>> Sent: Friday, January 26, 2024 10:05 AM
>>> To: Özkan Göksu
>>> Cc: ceph-users@xxxxxxx
>>> Subject:  Re: 1 clients failing to respond to cache pressure
>>> (quincy:17.2.6)
>>>
>>> Performance for small files is more about IOPS rather than throughput,
>>> and the IOPS in your fio tests look okay to me. What you could try is
>>> to split the PGs to get around 150 or 200 PGs per OSD. You're
>>> currently at around 60 according to the ceph osd df output. Before you
>>> do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data |
>>> head'? I don't need the whole output, just to see how many objects
>>> each PG has. We had a case once where that helped, but it was an older
>>> cluster and the pool was backed by HDDs and separate rocksDB on SSDs.
>>> So this might not be the solution here, but it could improve things as
>>> well.
>>>
>>>
>>> Quoting Özkan Göksu <ozkangksu@xxxxxxxxx>:
>>>
>>> > Every user has one subvolume and I only have 1 pool.
>>> > At the beginning we were using each subvolume for the LDAP home directory
>>> > plus user data.
>>> > When a user logged in to any Docker container on any host, it used the
>>> > cluster for home, and for user-related data we had a second directory in
>>> > the same subvolume.
>>> > From time to time users experienced a very slow home environment, and
>>> > after a month it became almost impossible to use home. VNC sessions
>>> > became unresponsive, slow, etc.
>>> >
>>> > 2 weeks ago I had to migrate home to ZFS storage, and now the overall
>>> > performance is better with only user_data and no home.
>>> > But the performance is still not as good as I expected because of the
>>> > problems related to MDS.
>>> > The usage is low but allocation is high and CPU usage is high. You saw
>>> > the IO op/s: it's nothing, but allocation is high.
>>> >
>>> > I developed a fio benchmark script and ran it on 4 test servers at
>>> > the same time. The results are below:
>>> > Script:
>>> > https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh
>>> >
>>> > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
>>> > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
>>> > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
>>> > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt
>>> >
>>> > While running the benchmark, I took sample values for each type of
>>> > iobench run.
>>> >
>>> > Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>>> >     client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
>>> >     client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
>>> >     client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr
>>> >
>>> > Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>>> >     client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
>>> >     client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr
>>> >
>>> > Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>>> >     client:   63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
>>> >     client:   14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
>>> >     client:   6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr
>>> >
>>> > Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>>> >     client:   317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
>>> >     client:   2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr
>>> >     client:   4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr
>>> >     client:   2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr
>>> >
>>> > It seems I only have problems with the small block sizes (4K, 8K, 16K).
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, 25 Jan 2024 at 19:06, Eugen Block <eblock@xxxxxx> wrote:
>>> >
>>> >> I understand that your MDS shows a high CPU usage, but other than that
>>> >> what is your performance issue? Do users complain? Do some operations
>>> >> take longer than expected? Are OSDs saturated during those phases?
>>> >> Because the cache pressure messages don’t necessarily mean that users
>>> >> will notice.
>>> >> MDS daemons are single-threaded so that might be a bottleneck. In that
>>> >> case multi-active mds might help, which you already tried and
>>> >> experienced OOM killers. But you might have to disable the mds
>>> >> balancer as someone else mentioned. And then you could think about
>>> >> pinning, is it possible to split the CephFS into multiple
>>> >> subdirectories and pin them to different ranks?
>>> >> But first I’d still like to know what the performance issue really is.
>>> >>
>>> >> Quoting Özkan Göksu <ozkangksu@xxxxxxxxx>:
>>> >>
>>> >> > I will try my best to explain my situation.
>>> >> >
>>> >> > I don't have a separate MDS server. I have 5 identical nodes; 3 of them
>>> >> > are mons, and I use the other 2 as active and standby MDS. (Currently I
>>> >> > have leftovers from max_mds 4.)
>>> >> >
>>> >> > root@ud-01:~# ceph -s
>>> >> >   cluster:
>>> >> >     id:     e42fd4b0-313b-11ee-9a00-31da71873773
>>> >> >     health: HEALTH_WARN
>>> >> >             1 clients failing to respond to cache pressure
>>> >> >
>>> >> >   services:
>>> >> >     mon: 3 daemons, quorum ud-01,ud-02,ud-03 (age 9d)
>>> >> >     mgr: ud-01.qycnol(active, since 8d), standbys: ud-02.tfhqfd
>>> >> >     mds: 1/1 daemons up, 4 standby
>>> >> >     osd: 80 osds: 80 up (since 9d), 80 in (since 5M)
>>> >> >
>>> >> >   data:
>>> >> >     volumes: 1/1 healthy
>>> >> >     pools:   3 pools, 2305 pgs
>>> >> >     objects: 106.58M objects, 25 TiB
>>> >> >     usage:   45 TiB used, 101 TiB / 146 TiB avail
>>> >> >     pgs:     2303 active+clean
>>> >> >              2    active+clean+scrubbing+deep
>>> >> >
>>> >> >   io:
>>> >> >     client:   16 MiB/s rd, 3.4 MiB/s wr, 77 op/s rd, 23 op/s wr
>>> >> >
>>> >> > ------------------------------
>>> >> > root@ud-01:~# ceph fs status
>>> >> > ud-data - 84 clients
>>> >> > =======
>>> >> > RANK  STATE           MDS              ACTIVITY     DNS    INOS   DIRS   CAPS
>>> >> >  0    active  ud-data.ud-02.xcoojt  Reqs:   40 /s  2579k  2578k   169k   3048k
>>> >> >         POOL           TYPE     USED  AVAIL
>>> >> > cephfs.ud-data.meta  metadata   136G  44.9T
>>> >> > cephfs.ud-data.data    data    44.3T  44.9T
>>> >> >
>>> >> > ------------------------------
>>> >> > root@ud-01:~# ceph health detail
>>> >> > HEALTH_WARN 1 clients failing to respond to cache pressure
>>> >> > [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
>>> >> >     mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to respond to cache pressure client_id: 1275577
>>> >> >
>>> >> > ------------------------------
>>> >> > When I check the failing client with session ls I see only
>>> >> > "num_caps: 12298"
>>> >> >
>>> >> > ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | "clientid: \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases), request_load_avg: \(.request_load_avg), num_completed_requests: \(.num_completed_requests), num_completed_flushes: \(.num_completed_flushes)"' | sort -n -t: -k3
>>> >> >
>>> >> > clientid: 1275577= num_caps: 12298, num_leases: 0, request_load_avg: 0, num_completed_requests: 0, num_completed_flushes: 1
>>> >> > clientid: 1294542= num_caps: 13000, num_leases: 12, request_load_avg: 105, num_completed_requests: 0, num_completed_flushes: 6
>>> >> > clientid: 1282187= num_caps: 16869, num_leases: 1, request_load_avg: 0, num_completed_requests: 0, num_completed_flushes: 1
>>> >> > clientid: 1275589= num_caps: 18943, num_leases: 0, request_load_avg: 52, num_completed_requests: 0, num_completed_flushes: 1
>>> >> > clientid: 1282154= num_caps: 24747, num_leases: 1, request_load_avg: 57, num_completed_requests: 2, num_completed_flushes: 2
>>> >> > clientid: 1275553= num_caps: 25120, num_leases: 2, request_load_avg: 116, num_completed_requests: 2, num_completed_flushes: 8
>>> >> > clientid: 1282142= num_caps: 27185, num_leases: 6, request_load_avg: 128, num_completed_requests: 0, num_completed_flushes: 8
>>> >> > clientid: 1275535= num_caps: 40364, num_leases: 6, request_load_avg: 111, num_completed_requests: 2, num_completed_flushes: 8
>>> >> > clientid: 1282130= num_caps: 41483, num_leases: 0, request_load_avg: 135, num_completed_requests: 0, num_completed_flushes: 1
>>> >> > clientid: 1275547= num_caps: 42953, num_leases: 4, request_load_avg: 119, num_completed_requests: 2, num_completed_flushes: 6
>>> >> > clientid: 1282139= num_caps: 45435, num_leases: 27, request_load_avg: 84, num_completed_requests: 2, num_completed_flushes: 34
>>> >> > clientid: 1282136= num_caps: 48374, num_leases: 8, request_load_avg: 0, num_completed_requests: 1, num_completed_flushes: 1
>>> >> > clientid: 1275532= num_caps: 48664, num_leases: 7, request_load_avg: 115, num_completed_requests: 2, num_completed_flushes: 8
>>> >> > clientid: 1191789= num_caps: 130319, num_leases: 0, request_load_avg: 1753, num_completed_requests: 0, num_completed_flushes: 0
>>> >> > clientid: 1275571= num_caps: 139488, num_leases: 0, request_load_avg: 2, num_completed_requests: 0, num_completed_flushes: 1
>>> >> > clientid: 1282133= num_caps: 145487, num_leases: 0, request_load_avg: 8, num_completed_requests: 1, num_completed_flushes: 1
>>> >> > clientid: 1534496= num_caps: 1041316, num_leases: 0, request_load_avg: 0, num_completed_requests: 0, num_completed_flushes: 1
>>> >> >
>>> >> > ------------------------------
>>> >> > When I check the dashboard/service/mds I see 120%+ CPU usage on the
>>> >> > active MDS, but on the host everything is almost idle and disk waits
>>> >> > are very low.
>>> >> >
>>> >> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>> >> >            0.61    0.00    0.38    0.41    0.00   98.60
>>> >> >
>>> >> > Device     r/s  rMB/s  rrqm/s  %rrqm  r_await  rareq-sz     w/s  wMB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s  dMB/s  drqm/s  %drqm  d_await  dareq-sz     f/s  f_await  aqu-sz  %util
>>> >> > sdc       2.00   0.01    0.00   0.00     0.50      6.00   20.00   0.04    0.00   0.00     0.50      2.00  0.00   0.00    0.00   0.00     0.00      0.00   10.00     0.60    0.02   1.20
>>> >> > sdd       3.00   0.02    0.00   0.00     0.67      8.00  285.00   1.84   77.00  21.27     0.44      6.61  0.00   0.00    0.00   0.00     0.00      0.00  114.00     0.83    0.22  22.40
>>> >> > sde       1.00   0.01    0.00   0.00     1.00      8.00   36.00   0.08    3.00   7.69     0.64      2.33  0.00   0.00    0.00   0.00     0.00      0.00   18.00     0.67    0.04   1.60
>>> >> > sdf       5.00   0.04    0.00   0.00     0.40      7.20   40.00   0.09    3.00   6.98     0.53      2.30  0.00   0.00    0.00   0.00     0.00      0.00   20.00     0.70    0.04   2.00
>>> >> > sdg      11.00   0.08    0.00   0.00     0.73      7.27   36.00   0.09    4.00  10.00     0.50      2.44  0.00   0.00    0.00   0.00     0.00      0.00   18.00     0.72    0.04   3.20
>>> >> > sdh       5.00   0.03    0.00   0.00     0.60      5.60   46.00   0.10    2.00   4.17     0.59      2.17  0.00   0.00    0.00   0.00     0.00      0.00   23.00     0.83    0.05   2.80
>>> >> > sdi       7.00   0.04    0.00   0.00     0.43      6.29   36.00   0.07    1.00   2.70     0.47      2.11  0.00   0.00    0.00   0.00     0.00      0.00   18.00     0.61    0.03   2.40
>>> >> > sdj       5.00   0.04    0.00   0.00     0.80      7.20   42.00   0.09    1.00   2.33     0.67      2.10  0.00   0.00    0.00   0.00     0.00      0.00   21.00     0.81    0.05   3.20
>>> >> >
>>> >> > ------------------------------
>>> >> > Other than this 5-node cluster, I also have a 3-node cluster with
>>> >> > identical hardware, but it serves a different purpose and data
>>> >> > workload.
>>> >> > In that cluster I don't have any problems and the MDS default settings
>>> >> > seem to be enough.
>>> >> > The only difference between the two clusters is that the 5-node cluster
>>> >> > is used directly by users, while the 3-node cluster is used heavily to
>>> >> > read and write data via projects, not by users. So allocation and
>>> >> > de-allocation will be better there.
>>> >> >
>>> >> > I guess I just have a problematic use case on the 5-node cluster and,
>>> >> > as I mentioned above, I might have a similar problem but I don't know
>>> >> > how to debug it.
>>> >> >
>>> >> >
>>> >> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/
>>> >> > quote: "A user running VSCodium, keeping 15k caps open.. the opportunistic
>>> >> > caps recall eventually starts recalling those but the (el7 kernel) client
>>> >> > won't release them. Stopping Codium seems to be the only way to release."
>>> >> >
>>> >> > ------------------------------
>>> >> > Before reading the osd df you should know that I created 2x
>>> >> > OSD/per"CT4000MX500SSD1"
>>> >> > # ceph osd df tree
>>> >> > ID   CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
>>> >> >  -1         145.54321         -  146 TiB   45 TiB   44 TiB   119 GiB  333 GiB  101 TiB  30.81  1.00    -          root default
>>> >> >  -3          29.10864         -   29 TiB  8.9 TiB  8.8 TiB    25 GiB   66 GiB   20 TiB  30.54  0.99    -              host ud-01
>>> >> >   0    ssd    1.81929   1.00000  1.8 TiB  616 GiB  610 GiB   1.4 GiB  4.5 GiB  1.2 TiB  33.04  1.07   61      up          osd.0
>>> >> >   1    ssd    1.81929   1.00000  1.8 TiB  527 GiB  521 GiB   1.5 GiB  4.0 GiB  1.3 TiB  28.28  0.92   53      up          osd.1
>>> >> >   2    ssd    1.81929   1.00000  1.8 TiB  595 GiB  589 GiB   2.3 GiB  4.0 GiB  1.2 TiB  31.96  1.04   63      up          osd.2
>>> >> >   3    ssd    1.81929   1.00000  1.8 TiB  527 GiB  521 GiB   1.8 GiB  4.2 GiB  1.3 TiB  28.30  0.92   55      up          osd.3
>>> >> >   4    ssd    1.81929   1.00000  1.8 TiB  525 GiB  520 GiB   1.3 GiB  3.9 GiB  1.3 TiB  28.21  0.92   52      up          osd.4
>>> >> >   5    ssd    1.81929   1.00000  1.8 TiB  592 GiB  586 GiB   1.8 GiB  3.8 GiB  1.2 TiB  31.76  1.03   61      up          osd.5
>>> >> >   6    ssd    1.81929   1.00000  1.8 TiB  559 GiB  553 GiB   1.8 GiB  4.3 GiB  1.3 TiB  30.03  0.97   57      up          osd.6
>>> >> >   7    ssd    1.81929   1.00000  1.8 TiB  602 GiB  597 GiB   836 MiB  4.4 GiB  1.2 TiB  32.32  1.05   58      up          osd.7
>>> >> >   8    ssd    1.81929   1.00000  1.8 TiB  614 GiB  609 GiB   1.2 GiB  4.5 GiB  1.2 TiB  32.98  1.07   60      up          osd.8
>>> >> >   9    ssd    1.81929   1.00000  1.8 TiB  571 GiB  565 GiB   2.2 GiB  4.2 GiB  1.3 TiB  30.67  1.00   61      up          osd.9
>>> >> >  10    ssd    1.81929   1.00000  1.8 TiB  528 GiB  522 GiB   1.3 GiB  4.1 GiB  1.3 TiB  28.33  0.92   52      up          osd.10
>>> >> >  11    ssd    1.81929   1.00000  1.8 TiB  551 GiB  546 GiB   1.5 GiB  3.6 GiB  1.3 TiB  29.57  0.96   56      up          osd.11
>>> >> >  12    ssd    1.81929   1.00000  1.8 TiB  594 GiB  588 GiB   1.8 GiB  4.4 GiB  1.2 TiB  31.91  1.04   61      up          osd.12
>>> >> >  13    ssd    1.81929   1.00000  1.8 TiB  561 GiB  555 GiB   1.1 GiB  4.3 GiB  1.3 TiB  30.10  0.98   55      up          osd.13
>>> >> >  14    ssd    1.81929   1.00000  1.8 TiB  616 GiB  609 GiB   1.9 GiB  4.2 GiB  1.2 TiB  33.04  1.07   64      up          osd.14
>>> >> >  15    ssd    1.81929   1.00000  1.8 TiB  525 GiB  520 GiB   1.1 GiB  4.0 GiB  1.3 TiB  28.20  0.92   51      up          osd.15
>>> >> >  -5          29.10864         -   29 TiB  9.0 TiB  8.9 TiB    22 GiB   67 GiB   20 TiB  30.89  1.00    -              host ud-02
>>> >> >  16    ssd    1.81929   1.00000  1.8 TiB  617 GiB  611 GiB   1.7 GiB  4.7 GiB  1.2 TiB  33.12  1.08   63      up          osd.16
>>> >> >  17    ssd    1.81929   1.00000  1.8 TiB  582 GiB  577 GiB   1.6 GiB  4.0 GiB  1.3 TiB  31.26  1.01   59      up          osd.17
>>> >> >  18    ssd    1.81929   1.00000  1.8 TiB  583 GiB  578 GiB   418 MiB  4.0 GiB  1.3 TiB  31.29  1.02   54      up          osd.18
>>> >> >  19    ssd    1.81929   1.00000  1.8 TiB  550 GiB  544 GiB   1.5 GiB  4.0 GiB  1.3 TiB  29.50  0.96   56      up          osd.19
>>> >> >  20    ssd    1.81929   1.00000  1.8 TiB  551 GiB  546 GiB   1.1 GiB  4.1 GiB  1.3 TiB  29.57  0.96   54      up          osd.20
>>> >> >  21    ssd    1.81929   1.00000  1.8 TiB  616 GiB  610 GiB   1.3 GiB  4.4 GiB  1.2 TiB  33.04  1.07   60      up          osd.21
>>> >> >  22    ssd    1.81929   1.00000  1.8 TiB  573 GiB  567 GiB   1.6 GiB  4.1 GiB  1.3 TiB  30.75  1.00   58      up          osd.22
>>> >> >  23    ssd    1.81929   1.00000  1.8 TiB  616 GiB  610 GiB   1.3 GiB  4.3 GiB  1.2 TiB  33.06  1.07   60      up          osd.23
>>> >> >  24    ssd    1.81929   1.00000  1.8 TiB  539 GiB  534 GiB   844 MiB  3.8 GiB  1.3 TiB  28.92  0.94   51      up          osd.24
>>> >> >  25    ssd    1.81929   1.00000  1.8 TiB  583 GiB  576 GiB   2.1 GiB  4.1 GiB  1.3 TiB  31.27  1.02   61      up          osd.25
>>> >> >  26    ssd    1.81929   1.00000  1.8 TiB  617 GiB  611 GiB   1.3 GiB  4.6 GiB  1.2 TiB  33.12  1.08   61      up          osd.26
>>> >> >  27    ssd    1.81929   1.00000  1.8 TiB  537 GiB  532 GiB   1.2 GiB  4.1 GiB  1.3 TiB  28.84  0.94   53      up          osd.27
>>> >> >  28    ssd    1.81929   1.00000  1.8 TiB  527 GiB  522 GiB   1.3 GiB  4.2 GiB  1.3 TiB  28.29  0.92   53      up          osd.28
>>> >> >  29    ssd    1.81929   1.00000  1.8 TiB  594 GiB  588 GiB   1.5 GiB  4.6 GiB  1.2 TiB  31.91  1.04   59      up          osd.29
>>> >> >  30    ssd    1.81929   1.00000  1.8 TiB  528 GiB  523 GiB   1.4 GiB  4.1 GiB  1.3 TiB  28.35  0.92   53      up          osd.30
>>> >> >  31    ssd    1.81929   1.00000  1.8 TiB  594 GiB  589 GiB   1.6 GiB  3.8 GiB  1.2 TiB  31.89  1.03   61      up          osd.31
>>> >> >  -7          29.10864         -   29 TiB  8.9 TiB  8.8 TiB    23 GiB   67 GiB   20 TiB  30.66  1.00    -              host ud-03
>>> >> >  32    ssd    1.81929   1.00000  1.8 TiB  593 GiB  588 GiB   1.1 GiB  4.3 GiB  1.2 TiB  31.84  1.03   57      up          osd.32
>>> >> >  33    ssd    1.81929   1.00000  1.8 TiB  617 GiB  611 GiB   1.8 GiB  4.4 GiB  1.2 TiB  33.13  1.08   63      up          osd.33
>>> >> >  34    ssd    1.81929   1.00000  1.8 TiB  537 GiB  532 GiB   2.0 GiB  3.8 GiB  1.3 TiB  28.84  0.94   59      up          osd.34
>>> >> >  35    ssd    1.81929   1.00000  1.8 TiB  562 GiB  556 GiB   1.7 GiB  4.2 GiB  1.3 TiB  30.16  0.98   58      up          osd.35
>>> >> >  36    ssd    1.81929   1.00000  1.8 TiB  529 GiB  523 GiB   1.3 GiB  3.9 GiB  1.3 TiB  28.38  0.92   52      up          osd.36
>>> >> >  37    ssd    1.81929   1.00000  1.8 TiB  527 GiB  521 GiB   1.7 GiB  4.2 GiB  1.3 TiB  28.28  0.92   55      up          osd.37
>>> >> >  38    ssd    1.81929   1.00000  1.8 TiB  574 GiB  568 GiB   1.2 GiB  4.3 GiB  1.3 TiB  30.79  1.00   55      up          osd.38
>>> >> >  39    ssd    1.81929   1.00000  1.8 TiB  605 GiB  599 GiB   1.6 GiB  4.2 GiB  1.2 TiB  32.48  1.05   61      up          osd.39
>>> >> >  40    ssd    1.81929   1.00000  1.8 TiB  573 GiB  567 GiB   1.2 GiB  4.4 GiB  1.3 TiB  30.76  1.00   56      up          osd.40
>>> >> >  41    ssd    1.81929   1.00000  1.8 TiB  526 GiB  520 GiB   1.7 GiB  3.9 GiB  1.3 TiB  28.21  0.92   54      up          osd.41
>>> >> >  42    ssd    1.81929   1.00000  1.8 TiB  613 GiB  608 GiB  1010 MiB  4.4 GiB  1.2 TiB  32.91  1.07   58      up          osd.42
>>> >> >  43    ssd    1.81929   1.00000  1.8 TiB  606 GiB  600 GiB   1.7 GiB  4.3 GiB  1.2 TiB  32.51  1.06   61      up          osd.43
>>> >> >  44    ssd    1.81929   1.00000  1.8 TiB  583 GiB  577 GiB   1.6 GiB  4.2 GiB  1.3 TiB  31.29  1.02   60      up          osd.44
>>> >> >  45    ssd    1.81929   1.00000  1.8 TiB  618 GiB  613 GiB   1.4 GiB  4.3 GiB  1.2 TiB  33.18  1.08   62      up          osd.45
>>> >> >  46    ssd    1.81929   1.00000  1.8 TiB  550 GiB  544 GiB   1.5 GiB  4.2 GiB  1.3 TiB  29.50  0.96   54      up          osd.46
>>> >> >  47    ssd    1.81929   1.00000  1.8 TiB  526 GiB  522 GiB   692 MiB  3.7 GiB  1.3 TiB  28.25  0.92   50      up          osd.47
>>> >> >  -9          29.10864         -   29 TiB  9.0 TiB  8.9 TiB    26 GiB   68 GiB   20 TiB  31.04  1.01    -              host ud-04
>>> >> >  48    ssd    1.81929   1.00000  1.8 TiB  540 GiB  534 GiB   2.2 GiB  3.6 GiB  1.3 TiB  28.96  0.94   58      up          osd.48
>>> >> >  49    ssd    1.81929   1.00000  1.8 TiB  617 GiB  611 GiB   1.4 GiB  4.5 GiB  1.2 TiB  33.11  1.07   61      up          osd.49
>>> >> >  50    ssd    1.81929   1.00000  1.8 TiB  618 GiB  612 GiB   1.2 GiB  4.8 GiB  1.2 TiB  33.17  1.08   61      up          osd.50
>>> >> >  51    ssd    1.81929   1.00000  1.8 TiB  618 GiB  612 GiB   1.5 GiB  4.5 GiB  1.2 TiB  33.19  1.08   61      up          osd.51
>>> >> >  52    ssd    1.81929   1.00000  1.8 TiB  526 GiB  521 GiB   1.4 GiB  4.1 GiB  1.3 TiB  28.25  0.92   53      up          osd.52
>>> >> >  53    ssd    1.81929   1.00000  1.8 TiB  618 GiB  611 GiB   2.4 GiB  4.3 GiB  1.2 TiB  33.17  1.08   66      up          osd.53
>>> >> >  54    ssd    1.81929   1.00000  1.8 TiB  550 GiB  544 GiB   1.5 GiB  4.3 GiB  1.3 TiB  29.54  0.96   55      up          osd.54
>>> >> >  55    ssd    1.81929   1.00000  1.8 TiB  527 GiB  522 GiB   1.3 GiB  4.0 GiB  1.3 TiB  28.29  0.92   52      up          osd.55
>>> >> >  56    ssd    1.81929   1.00000  1.8 TiB  525 GiB  519 GiB   1.2 GiB  4.1 GiB  1.3 TiB  28.16  0.91   52      up          osd.56
>>> >> >  57    ssd    1.81929   1.00000  1.8 TiB  615 GiB  609 GiB   2.3 GiB  4.2 GiB  1.2 TiB  33.03  1.07   65      up          osd.57
>>> >> >  58    ssd    1.81929   1.00000  1.8 TiB  527 GiB  522 GiB   1.6 GiB  3.7 GiB  1.3 TiB  28.31  0.92   55      up          osd.58
>>> >> >  59    ssd    1.81929   1.00000  1.8 TiB  615 GiB  609 GiB   1.2 GiB  4.6 GiB  1.2 TiB  33.01  1.07   60      up          osd.59
>>> >> >  60    ssd    1.81929   1.00000  1.8 TiB  594 GiB  588 GiB   1.2 GiB  4.4 GiB  1.2 TiB  31.88  1.03   59      up          osd.60
>>> >> >  61    ssd    1.81929   1.00000  1.8 TiB  616 GiB  610 GiB   1.9 GiB  4.1 GiB  1.2 TiB  33.04  1.07   64      up          osd.61
>>> >> >  62    ssd    1.81929   1.00000  1.8 TiB  620 GiB  614 GiB   1.9 GiB  4.4 GiB  1.2 TiB  33.27  1.08   63      up          osd.62
>>> >> >  63    ssd    1.81929   1.00000  1.8 TiB  527 GiB  522 GiB   1.5 GiB  4.0 GiB  1.3 TiB  28.30  0.92   53      up          osd.63
>>> >> > -11          29.10864         -   29 TiB  9.0 TiB  8.9 TiB    23 GiB   65 GiB   20 TiB  30.91  1.00    -              host ud-05
>>> >> >  64    ssd    1.81929   1.00000  1.8 TiB  608 GiB  601 GiB   2.3 GiB  4.5 GiB  1.2 TiB  32.62  1.06   65      up          osd.64
>>> >> >  65    ssd    1.81929   1.00000  1.8 TiB  606 GiB  601 GiB   628 MiB  4.2 GiB  1.2 TiB  32.53  1.06   57      up          osd.65
>>> >> >  66    ssd    1.81929   1.00000  1.8 TiB  583 GiB  578 GiB   1.3 GiB  4.3 GiB  1.2 TiB  31.31  1.02   57      up          osd.66
>>> >> >  67    ssd    1.81929   1.00000  1.8 TiB  537 GiB  533 GiB   436 MiB  3.6 GiB  1.3 TiB  28.82  0.94   50      up          osd.67
>>> >> >  68    ssd    1.81929   1.00000  1.8 TiB  541 GiB  535 GiB   2.5 GiB  3.8 GiB  1.3 TiB  29.04  0.94   59      up          osd.68
>>> >> >  69    ssd    1.81929   1.00000  1.8 TiB  606 GiB  601 GiB   1.1 GiB  4.4 GiB  1.2 TiB  32.55  1.06   59      up          osd.69
>>> >> >  70    ssd    1.81929   1.00000  1.8 TiB  604 GiB  598 GiB   1.8 GiB  4.1 GiB  1.2 TiB  32.44  1.05   63      up          osd.70
>>> >> >  71    ssd    1.81929   1.00000  1.8 TiB  606 GiB  600 GiB   1.9 GiB  4.5 GiB  1.2 TiB  32.53  1.06   62      up          osd.71
>>> >> >  72    ssd    1.81929   1.00000  1.8 TiB  602 GiB  598 GiB   612 MiB  4.1 GiB  1.2 TiB  32.33  1.05   57      up          osd.72
>>> >> >  73    ssd    1.81929   1.00000  1.8 TiB  571 GiB  565 GiB   1.8 GiB  4.5 GiB  1.3 TiB  30.65  0.99   58      up          osd.73
>>> >> >  74    ssd    1.81929   1.00000  1.8 TiB  608 GiB  602 GiB   1.8 GiB  4.2 GiB  1.2 TiB  32.62  1.06   61      up          osd.74
>>> >> >  75    ssd    1.81929   1.00000  1.8 TiB  536 GiB  531 GiB   1.9 GiB  3.5 GiB  1.3 TiB  28.80  0.93   57      up          osd.75
>>> >> >  76    ssd    1.81929   1.00000  1.8 TiB  605 GiB  599 GiB   1.4 GiB  4.5 GiB  1.2 TiB  32.48  1.05   60      up          osd.76
>>> >> >  77    ssd    1.81929   1.00000  1.8 TiB  537 GiB  532 GiB   1.2 GiB  3.9 GiB  1.3 TiB  28.84  0.94   52      up          osd.77
>>> >> >  78    ssd    1.81929   1.00000  1.8 TiB  525 GiB  520 GiB   1.3 GiB  3.8 GiB  1.3 TiB  28.20  0.92   52      up          osd.78
>>> >> >  79    ssd    1.81929   1.00000  1.8 TiB  536 GiB  531 GiB   1.1 GiB  3.3 GiB  1.3 TiB  28.76  0.93   53      up          osd.79
>>> >> >                           TOTAL  146 TiB   45 TiB   44 TiB   119 GiB  333 GiB  101 TiB  30.81
>>> >> > MIN/MAX VAR: 0.91/1.08  STDDEV: 1.90
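
As a reading aid for the osd df output above: the VAR column appears to be each OSD's %USE divided by the cluster-wide %USE from the TOTAL row (30.81). The Python sketch below assumes that definition and checks it against three rows copied from the table. The helper name `variance_ratio` is only illustrative and is not part of any Ceph tooling.

```python
# Cluster-wide %USE, taken from the TOTAL row of the `ceph osd df tree` output.
CLUSTER_USE = 30.81

# %USE values copied from three rows of the table (fullest, emptiest, typical high).
SAMPLE_USE = {"osd.0": 33.04, "osd.56": 28.16, "osd.62": 33.27}


def variance_ratio(osd_use: float, cluster_use: float = CLUSTER_USE) -> float:
    """Assumed definition of VAR: per-OSD utilization over average utilization."""
    return round(osd_use / cluster_use, 2)


for name, use in SAMPLE_USE.items():
    print(name, variance_ratio(use))
```

Under that assumption the three results match the VAR values listed for osd.0, osd.56 and osd.62 (1.07, 0.91 and 1.08), which is consistent with the reported MIN/MAX VAR of 0.91/1.08.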
>>> >> >
>>> >> >
>>> >> >
>>> >> > Eugen Block <eblock@xxxxxx> wrote on Thu, 25 Jan 2024 at 16:52:
>>> >> >
>>> >> >> There is no definitive answer wrt mds tuning. As is mentioned
>>> >> >> everywhere, it's about finding the right setup for your specific
>>> >> >> workload. If you can synthesize your workload (maybe scale it down a
>>> >> >> bit), try optimizing it in a test cluster without interrupting your
>>> >> >> developers too much.
>>> >> >> But what you haven't explained yet is what you are experiencing as a
>>> >> >> performance issue. Do you have numbers or a detailed description?
>>> >> >> From the fs status output you didn't seem to have too much activity
>>> >> >> going on (around 140 requests per second), but that's probably not
>>> >> >> the usual traffic? What does ceph report in its client IO output?
>>> >> >> Can you paste the 'ceph osd df' output as well?
>>> >> >> Do you have dedicated MDS servers or are they colocated with other
>>> >> >> services?
>>> >> >>
>>> >> >> Quoting Özkan Göksu <ozkangksu@xxxxxxxxx>:
>>> >> >>
>>> >> >> > Hello Eugen.
>>> >> >> >
>>> >> >> > I have read all of your MDS-related threads; thank you so much for
>>> >> >> > your effort on this.
>>> >> >> > There is not much information available and I couldn't find an MDS
>>> >> >> > tuning guide at all. It seems that you are the right person to
>>> >> >> > discuss MDS debugging and tuning with.
>>> >> >> >
>>> >> >> > Do you have any documents, or could you tell me the proper way to
>>> >> >> > debug the MDS and clients?
>>> >> >> > Which debug logs will help me understand the limitations and tune
>>> >> >> > according to the data flow?
>>> >> >> >
>>> >> >> > While searching, I found this:
>>> >> >> >
>>> >> >> > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/
>>> >> >> > quote: "A user running VSCodium, keeping 15k caps open.. the
>>> >> >> > opportunistic caps recall eventually starts recalling those but the
>>> >> >> > (el7 kernel) client won't release them. Stopping Codium seems to be
>>> >> >> > the only way to release."
>>> >> >> >
>>> >> >> > Because of this, I think I also need to tune the client side.
>>> >> >> >
>>> >> >> > My main goal is to increase speed and reduce latency, and I wonder
>>> >> >> > whether these ideas are correct:
>>> >> >> > - Maybe I need to increase the client-side cache size, because
>>> >> >> > multiple users request a lot of objects through each client and
>>> >> >> > clearly the client_cache_size=16 default is not enough.
>>> >> >> > - Maybe I need to increase the client-side maximum cache limits for
>>> >> >> > objects ("client_oc_max_objects" from 1000 to 10000) and for data
>>> >> >> > ("client_oc_size" from 200mi to 400mi).
>>> >> >> > - The client cache cleaning threshold is not aggressive enough to
>>> >> >> > keep the free cache size in the desired range. I need to make it
>>> >> >> > more aggressive, but without reducing speed or increasing latency.
>>> >> >> >
>>> >> >> > mds_cache_memory_limit=4gi to 16gi
>>> >> >> > client_oc_max_objects=1000 to 10000
>>> >> >> > client_oc_size=200mi to 400mi
>>> >> >> > client_permissions=false #to reduce latency.
>>> >> >> > client_cache_size=16 to 128
>>> >> >> >
>>> >> >> >
>>> >> >> > What do you think?
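
To make the proposed values above concrete, here is a small Python sketch that converts the "gi"/"mi" shorthand into bytes and prints matching `ceph config set` commands. It only builds strings and does not talk to a cluster. The helper names (`to_bytes`, `render_commands`) and the choice of the `mds` and `client` config sections are assumptions made for illustration. `client_permissions=false` is left out because it is a boolean rather than a size. As far as I know, the `client_*` options are read by userspace clients (ceph-fuse/libcephfs), so they may have no effect on a kernel mount.

```python
# Binary (IEC) suffixes used in the proposal, mapped to byte multipliers.
UNITS = {"ki": 1024, "mi": 1024**2, "gi": 1024**3, "ti": 1024**4}


def to_bytes(value: str) -> int:
    """Convert shorthand such as '16gi' or '400mi' to an integer byte count."""
    text = value.strip().lower()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # plain counts such as '10000' pass through unchanged


# (section, option, proposed value) taken from the list above; sections are assumed.
PROPOSED = [
    ("mds", "mds_cache_memory_limit", "16gi"),
    ("client", "client_oc_max_objects", "10000"),
    ("client", "client_oc_size", "400mi"),
    ("client", "client_cache_size", "128"),
]


def render_commands(settings):
    """Render one `ceph config set` command line per proposed setting."""
    return [f"ceph config set {who} {opt} {to_bytes(val)}" for who, opt, val in settings]


for line in render_commands(PROPOSED):
    print(line)
```

Printing the commands first, rather than running them, leaves room to review each value and to try it on a test cluster before touching production.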
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx