I decided to tune the CephFS client's kernel and increase the network buffers to improve speed. This time the client has a 1x 10Gbit DAC connection. The client version is one step ahead of the cluster:

ceph-common/stable,now 17.2.7-1focal amd64 [installed]

The kernel tunings:

root@maradona:~# cat /etc/sysctl.conf
net.ipv4.tcp_syncookies = 0             # Disable syncookies (syncookies are not RFC compliant and can use too many resources)
net.ipv4.tcp_keepalive_time = 600       # Keepalive time for TCP connections (seconds)
net.ipv4.tcp_synack_retries = 3         # Number of SYNACK retries before giving up
net.ipv4.tcp_syn_retries = 3            # Number of SYN retries before giving up
net.ipv4.tcp_rfc1337 = 1                # Set to 1 to enable RFC 1337 protection
net.ipv4.conf.all.log_martians = 1      # Log packets with impossible addresses to kernel log
net.ipv4.inet_peer_gc_mintime = 5       # Minimum interval between garbage collection passes
net.ipv4.tcp_ecn = 0                    # Disable Explicit Congestion Notification in TCP
net.ipv4.tcp_window_scaling = 1         # Enable window scaling as defined in RFC1323
net.ipv4.tcp_timestamps = 1             # Enable timestamps (RFC1323)
net.ipv4.tcp_sack = 1                   # Enable selective acknowledgments
net.ipv4.tcp_fack = 1                   # Enable FACK congestion avoidance and fast retransmission
net.ipv4.tcp_dsack = 1                  # Allow TCP to send "duplicate" SACKs
net.ipv4.ip_forward = 0                 # Controls IP packet forwarding
net.ipv4.conf.default.rp_filter = 0     # Disable source route verification (RFC1812)
net.ipv4.tcp_tw_recycle = 1             # Enable fast recycling of TIME-WAIT sockets
net.ipv4.tcp_max_syn_backlog = 20000    # to keep TCP_SYNQ_HSIZE*16 <= tcp_max_syn_backlog
net.ipv4.tcp_max_orphans = 412520       # how many TCP sockets not attached to any user file handle the kernel maintains
net.ipv4.tcp_orphan_retries = 1         # How many times to retry before killing a TCP connection closed by our side
net.ipv4.tcp_fin_timeout = 20           # how long to keep sockets in FIN-WAIT-2 if we were the one closing the socket
net.ipv4.tcp_max_tw_buckets = 33001472  # maximum number of sockets in TIME-WAIT held simultaneously
net.ipv4.tcp_no_metrics_save = 1        # don't cache ssthresh from previous connection
net.ipv4.tcp_moderate_rcvbuf = 1        # auto-tune the TCP receive buffer
net.ipv4.tcp_rmem = 4096 87380 16777216 # increase Linux autotuning TCP buffer limits
net.ipv4.tcp_wmem = 4096 65536 16777216 # increase Linux autotuning TCP buffer limits
# increase TCP max buffer size
# net.core.rmem_max = 16777216 #try this if you get problems
# net.core.wmem_max = 16777216 #try this if you get problems
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 262144
net.core.wmem_default = 262144
#net.core.netdev_max_backlog = 2500 #try this if you get problems
net.core.netdev_max_backlog = 30000
net.core.somaxconn = 65000
net.ipv6.conf.all.disable_ipv6 = 1      # Disable IPv6
# You can monitor the kernel behavior with regard to the dirty
# pages by using grep -A 1 dirty /proc/vmstat
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15
fs.file-max = 16500736                  # system open file limit
# Core dump
kernel.core_pattern = /var/core_dumps/core.%e.%p.%h.%t
fs.suid_dumpable = 2
# Kernel related tunings
kernel.printk = 4 4 1 7
kernel.core_uses_pid = 1
kernel.sysrq = 0
kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.shmmax = 243314299699            # Maximum shared segment size in bytes
kernel.shmall = 66003228                # Maximum total shared memory in pages
vm.nr_hugepages = 4096                  # Reserve static huge pages
vm.swappiness = 0                       # Set vm.swappiness to 0 to minimize swapping
vm.min_free_kbytes = 2640129            # required free memory (set to 1% of physical RAM)
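As a side note, these settings can be loaded and spot-checked without a reboot; a minimal sketch (assuming the file above is the live /etc/sysctl.conf) would be:

    sysctl -p /etc/sysctl.conf
    sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem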
iobenchmark result:

Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M write: IOPS=1111, BW=1111MiB/s (1165MB/s)(3072MiB/2764msec); 0 zone resets
BS=128K write: IOPS=3812, BW=477MiB/s (500MB/s)(3072MiB/6446msec); 0 zone resets
BS=64K write: IOPS=5116, BW=320MiB/s (335MB/s)(3072MiB/9607msec); 0 zone resets
BS=32K write: IOPS=6545, BW=205MiB/s (214MB/s)(3072MiB/15018msec); 0 zone resets
BS=16K write: IOPS=8004, BW=125MiB/s (131MB/s)(3072MiB/24561msec); 0 zone resets
BS=4K write: IOPS=8661, BW=33.8MiB/s (35.5MB/s)(3072MiB/90801msec); 0 zone resets

Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M read: IOPS=1117, BW=1117MiB/s (1171MB/s)(3072MiB/2750msec)
BS=128K read: IOPS=8353, BW=1044MiB/s (1095MB/s)(3072MiB/2942msec)
BS=64K read: IOPS=11.8k, BW=739MiB/s (775MB/s)(3072MiB/4155msec)
BS=32K read: IOPS=16.3k, BW=508MiB/s (533MB/s)(3072MiB/6049msec)
BS=16K read: IOPS=23.0k, BW=375MiB/s (393MB/s)(3072MiB/8195msec)
BS=4K read: IOPS=27.4k, BW=107MiB/s (112MB/s)(3072MiB/28740msec)

Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M write: IOPS=1102, BW=1103MiB/s (1156MB/s)(3072MiB/2786msec); 0 zone resets
BS=128K write: IOPS=8581, BW=1073MiB/s (1125MB/s)(3072MiB/2864msec); 0 zone resets
BS=64K write: IOPS=10.9k, BW=681MiB/s (714MB/s)(3072MiB/4511msec); 0 zone resets
BS=32K write: IOPS=12.1k, BW=378MiB/s (396MB/s)(3072MiB/8129msec); 0 zone resets
BS=16K write: IOPS=12.7k, BW=198MiB/s (208MB/s)(3072MiB/15487msec); 0 zone resets
BS=4K write: IOPS=12.7k, BW=49.7MiB/s (52.1MB/s)(3072MiB/61848msec); 0 zone resets

Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M read: IOPS=1113, BW=1114MiB/s (1168MB/s)(3072MiB/2758msec)
BS=128K read: IOPS=8953, BW=1119MiB/s (1173MB/s)(3072MiB/2745msec)
BS=64K read: IOPS=17.9k, BW=1116MiB/s (1170MB/s)(3072MiB/2753msec)
BS=32K read: IOPS=35.1k, BW=1096MiB/s (1150MB/s)(3072MiB/2802msec)
BS=16K read: IOPS=69.4k, BW=1085MiB/s (1138MB/s)(3072MiB/2831msec)
BS=4K read: IOPS=112k, BW=438MiB/s (459MB/s)(3072MiB/7015msec)

*Everything looks good except 4K speeds:*

Seq Write  - BS=4K write: IOPS=8661, BW=33.8MiB/s (35.5MB/s)(3072MiB/90801msec); 0 zone resets
Rand Write - BS=4K write: IOPS=12.7k, BW=49.7MiB/s (52.1MB/s)(3072MiB/61848msec); 0 zone resets

What do you think?
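For what it's worth, the iobench.sh linked in the quoted message below presumably boils down to fio invocations roughly like this hypothetical sketch (the job name, the test directory, and the exact flags are assumptions, not taken from the script):

    # hypothetical approximation of the 4K random-write case; adjust --rw/--bs per test
    fio --name=randwrite-4k --directory=/mnt/cephfs/testdir \
        --rw=randwrite --bs=4k --size=1G --numjobs=3 --iodepth=32 \
        --direct=1 --ioengine=libaio --group_reporting

If those parameters hold, numjobs=3 x iodepth=32 keeps about 96 writes in flight, so 12.7k IOPS at 4K works out to roughly 96 / 12700 ≈ 7.5 ms average completion time per write. That would point at per-operation commit latency on the client-to-OSD/replication path rather than link bandwidth, which would also explain why the larger block sizes can still saturate the 10Gbit link.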
On Sat, 27 Jan 2024 at 04:08, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:

> Wow, I noticed something!
>
> To prevent RAM overflow with GPU training allocations, I'm using a 2TB
> Samsung 870 EVO for swap.
>
> As you can see below, swap usage was 18Gi while the server was idle, which
> means the ceph client may be hitting latency because of the swap usage.
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# free -h
>                total        used        free      shared  buff/cache   available
> Mem:            62Gi        34Gi        27Gi       0.0Ki       639Mi        27Gi
> Swap:          1.8Ti        18Gi       1.8Ti
>
> I decided to play around with kernel parameters to prevent ceph swap usage.
>
>> kernel.shmmax = 60654764851   # Maximum shared segment size in bytes
>> kernel.shmall = 16453658      # Maximum total shared memory in pages
>> vm.nr_hugepages = 4096        # Reserve static huge pages
>> vm.swappiness = 0             # Set vm.swappiness to 0 to minimize swapping
>> vm.min_free_kbytes = 1048576  # required free memory (set to 1% of physical RAM)
>
> I rebooted the server and after the reboot swap usage is 0 as expected.
>
> To give it a try I started iobench.sh
> (https://github.com/ozkangoksu/benchmark/blob/main/iobench.sh).
> This client has a 1G NIC only. As you can see below, other than at 4K block
> size the ceph client can saturate the NIC.
>
> root@bmw-m4:~# nicstat -MUz 1
>     Time      Int   rMbps   wMbps    rPk/s    wPk/s    rAvs    wAvs  %rUtil  %wUtil
> 01:04:48   ens1f0   936.9   92.90  91196.8  60126.3  1346.6   202.5    98.2    9.74
>
> root@bmw-m4:/mounts/ud-data/benchuser1/96f13211-c37f-42db-8d05-f3255a05129e/testdir# bash iobench.sh
> Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27395msec); 0 zone resets
> BS=128K write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27462msec); 0 zone resets
> BS=64K write: IOPS=1758, BW=110MiB/s (115MB/s)(3072MiB/27948msec); 0 zone resets
> BS=32K write: IOPS=3542, BW=111MiB/s (116MB/s)(3072MiB/27748msec); 0 zone resets
> BS=16K write: IOPS=6839, BW=107MiB/s (112MB/s)(3072MiB/28747msec); 0 zone resets
> BS=4K write: IOPS=8473, BW=33.1MiB/s (34.7MB/s)(3072MiB/92813msec); 0 zone resets
> Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27386msec)
> BS=128K read: IOPS=895, BW=112MiB/s (117MB/s)(3072MiB/27431msec)
> BS=64K read: IOPS=1788, BW=112MiB/s (117MB/s)(3072MiB/27486msec)
> BS=32K read: IOPS=3561, BW=111MiB/s (117MB/s)(3072MiB/27603msec)
> BS=16K read: IOPS=6924, BW=108MiB/s (113MB/s)(3072MiB/28392msec)
> BS=4K read: IOPS=21.3k, BW=83.3MiB/s (87.3MB/s)(3072MiB/36894msec)
> Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27406msec); 0 zone resets
> BS=128K write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27466msec); 0 zone resets
> BS=64K write: IOPS=1781, BW=111MiB/s (117MB/s)(3072MiB/27591msec); 0 zone resets
> BS=32K write: IOPS=3545, BW=111MiB/s (116MB/s)(3072MiB/27729msec); 0 zone resets
> BS=16K write: IOPS=6823, BW=107MiB/s (112MB/s)(3072MiB/28814msec); 0 zone resets
> BS=4K write: IOPS=12.7k, BW=49.8MiB/s (52.2MB/s)(3072MiB/61694msec); 0 zone resets
> Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27388msec)
> BS=128K read: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27479msec)
> BS=64K read: IOPS=1784, BW=112MiB/s (117MB/s)(3072MiB/27547msec)
> BS=32K read: IOPS=3559, BW=111MiB/s (117MB/s)(3072MiB/27614msec)
> BS=16K read: IOPS=7047, BW=110MiB/s (115MB/s)(3072MiB/27897msec)
> BS=4K read: IOPS=26.9k, BW=105MiB/s (110MB/s)(3072MiB/29199msec)
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702# cat metrics
> item                              total
> ------------------------------------------
> opened files  / total inodes      0 / 109
> pinned i_caps / total inodes      109 / 109
> opened inodes / total inodes      0 / 109
>
> item       total      avg_lat(us)  min_lat(us)  max_lat(us)  stdev(us)
> -----------------------------------------------------------------------------------
> read       2316289    13904        221          8827984      760
> write      2317824    21152        2975         9243821      2365
> metadata   170        5944         225          202505       24314
>
> item       total      avg_sz(bytes)  min_sz(bytes)  max_sz(bytes)  total_sz(bytes)
> ----------------------------------------------------------------------------------------
> read       2316289    16688          4096           1048576        38654712361
> write      2317824    19457          4096           4194304        45097156608
>
> item       total      miss       hit
> -------------------------------------------------
> d_lease    112        3          858
> caps       109        58         6963547
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702# free -h
>                total        used        free      shared  buff/cache   available
> Mem:            62Gi        11Gi        50Gi       3.0Mi       1.0Gi        49Gi
> Swap:          1.8Ti          0B       1.8Ti
>
> I started to feel we are getting closer :)
>
> On Sat, 27 Jan 2024 at 02:58, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
>
>> I started to investigate my clients.
>>
>> For example:
>>
>> root@ud-01:~# ceph health detail
>> HEALTH_WARN 1 clients failing to respond to cache pressure
>> [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
>>     mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to respond to cache pressure client_id: 1275577
>>
>> root@ud-01:~# ceph fs status
>> ud-data - 86 clients
>> =======
>> RANK  STATE            MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
>>  0    active  ud-data.ud-02.xcoojt  Reqs: 34 /s  2926k  2827k   155k  1157k
>>
>> ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | "clientid: \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases), request_load_avg: \(.request_load_avg), num_completed_requests: \(.num_completed_requests), num_completed_flushes: \(.num_completed_flushes)"' | sort -n -t: -k3
>>
>> clientid: *1275577*= num_caps: 12312, num_leases: 0, request_load_avg: 0, num_completed_requests: 0, num_completed_flushes: 1
>> clientid: 1275571= num_caps: 16307, num_leases: 1, request_load_avg: 2101, num_completed_requests: 0, num_completed_flushes: 3
>> clientid: 1282130= num_caps: 26337, num_leases: 3, request_load_avg: 116, num_completed_requests: 0, num_completed_flushes: 1
>> clientid: 1191789= num_caps: 32784, num_leases: 0, request_load_avg: 1846, num_completed_requests: 0, num_completed_flushes: 0
>> clientid: 1275535= num_caps: 79825, num_leases: 2, request_load_avg: 133, num_completed_requests: 8, num_completed_flushes: 8
>> clientid: 1282142= num_caps: 80581, num_leases: 6, request_load_avg: 125, num_completed_requests: 2, num_completed_flushes: 6
>> clientid: 1275532= num_caps: 87836, num_leases: 3, request_load_avg: 190, num_completed_requests: 2, num_completed_flushes: 6
>> clientid: 1275547= num_caps: 94129, num_leases: 4, request_load_avg: 149, num_completed_requests: 2, num_completed_flushes: 4
>> clientid: 1275553= num_caps: 96460, num_leases: 4, request_load_avg: 155, num_completed_requests: 2, num_completed_flushes: 8
>> clientid: 1282139= num_caps: 108882, num_leases: 25, request_load_avg: 99, num_completed_requests: 2, num_completed_flushes: 4
>> clientid: 1275538= num_caps: 437162, num_leases: 0, request_load_avg: 101, num_completed_requests: 2, num_completed_flushes: 0
>>
>> --------------------------------------
>>
>> *MY CLIENT:*
>>
>> The client is actually at idle mode and there is no reason for it to fail at all.
>> >> root@bmw-m4:~# apt list --installed |grep ceph >> ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed] >> libcephfs2/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 >> [installed,automatic] >> python3-ceph-argparse/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 >> [installed,automatic] >> python3-ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 all >> [installed,automatic] >> python3-cephfs/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 >> [installed,automatic] >> >> Let's check metrics and stats: >> >> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> cat metrics >> item total >> ------------------------------------------ >> opened files / total inodes 2 / 12312 >> pinned i_caps / total inodes 12312 / 12312 >> opened inodes / total inodes 1 / 12312 >> >> item total avg_lat(us) min_lat(us) max_lat(us) >> stdev(us) >> >> ----------------------------------------------------------------------------------- >> read 22283 44409 430 1804853 >> 15619 >> write 112702 419725 3658 8879541 >> 6008 >> metadata 353322 5712 154 917903 >> 5357 >> >> item total avg_sz(bytes) min_sz(bytes) max_sz(bytes) >> total_sz(bytes) >> >> ---------------------------------------------------------------------------------------- >> read 22283 1701940 1 4194304 >> 37924318602 >> write 112702 246211 1 4194304 >> 27748469309 >> >> item total miss hit >> ------------------------------------------------- >> d_lease 62 63627 28564698 >> caps 12312 36658 44568261 >> >> >> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> cat bdi/stats >> BdiWriteback: 0 kB >> BdiReclaimable: 800 kB >> BdiDirtyThresh: 0 kB >> DirtyThresh: 5795340 kB >> BackgroundThresh: 2894132 kB >> BdiDirtied: 27316320 kB >> BdiWritten: 27316320 kB >> BdiWriteBandwidth: 1472 kBps >> b_dirty: 0 >> b_io: 0 >> b_more_io: 0 >> b_dirty_time: 0 >> bdi_list: 1 >> state: 1 >> >> >> Last 3 days dmesg output: >> >> [Wed Jan 24 16:45:13 2024] xfsettingsd[653036]: segfault at 18 ip >> 00007fbd12f5d337 sp 00007ffd254332a0 error 4 in >> libxklavier.so.16.4.0[7fbd12f4d000+19000] >> [Wed Jan 24 16:45:13 2024] Code: 4c 89 e7 e8 0b 56 ff ff 48 89 03 48 8b >> 5c 24 30 e9 d1 fd ff ff e8 b9 5b ff ff 66 0f 1f 84 00 00 00 00 00 41 54 55 >> 48 89 f5 53 <48> 8b 42 18 48 89 d1 49 89 fc 48 89 d3 48 89 fa 48 89 ef 48 >> 8b b0 >> [Thu Jan 25 06:51:31 2024] NVRM: GPU at PCI:0000:81:00: >> GPU-02efbb18-c9e4-3a16-d615-598959520b99 >> [Thu Jan 25 06:51:31 2024] NVRM: GPU Board Serial Number: 1321421015411 >> [Thu Jan 25 06:51:31 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=683281, >> name=python, Ch 00000008 >> [Thu Jan 25 06:56:49 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=683377, >> name=python, Ch 00000018 >> [Thu Jan 25 20:14:13 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=696062, >> name=python, Ch 00000008 >> [Fri Jan 26 04:05:40 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=700166, >> name=python, Ch 00000008 >> [Fri Jan 26 05:05:12 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=700320, >> name=python, Ch 00000008 >> [Fri Jan 26 05:44:50 2024] NVRM: GPU at PCI:0000:82:00: >> GPU-3af62a2c-e7eb-a7d5-c073-22f06dc7065f >> [Fri Jan 26 05:44:50 2024] NVRM: GPU Board Serial Number: 1321421010400 >> [Fri Jan 26 05:44:50 2024] NVRM: Xid (PCI:0000:82:00): 43, pid=700757, >> name=python, Ch 00000018 >> [Fri Jan 26 05:56:02 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=701096, >> name=python, Ch 00000028 >> [Fri Jan 26 06:34:20 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=701226, >> name=python, Ch 00000038 >> 
>> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> free -h >> total used free shared buff/cache >> available >> Mem: 62Gi 34Gi 27Gi 0.0Ki 639Mi >> 27Gi >> Swap: 1.8Ti 18Gi 1.8Ti >> >> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> cat /proc/vmstat >> nr_free_pages 7231171 >> nr_zone_inactive_anon 7924766 >> nr_zone_active_anon 525190 >> nr_zone_inactive_file 44029 >> nr_zone_active_file 55966 >> nr_zone_unevictable 13042 >> nr_zone_write_pending 3 >> nr_mlock 13042 >> nr_bounce 0 >> nr_zspages 0 >> nr_free_cma 0 >> numa_hit 6701928919 >> numa_miss 312628341 >> numa_foreign 312628341 >> numa_interleave 31538 >> numa_local 6701864751 >> numa_other 312692567 >> nr_inactive_anon 7924766 >> nr_active_anon 525190 >> nr_inactive_file 44029 >> nr_active_file 55966 >> nr_unevictable 13042 >> nr_slab_reclaimable 61076 >> nr_slab_unreclaimable 63509 >> nr_isolated_anon 0 >> nr_isolated_file 0 >> workingset_nodes 3934 >> workingset_refault_anon 30325493 >> workingset_refault_file 14593094 >> workingset_activate_anon 5376050 >> workingset_activate_file 3250679 >> workingset_restore_anon 292317 >> workingset_restore_file 1166673 >> workingset_nodereclaim 488665 >> nr_anon_pages 8451968 >> nr_mapped 35731 >> nr_file_pages 138824 >> nr_dirty 3 >> nr_writeback 0 >> nr_writeback_temp 0 >> nr_shmem 242 >> nr_shmem_hugepages 0 >> nr_shmem_pmdmapped 0 >> nr_file_hugepages 0 >> nr_file_pmdmapped 0 >> nr_anon_transparent_hugepages 3588 >> nr_vmscan_write 33746573 >> nr_vmscan_immediate_reclaim 160 >> nr_dirtied 48165341 >> nr_written 80207893 >> nr_kernel_misc_reclaimable 0 >> nr_foll_pin_acquired 174002 >> nr_foll_pin_released 174002 >> nr_kernel_stack 60032 >> nr_page_table_pages 46041 >> nr_swapcached 36166 >> nr_dirty_threshold 1448010 >> nr_dirty_background_threshold 723121 >> pgpgin 129904699 >> pgpgout 299261581 >> pswpin 30325493 >> pswpout 45158221 >> pgalloc_dma 1024 >> pgalloc_dma32 57788566 >> pgalloc_normal 6956384725 >> pgalloc_movable 0 >> allocstall_dma 0 >> allocstall_dma32 0 >> allocstall_normal 188 >> allocstall_movable 63024 >> pgskip_dma 0 >> pgskip_dma32 0 >> pgskip_normal 0 >> pgskip_movable 0 >> pgfree 7222273815 >> pgactivate 1371753960 >> pgdeactivate 18329381 >> pglazyfree 10 >> pgfault 7795723861 >> pgmajfault 4600007 >> pglazyfreed 0 >> pgrefill 18575528 >> pgreuse 81910383 >> pgsteal_kswapd 980532060 >> pgsteal_direct 38942066 >> pgdemote_kswapd 0 >> pgdemote_direct 0 >> pgscan_kswapd 1135293298 >> pgscan_direct 58883653 >> pgscan_direct_throttle 15 >> pgscan_anon 220939938 >> pgscan_file 973237013 >> pgsteal_anon 46538607 >> pgsteal_file 972935519 >> zone_reclaim_failed 0 >> pginodesteal 0 >> slabs_scanned 25879882 >> kswapd_inodesteal 2179831 >> kswapd_low_wmark_hit_quickly 152797 >> kswapd_high_wmark_hit_quickly 32025 >> pageoutrun 204447 >> pgrotated 44963935 >> drop_pagecache 0 >> drop_slab 0 >> oom_kill 0 >> numa_pte_updates 2724410955 >> numa_huge_pte_updates 1695890 >> numa_hint_faults 1739823254 >> numa_hint_faults_local 1222358972 >> numa_pages_migrated 312611639 >> pgmigrate_success 510846802 >> pgmigrate_fail 875493 >> thp_migration_success 156413 >> thp_migration_fail 2 >> thp_migration_split 0 >> compact_migrate_scanned 1274073243 >> compact_free_scanned 8430842597 >> compact_isolated 400278352 >> compact_stall 145300 >> compact_fail 128562 >> compact_success 16738 >> compact_daemon_wake 170247 >> compact_daemon_migrate_scanned 35486283 >> compact_daemon_free_scanned 369870412 >> 
htlb_buddy_alloc_success 0 >> htlb_buddy_alloc_fail 0 >> unevictable_pgs_culled 2774290 >> unevictable_pgs_scanned 0 >> unevictable_pgs_rescued 2675031 >> unevictable_pgs_mlocked 2813622 >> unevictable_pgs_munlocked 2674972 >> unevictable_pgs_cleared 84231 >> unevictable_pgs_stranded 84225 >> thp_fault_alloc 416468 >> thp_fault_fallback 19181 >> thp_fault_fallback_charge 0 >> thp_collapse_alloc 17931 >> thp_collapse_alloc_failed 76 >> thp_file_alloc 0 >> thp_file_fallback 0 >> thp_file_fallback_charge 0 >> thp_file_mapped 0 >> thp_split_page 2 >> thp_split_page_failed 0 >> thp_deferred_split_page 66 >> thp_split_pmd 22451 >> thp_split_pud 0 >> thp_zero_page_alloc 1 >> thp_zero_page_alloc_failed 0 >> thp_swpout 22332 >> thp_swpout_fallback 0 >> balloon_inflate 0 >> balloon_deflate 0 >> balloon_migrate 0 >> swap_ra 25777929 >> swap_ra_hit 25658825 >> direct_map_level2_splits 1249 >> direct_map_level3_splits 49 >> nr_unstable 0 >> >> >> >> Özkan Göksu <ozkangksu@xxxxxxxxx>, 27 Oca 2024 Cmt, 02:36 tarihinde şunu >> yazdı: >> >>> Hello Frank. >>> >>> I have 84 clients (high-end servers) with: Ubuntu 20.04.5 LTS - Kernel: >>> Linux 5.4.0-125-generic >>> >>> My cluster 17.2.6 quincy. >>> I have some client nodes with "ceph-common/stable,now 17.2.7-1focal" I >>> wonder using new version clients is the main problem? >>> Maybe I have a communication error. For example I hit this problem and I >>> can not collect client stats " >>> https://github.com/ceph/ceph/pull/52127/files" >>> >>> Best regards. >>> >>> >>> >>> Frank Schilder <frans@xxxxxx>, 26 Oca 2024 Cum, 14:53 tarihinde şunu >>> yazdı: >>> >>>> Hi, this message is one of those that are often spurious. I don't >>>> recall in which thread/PR/tracker I read it, but the story was something >>>> like that: >>>> >>>> If an MDS gets under memory pressure it will request dentry items back >>>> from *all* clients, not just the active ones or the ones holding many of >>>> them. If you have a client that's below the min-threshold for dentries (its >>>> one of the client/mds tuning options), it will not respond. This client >>>> will be flagged as not responding, which is a false positive. >>>> >>>> I believe the devs are working on a fix to get rid of these spurious >>>> warnings. There is a "bug/feature" in the MDS that does not clear this >>>> warning flag for inactive clients. Hence, the message hangs and never >>>> disappears. I usually clear it with a "echo 3 > /proc/sys/vm/drop_caches" >>>> on the client. However, except for being annoying in the dashboard, it has >>>> no performance or otherwise negative impact. >>>> >>>> Best regards, >>>> ================= >>>> Frank Schilder >>>> AIT Risø Campus >>>> Bygning 109, rum S14 >>>> >>>> ________________________________________ >>>> From: Eugen Block <eblock@xxxxxx> >>>> Sent: Friday, January 26, 2024 10:05 AM >>>> To: Özkan Göksu >>>> Cc: ceph-users@xxxxxxx >>>> Subject: Re: 1 clients failing to respond to cache >>>> pressure (quincy:17.2.6) >>>> >>>> Performance for small files is more about IOPS rather than throughput, >>>> and the IOPS in your fio tests look okay to me. What you could try is >>>> to split the PGs to get around 150 or 200 PGs per OSD. You're >>>> currently at around 60 according to the ceph osd df output. Before you >>>> do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data | >>>> head'? I don't need the whole output, just to see how many objects >>>> each PG has. 
We had a case once where that helped, but it was an older >>>> cluster and the pool was backed by HDDs and separate rocksDB on SSDs. >>>> So this might not be the solution here, but it could improve things as >>>> well. >>>> >>>> >>>> Zitat von Özkan Göksu <ozkangksu@xxxxxxxxx>: >>>> >>>> > Every user has a 1x subvolume and I only have 1 pool. >>>> > At the beginning we were using each subvolume for ldap home directory >>>> + >>>> > user data. >>>> > When a user logins any docker on any host, it was using the cluster >>>> for >>>> > home and the for user related data, we was have second directory in >>>> the >>>> > same subvolume. >>>> > Time to time users were feeling a very slow home environment and >>>> after a >>>> > month it became almost impossible to use home. VNC sessions became >>>> > unresponsive and slow etc. >>>> > >>>> > 2 weeks ago, I had to migrate home to a ZFS storage and now the >>>> overall >>>> > performance is better for only user_data without home. >>>> > But still the performance is not good enough as I expected because of >>>> the >>>> > problems related to MDS. >>>> > The usage is low but allocation is high and Cpu usage is high. You >>>> saw the >>>> > IO Op/s, it's nothing but allocation is high. >>>> > >>>> > I develop a fio benchmark script and I run the script on 4x test >>>> server at >>>> > the same time, the results are below: >>>> > Script: >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh >>>> > >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt >>>> > >>>> > While running benchmark, I take sample values for each type of >>>> iobench run. >>>> > >>>> > Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr >>>> > client: 60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr >>>> > client: 13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr >>>> > >>>> > Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr >>>> > client: 370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr >>>> > >>>> > Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr >>>> > client: 14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr >>>> > client: 6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr >>>> > >>>> > Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr >>>> > client: 2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr >>>> > client: 4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr >>>> > client: 2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr >>>> > >>>> > It seems I only have problems with the 4K,8K,16K other sector sizes. >>>> > >>>> > >>>> > >>>> > >>>> > Eugen Block <eblock@xxxxxx>, 25 Oca 2024 Per, 19:06 tarihinde şunu >>>> yazdı: >>>> > >>>> >> I understand that your MDS shows a high CPU usage, but other than >>>> that >>>> >> what is your performance issue? Do users complain? 
Do some operations >>>> >> take longer than expected? Are OSDs saturated during those phases? >>>> >> Because the cache pressure messages don’t necessarily mean that users >>>> >> will notice. >>>> >> MDS daemons are single-threaded so that might be a bottleneck. In >>>> that >>>> >> case multi-active mds might help, which you already tried and >>>> >> experienced OOM killers. But you might have to disable the mds >>>> >> balancer as someone else mentioned. And then you could think about >>>> >> pinning, is it possible to split the CephFS into multiple >>>> >> subdirectories and pin them to different ranks? >>>> >> But first I’d still like to know what the performance issue really >>>> is. >>>> >> >>>> >> Zitat von Özkan Göksu <ozkangksu@xxxxxxxxx>: >>>> >> >>>> >> > I will try my best to explain my situation. >>>> >> > >>>> >> > I don't have a separate mds server. I have 5 identical nodes, 3 of >>>> them >>>> >> > mons, and I use the other 2 as active and standby mds. (currently >>>> I have >>>> >> > left overs from max_mds 4) >>>> >> > >>>> >> > root@ud-01:~# ceph -s >>>> >> > cluster: >>>> >> > id: e42fd4b0-313b-11ee-9a00-31da71873773 >>>> >> > health: HEALTH_WARN >>>> >> > 1 clients failing to respond to cache pressure >>>> >> > >>>> >> > services: >>>> >> > mon: 3 daemons, quorum ud-01,ud-02,ud-03 (age 9d) >>>> >> > mgr: ud-01.qycnol(active, since 8d), standbys: ud-02.tfhqfd >>>> >> > mds: 1/1 daemons up, 4 standby >>>> >> > osd: 80 osds: 80 up (since 9d), 80 in (since 5M) >>>> >> > >>>> >> > data: >>>> >> > volumes: 1/1 healthy >>>> >> > pools: 3 pools, 2305 pgs >>>> >> > objects: 106.58M objects, 25 TiB >>>> >> > usage: 45 TiB used, 101 TiB / 146 TiB avail >>>> >> > pgs: 2303 active+clean >>>> >> > 2 active+clean+scrubbing+deep >>>> >> > >>>> >> > io: >>>> >> > client: 16 MiB/s rd, 3.4 MiB/s wr, 77 op/s rd, 23 op/s wr >>>> >> > >>>> >> > ------------------------------ >>>> >> > root@ud-01:~# ceph fs status >>>> >> > ud-data - 84 clients >>>> >> > ======= >>>> >> > RANK STATE MDS ACTIVITY DNS INOS >>>> DIRS >>>> >> > CAPS >>>> >> > 0 active ud-data.ud-02.xcoojt Reqs: 40 /s 2579k 2578k >>>> 169k >>>> >> > 3048k >>>> >> > POOL TYPE USED AVAIL >>>> >> > cephfs.ud-data.meta metadata 136G 44.9T >>>> >> > cephfs.ud-data.data data 44.3T 44.9T >>>> >> > >>>> >> > ------------------------------ >>>> >> > root@ud-01:~# ceph health detail >>>> >> > HEALTH_WARN 1 clients failing to respond to cache pressure >>>> >> > [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache >>>> pressure >>>> >> > mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to >>>> respond to >>>> >> > cache pressure client_id: 1275577 >>>> >> > >>>> >> > ------------------------------ >>>> >> > When I check the failing client with session ls I see only >>>> "num_caps: >>>> >> 12298" >>>> >> > >>>> >> > ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | >>>> "clientid: >>>> >> > \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases), >>>> >> > request_load_avg: \(.request_load_avg), num_completed_requests: >>>> >> > \(.num_completed_requests), num_completed_flushes: >>>> >> > \(.num_completed_flushes)"' | sort -n -t: -k3 >>>> >> > >>>> >> > clientid: 1275577= num_caps: 12298, num_leases: 0, >>>> request_load_avg: 0, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1294542= num_caps: 13000, num_leases: 12, >>>> request_load_avg: >>>> >> 105, >>>> >> > num_completed_requests: 0, num_completed_flushes: 6 >>>> >> > clientid: 1282187= num_caps: 16869, num_leases: 1, >>>> 
request_load_avg: 0, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1275589= num_caps: 18943, num_leases: 0, >>>> request_load_avg: 52, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1282154= num_caps: 24747, num_leases: 1, >>>> request_load_avg: 57, >>>> >> > num_completed_requests: 2, num_completed_flushes: 2 >>>> >> > clientid: 1275553= num_caps: 25120, num_leases: 2, >>>> request_load_avg: 116, >>>> >> > num_completed_requests: 2, num_completed_flushes: 8 >>>> >> > clientid: 1282142= num_caps: 27185, num_leases: 6, >>>> request_load_avg: 128, >>>> >> > num_completed_requests: 0, num_completed_flushes: 8 >>>> >> > clientid: 1275535= num_caps: 40364, num_leases: 6, >>>> request_load_avg: 111, >>>> >> > num_completed_requests: 2, num_completed_flushes: 8 >>>> >> > clientid: 1282130= num_caps: 41483, num_leases: 0, >>>> request_load_avg: 135, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1275547= num_caps: 42953, num_leases: 4, >>>> request_load_avg: 119, >>>> >> > num_completed_requests: 2, num_completed_flushes: 6 >>>> >> > clientid: 1282139= num_caps: 45435, num_leases: 27, >>>> request_load_avg: 84, >>>> >> > num_completed_requests: 2, num_completed_flushes: 34 >>>> >> > clientid: 1282136= num_caps: 48374, num_leases: 8, >>>> request_load_avg: 0, >>>> >> > num_completed_requests: 1, num_completed_flushes: 1 >>>> >> > clientid: 1275532= num_caps: 48664, num_leases: 7, >>>> request_load_avg: 115, >>>> >> > num_completed_requests: 2, num_completed_flushes: 8 >>>> >> > clientid: 1191789= num_caps: 130319, num_leases: 0, >>>> request_load_avg: >>>> >> 1753, >>>> >> > num_completed_requests: 0, num_completed_flushes: 0 >>>> >> > clientid: 1275571= num_caps: 139488, num_leases: 0, >>>> request_load_avg: 2, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1282133= num_caps: 145487, num_leases: 0, >>>> request_load_avg: 8, >>>> >> > num_completed_requests: 1, num_completed_flushes: 1 >>>> >> > clientid: 1534496= num_caps: 1041316, num_leases: 0, >>>> request_load_avg: 0, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > >>>> >> > ------------------------------ >>>> >> > When I check the dashboard/service/mds I see %120+ CPU usage on >>>> active >>>> >> MDS >>>> >> > but on the host everything is almost idle and disk waits are very >>>> low. 
>>>> >> > >>>> >> > avg-cpu: %user %nice %system %iowait %steal %idle >>>> >> > 0.61 0.00 0.38 0.41 0.00 98.60 >>>> >> > >>>> >> > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz >>>> w/s >>>> >> > wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s >>>> >> %drqm >>>> >> > d_await dareq-sz f/s f_await aqu-sz %util >>>> >> > sdc 2.00 0.01 0.00 0.00 0.50 6.00 >>>> 20.00 >>>> >> > 0.04 0.00 0.00 0.50 2.00 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 10.00 0.60 0.02 1.20 >>>> >> > sdd 3.00 0.02 0.00 0.00 0.67 8.00 >>>> 285.00 >>>> >> > 1.84 77.00 21.27 0.44 6.61 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 114.00 0.83 0.22 22.40 >>>> >> > sde 1.00 0.01 0.00 0.00 1.00 8.00 >>>> 36.00 >>>> >> > 0.08 3.00 7.69 0.64 2.33 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 18.00 0.67 0.04 1.60 >>>> >> > sdf 5.00 0.04 0.00 0.00 0.40 7.20 >>>> 40.00 >>>> >> > 0.09 3.00 6.98 0.53 2.30 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 20.00 0.70 0.04 2.00 >>>> >> > sdg 11.00 0.08 0.00 0.00 0.73 7.27 >>>> 36.00 >>>> >> > 0.09 4.00 10.00 0.50 2.44 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 18.00 0.72 0.04 3.20 >>>> >> > sdh 5.00 0.03 0.00 0.00 0.60 5.60 >>>> 46.00 >>>> >> > 0.10 2.00 4.17 0.59 2.17 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 23.00 0.83 0.05 2.80 >>>> >> > sdi 7.00 0.04 0.00 0.00 0.43 6.29 >>>> 36.00 >>>> >> > 0.07 1.00 2.70 0.47 2.11 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 18.00 0.61 0.03 2.40 >>>> >> > sdj 5.00 0.04 0.00 0.00 0.80 7.20 >>>> 42.00 >>>> >> > 0.09 1.00 2.33 0.67 2.10 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 21.00 0.81 0.05 3.20 >>>> >> > >>>> >> > ------------------------------ >>>> >> > Other than this 5x node cluster, I also have a 3x node cluster with >>>> >> > identical hardware but it serves for a different purpose and data >>>> >> workload. >>>> >> > In this cluster I don't have any problem and MDS default settings >>>> seems >>>> >> > enough. >>>> >> > The only difference between two cluster is, 5x node cluster used >>>> directly >>>> >> > by users, 3x node cluster used heavily to read and write data via >>>> >> projects >>>> >> > not by users. So allocate and de-allocate will be better. >>>> >> > >>>> >> > I guess I just have a problematic use case on the 5x node cluster >>>> and as >>>> >> I >>>> >> > mentioned above, I might have the similar problem but I don't know >>>> how to >>>> >> > debug it. >>>> >> > >>>> >> > >>>> >> >>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/ >>>> >> > quote:"A user running VSCodium, keeping 15k caps open.. the >>>> opportunistic >>>> >> > caps recall eventually starts recalling those but the (el7 kernel) >>>> client >>>> >> > won't release them. Stopping Codium seems to be the only way to >>>> release." 
>>>> >> > >>>> >> > ------------------------------ >>>> >> > Before reading the osd df you should know that I created 2x >>>> >> > OSD/per"CT4000MX500SSD1" >>>> >> > # ceph osd df tree >>>> >> > ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP >>>> >> META >>>> >> > AVAIL %USE VAR PGS STATUS TYPE NAME >>>> >> > -1 145.54321 - 146 TiB 45 TiB 44 TiB 119 >>>> GiB 333 >>>> >> > GiB 101 TiB 30.81 1.00 - root default >>>> >> > -3 29.10864 - 29 TiB 8.9 TiB 8.8 TiB 25 >>>> GiB 66 >>>> >> > GiB 20 TiB 30.54 0.99 - host ud-01 >>>> >> > 0 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.4 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 33.04 1.07 61 up osd.0 >>>> >> > 1 ssd 1.81929 1.00000 1.8 TiB 527 GiB 521 GiB 1.5 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.28 0.92 53 up osd.1 >>>> >> > 2 ssd 1.81929 1.00000 1.8 TiB 595 GiB 589 GiB 2.3 >>>> GiB 4.0 >>>> >> > GiB 1.2 TiB 31.96 1.04 63 up osd.2 >>>> >> > 3 ssd 1.81929 1.00000 1.8 TiB 527 GiB 521 GiB 1.8 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 28.30 0.92 55 up osd.3 >>>> >> > 4 ssd 1.81929 1.00000 1.8 TiB 525 GiB 520 GiB 1.3 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.21 0.92 52 up osd.4 >>>> >> > 5 ssd 1.81929 1.00000 1.8 TiB 592 GiB 586 GiB 1.8 >>>> GiB 3.8 >>>> >> > GiB 1.2 TiB 31.76 1.03 61 up osd.5 >>>> >> > 6 ssd 1.81929 1.00000 1.8 TiB 559 GiB 553 GiB 1.8 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 30.03 0.97 57 up osd.6 >>>> >> > 7 ssd 1.81929 1.00000 1.8 TiB 602 GiB 597 GiB 836 >>>> MiB 4.4 >>>> >> > GiB 1.2 TiB 32.32 1.05 58 up osd.7 >>>> >> > 8 ssd 1.81929 1.00000 1.8 TiB 614 GiB 609 GiB 1.2 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.98 1.07 60 up osd.8 >>>> >> > 9 ssd 1.81929 1.00000 1.8 TiB 571 GiB 565 GiB 2.2 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 30.67 1.00 61 up osd.9 >>>> >> > 10 ssd 1.81929 1.00000 1.8 TiB 528 GiB 522 GiB 1.3 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.33 0.92 52 up osd.10 >>>> >> > 11 ssd 1.81929 1.00000 1.8 TiB 551 GiB 546 GiB 1.5 >>>> GiB 3.6 >>>> >> > GiB 1.3 TiB 29.57 0.96 56 up osd.11 >>>> >> > 12 ssd 1.81929 1.00000 1.8 TiB 594 GiB 588 GiB 1.8 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 31.91 1.04 61 up osd.12 >>>> >> > 13 ssd 1.81929 1.00000 1.8 TiB 561 GiB 555 GiB 1.1 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 30.10 0.98 55 up osd.13 >>>> >> > 14 ssd 1.81929 1.00000 1.8 TiB 616 GiB 609 GiB 1.9 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 33.04 1.07 64 up osd.14 >>>> >> > 15 ssd 1.81929 1.00000 1.8 TiB 525 GiB 520 GiB 1.1 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.20 0.92 51 up osd.15 >>>> >> > -5 29.10864 - 29 TiB 9.0 TiB 8.9 TiB 22 >>>> GiB 67 >>>> >> > GiB 20 TiB 30.89 1.00 - host ud-02 >>>> >> > 16 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.7 >>>> GiB 4.7 >>>> >> > GiB 1.2 TiB 33.12 1.08 63 up osd.16 >>>> >> > 17 ssd 1.81929 1.00000 1.8 TiB 582 GiB 577 GiB 1.6 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 31.26 1.01 59 up osd.17 >>>> >> > 18 ssd 1.81929 1.00000 1.8 TiB 583 GiB 578 GiB 418 >>>> MiB 4.0 >>>> >> > GiB 1.3 TiB 31.29 1.02 54 up osd.18 >>>> >> > 19 ssd 1.81929 1.00000 1.8 TiB 550 GiB 544 GiB 1.5 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 29.50 0.96 56 up osd.19 >>>> >> > 20 ssd 1.81929 1.00000 1.8 TiB 551 GiB 546 GiB 1.1 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 29.57 0.96 54 up osd.20 >>>> >> > 21 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.3 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 33.04 1.07 60 up osd.21 >>>> >> > 22 ssd 1.81929 1.00000 1.8 TiB 573 GiB 567 GiB 1.6 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 30.75 1.00 58 up osd.22 >>>> >> > 23 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.3 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 33.06 1.07 60 up osd.23 >>>> >> > 24 ssd 1.81929 1.00000 1.8 TiB 539 GiB 534 GiB 844 >>>> 
MiB 3.8 >>>> >> > GiB 1.3 TiB 28.92 0.94 51 up osd.24 >>>> >> > 25 ssd 1.81929 1.00000 1.8 TiB 583 GiB 576 GiB 2.1 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 31.27 1.02 61 up osd.25 >>>> >> > 26 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.3 >>>> GiB 4.6 >>>> >> > GiB 1.2 TiB 33.12 1.08 61 up osd.26 >>>> >> > 27 ssd 1.81929 1.00000 1.8 TiB 537 GiB 532 GiB 1.2 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.84 0.94 53 up osd.27 >>>> >> > 28 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.3 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 28.29 0.92 53 up osd.28 >>>> >> > 29 ssd 1.81929 1.00000 1.8 TiB 594 GiB 588 GiB 1.5 >>>> GiB 4.6 >>>> >> > GiB 1.2 TiB 31.91 1.04 59 up osd.29 >>>> >> > 30 ssd 1.81929 1.00000 1.8 TiB 528 GiB 523 GiB 1.4 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.35 0.92 53 up osd.30 >>>> >> > 31 ssd 1.81929 1.00000 1.8 TiB 594 GiB 589 GiB 1.6 >>>> GiB 3.8 >>>> >> > GiB 1.2 TiB 31.89 1.03 61 up osd.31 >>>> >> > -7 29.10864 - 29 TiB 8.9 TiB 8.8 TiB 23 >>>> GiB 67 >>>> >> > GiB 20 TiB 30.66 1.00 - host ud-03 >>>> >> > 32 ssd 1.81929 1.00000 1.8 TiB 593 GiB 588 GiB 1.1 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 31.84 1.03 57 up osd.32 >>>> >> > 33 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.8 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 33.13 1.08 63 up osd.33 >>>> >> > 34 ssd 1.81929 1.00000 1.8 TiB 537 GiB 532 GiB 2.0 >>>> GiB 3.8 >>>> >> > GiB 1.3 TiB 28.84 0.94 59 up osd.34 >>>> >> > 35 ssd 1.81929 1.00000 1.8 TiB 562 GiB 556 GiB 1.7 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 30.16 0.98 58 up osd.35 >>>> >> > 36 ssd 1.81929 1.00000 1.8 TiB 529 GiB 523 GiB 1.3 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.38 0.92 52 up osd.36 >>>> >> > 37 ssd 1.81929 1.00000 1.8 TiB 527 GiB 521 GiB 1.7 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 28.28 0.92 55 up osd.37 >>>> >> > 38 ssd 1.81929 1.00000 1.8 TiB 574 GiB 568 GiB 1.2 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 30.79 1.00 55 up osd.38 >>>> >> > 39 ssd 1.81929 1.00000 1.8 TiB 605 GiB 599 GiB 1.6 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 32.48 1.05 61 up osd.39 >>>> >> > 40 ssd 1.81929 1.00000 1.8 TiB 573 GiB 567 GiB 1.2 >>>> GiB 4.4 >>>> >> > GiB 1.3 TiB 30.76 1.00 56 up osd.40 >>>> >> > 41 ssd 1.81929 1.00000 1.8 TiB 526 GiB 520 GiB 1.7 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.21 0.92 54 up osd.41 >>>> >> > 42 ssd 1.81929 1.00000 1.8 TiB 613 GiB 608 GiB 1010 >>>> MiB 4.4 >>>> >> > GiB 1.2 TiB 32.91 1.07 58 up osd.42 >>>> >> > 43 ssd 1.81929 1.00000 1.8 TiB 606 GiB 600 GiB 1.7 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 32.51 1.06 61 up osd.43 >>>> >> > 44 ssd 1.81929 1.00000 1.8 TiB 583 GiB 577 GiB 1.6 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 31.29 1.02 60 up osd.44 >>>> >> > 45 ssd 1.81929 1.00000 1.8 TiB 618 GiB 613 GiB 1.4 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 33.18 1.08 62 up osd.45 >>>> >> > 46 ssd 1.81929 1.00000 1.8 TiB 550 GiB 544 GiB 1.5 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 29.50 0.96 54 up osd.46 >>>> >> > 47 ssd 1.81929 1.00000 1.8 TiB 526 GiB 522 GiB 692 >>>> MiB 3.7 >>>> >> > GiB 1.3 TiB 28.25 0.92 50 up osd.47 >>>> >> > -9 29.10864 - 29 TiB 9.0 TiB 8.9 TiB 26 >>>> GiB 68 >>>> >> > GiB 20 TiB 31.04 1.01 - host ud-04 >>>> >> > 48 ssd 1.81929 1.00000 1.8 TiB 540 GiB 534 GiB 2.2 >>>> GiB 3.6 >>>> >> > GiB 1.3 TiB 28.96 0.94 58 up osd.48 >>>> >> > 49 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.4 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 33.11 1.07 61 up osd.49 >>>> >> > 50 ssd 1.81929 1.00000 1.8 TiB 618 GiB 612 GiB 1.2 >>>> GiB 4.8 >>>> >> > GiB 1.2 TiB 33.17 1.08 61 up osd.50 >>>> >> > 51 ssd 1.81929 1.00000 1.8 TiB 618 GiB 612 GiB 1.5 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 33.19 1.08 61 up osd.51 >>>> >> > 52 ssd 1.81929 1.00000 1.8 TiB 526 
GiB 521 GiB 1.4 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.25 0.92 53 up osd.52 >>>> >> > 53 ssd 1.81929 1.00000 1.8 TiB 618 GiB 611 GiB 2.4 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 33.17 1.08 66 up osd.53 >>>> >> > 54 ssd 1.81929 1.00000 1.8 TiB 550 GiB 544 GiB 1.5 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 29.54 0.96 55 up osd.54 >>>> >> > 55 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.3 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.29 0.92 52 up osd.55 >>>> >> > 56 ssd 1.81929 1.00000 1.8 TiB 525 GiB 519 GiB 1.2 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.16 0.91 52 up osd.56 >>>> >> > 57 ssd 1.81929 1.00000 1.8 TiB 615 GiB 609 GiB 2.3 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 33.03 1.07 65 up osd.57 >>>> >> > 58 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.6 >>>> GiB 3.7 >>>> >> > GiB 1.3 TiB 28.31 0.92 55 up osd.58 >>>> >> > 59 ssd 1.81929 1.00000 1.8 TiB 615 GiB 609 GiB 1.2 >>>> GiB 4.6 >>>> >> > GiB 1.2 TiB 33.01 1.07 60 up osd.59 >>>> >> > 60 ssd 1.81929 1.00000 1.8 TiB 594 GiB 588 GiB 1.2 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 31.88 1.03 59 up osd.60 >>>> >> > 61 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.9 >>>> GiB 4.1 >>>> >> > GiB 1.2 TiB 33.04 1.07 64 up osd.61 >>>> >> > 62 ssd 1.81929 1.00000 1.8 TiB 620 GiB 614 GiB 1.9 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 33.27 1.08 63 up osd.62 >>>> >> > 63 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.5 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.30 0.92 53 up osd.63 >>>> >> > -11 29.10864 - 29 TiB 9.0 TiB 8.9 TiB 23 >>>> GiB 65 >>>> >> > GiB 20 TiB 30.91 1.00 - host ud-05 >>>> >> > 64 ssd 1.81929 1.00000 1.8 TiB 608 GiB 601 GiB 2.3 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.62 1.06 65 up osd.64 >>>> >> > 65 ssd 1.81929 1.00000 1.8 TiB 606 GiB 601 GiB 628 >>>> MiB 4.2 >>>> >> > GiB 1.2 TiB 32.53 1.06 57 up osd.65 >>>> >> > 66 ssd 1.81929 1.00000 1.8 TiB 583 GiB 578 GiB 1.3 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 31.31 1.02 57 up osd.66 >>>> >> > 67 ssd 1.81929 1.00000 1.8 TiB 537 GiB 533 GiB 436 >>>> MiB 3.6 >>>> >> > GiB 1.3 TiB 28.82 0.94 50 up osd.67 >>>> >> > 68 ssd 1.81929 1.00000 1.8 TiB 541 GiB 535 GiB 2.5 >>>> GiB 3.8 >>>> >> > GiB 1.3 TiB 29.04 0.94 59 up osd.68 >>>> >> > 69 ssd 1.81929 1.00000 1.8 TiB 606 GiB 601 GiB 1.1 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 32.55 1.06 59 up osd.69 >>>> >> > 70 ssd 1.81929 1.00000 1.8 TiB 604 GiB 598 GiB 1.8 >>>> GiB 4.1 >>>> >> > GiB 1.2 TiB 32.44 1.05 63 up osd.70 >>>> >> > 71 ssd 1.81929 1.00000 1.8 TiB 606 GiB 600 GiB 1.9 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.53 1.06 62 up osd.71 >>>> >> > 72 ssd 1.81929 1.00000 1.8 TiB 602 GiB 598 GiB 612 >>>> MiB 4.1 >>>> >> > GiB 1.2 TiB 32.33 1.05 57 up osd.72 >>>> >> > 73 ssd 1.81929 1.00000 1.8 TiB 571 GiB 565 GiB 1.8 >>>> GiB 4.5 >>>> >> > GiB 1.3 TiB 30.65 0.99 58 up osd.73 >>>> >> > 74 ssd 1.81929 1.00000 1.8 TiB 608 GiB 602 GiB 1.8 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 32.62 1.06 61 up osd.74 >>>> >> > 75 ssd 1.81929 1.00000 1.8 TiB 536 GiB 531 GiB 1.9 >>>> GiB 3.5 >>>> >> > GiB 1.3 TiB 28.80 0.93 57 up osd.75 >>>> >> > 76 ssd 1.81929 1.00000 1.8 TiB 605 GiB 599 GiB 1.4 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.48 1.05 60 up osd.76 >>>> >> > 77 ssd 1.81929 1.00000 1.8 TiB 537 GiB 532 GiB 1.2 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.84 0.94 52 up osd.77 >>>> >> > 78 ssd 1.81929 1.00000 1.8 TiB 525 GiB 520 GiB 1.3 >>>> GiB 3.8 >>>> >> > GiB 1.3 TiB 28.20 0.92 52 up osd.78 >>>> >> > 79 ssd 1.81929 1.00000 1.8 TiB 536 GiB 531 GiB 1.1 >>>> GiB 3.3 >>>> >> > GiB 1.3 TiB 28.76 0.93 53 up osd.79 >>>> >> > TOTAL 146 TiB 45 TiB 44 TiB 119 >>>> GiB 333 >>>> >> > GiB 101 TiB 30.81 >>>> >> > MIN/MAX VAR: 0.91/1.08 STDDEV: 1.90 
>>>> >> > >>>> >> > >>>> >> > >>>> >> > Eugen Block <eblock@xxxxxx>, 25 Oca 2024 Per, 16:52 tarihinde şunu >>>> >> yazdı: >>>> >> > >>>> >> >> There is no definitive answer wrt mds tuning. As it is everywhere >>>> >> >> mentioned, it's about finding the right setup for your specific >>>> >> >> workload. If you can synthesize your workload (maybe scale down a >>>> bit) >>>> >> >> try optimizing it in a test cluster without interrupting your >>>> >> >> developers too much. >>>> >> >> But what you haven't explained yet is what are you experiencing >>>> as a >>>> >> >> performance issue? Do you have numbers or a detailed description? >>>> >> >> From the fs status output you didn't seem to have too much >>>> activity >>>> >> >> going on (around 140 requests per second), but that's probably >>>> not the >>>> >> >> usual traffic? What does ceph report in its client IO output? >>>> >> >> Can you paste the 'ceph osd df' output as well? >>>> >> >> Do you have dedicated MDS servers or are they colocated with other >>>> >> >> services? >>>> >> >> >>>> >> >> Zitat von Özkan Göksu <ozkangksu@xxxxxxxxx>: >>>> >> >> >>>> >> >> > Hello Eugen. >>>> >> >> > >>>> >> >> > I read all of your MDS related topics and thank you so much for >>>> your >>>> >> >> effort >>>> >> >> > on this. >>>> >> >> > There is not much information and I couldn't find a MDS tuning >>>> guide >>>> >> at >>>> >> >> > all. It seems that you are the correct person to discuss mds >>>> >> debugging >>>> >> >> and >>>> >> >> > tuning. >>>> >> >> > >>>> >> >> > Do you have any documents or may I learn what is the proper way >>>> to >>>> >> debug >>>> >> >> > MDS and clients ? >>>> >> >> > Which debug logs will guide me to understand the limitations >>>> and will >>>> >> >> help >>>> >> >> > to tune according to the data flow? >>>> >> >> > >>>> >> >> > While searching, I find this: >>>> >> >> > >>>> >> >> >>>> >> >>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/ >>>> >> >> > quote:"A user running VSCodium, keeping 15k caps open.. the >>>> >> opportunistic >>>> >> >> > caps recall eventually starts recalling those but the (el7 >>>> kernel) >>>> >> client >>>> >> >> > won't release them. Stopping Codium seems to be the only way to >>>> >> release." >>>> >> >> > >>>> >> >> > Because of this I think I also need to play around with the >>>> client >>>> >> side >>>> >> >> too. >>>> >> >> > >>>> >> >> > My main goal is increasing the speed and reducing the latency >>>> and I >>>> >> >> wonder >>>> >> >> > if these ideas are correct or not: >>>> >> >> > - Maybe I need to increase client side cache size because via >>>> each >>>> >> >> client, >>>> >> >> > multiple users request a lot of objects and clearly the >>>> >> >> > client_cache_size=16 default is not enough. >>>> >> >> > - Maybe I need to increase client side maximum cache limit for >>>> >> >> > object "client_oc_max_objects=1000 to 10000" and data >>>> >> >> "client_oc_size=200mi >>>> >> >> > to 400mi" >>>> >> >> > - The client cache cleaning threshold is not aggressive enough >>>> to keep >>>> >> >> the >>>> >> >> > free cache size in the desired range. I need to make it >>>> aggressive but >>>> >> >> this >>>> >> >> > should not reduce speed and increase latency. >>>> >> >> > >>>> >> >> > mds_cache_memory_limit=4gi to 16gi >>>> >> >> > client_oc_max_objects=1000 to 10000 >>>> >> >> > client_oc_size=200mi to 400mi >>>> >> >> > client_permissions=false #to reduce latency. 
>>>> >> >> > client_cache_size=16 to 128
>>>> >> >> >
>>>> >> >> > What do you think?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx