I decided to tune the CephFS client's kernel and increase the network buffers to improve speed. This time the client has a 1x 10Gbit DAC connection. The client version is one step ahead of the cluster:

ceph-common/stable,now 17.2.7-1focal amd64 [installed]

The kernel tunings:

root@maradona:~# cat /etc/sysctl.conf
net.ipv4.tcp_syncookies = 0             # Disable syncookies (syncookies are not RFC compliant and can use too many resources)
net.ipv4.tcp_keepalive_time = 600       # Keepalive time for TCP connections (seconds)
net.ipv4.tcp_synack_retries = 3         # Number of SYNACK retries before giving up
net.ipv4.tcp_syn_retries = 3            # Number of SYN retries before giving up
net.ipv4.tcp_rfc1337 = 1                # Set to 1 to enable RFC 1337 protection
net.ipv4.conf.all.log_martians = 1      # Log packets with impossible addresses to kernel log
net.ipv4.inet_peer_gc_mintime = 5       # Minimum interval between garbage collection passes
net.ipv4.tcp_ecn = 0                    # Disable Explicit Congestion Notification in TCP
net.ipv4.tcp_window_scaling = 1         # Enable window scaling as defined in RFC1323
net.ipv4.tcp_timestamps = 1             # Enable timestamps (RFC1323)
net.ipv4.tcp_sack = 1                   # Enable selective acknowledgments
net.ipv4.tcp_fack = 1                   # Enable FACK congestion avoidance and fast retransmission
net.ipv4.tcp_dsack = 1                  # Allow TCP to send "duplicate" SACKs
net.ipv4.ip_forward = 0                 # Controls IP packet forwarding
net.ipv4.conf.default.rp_filter = 0     # Disable source route verification (RFC1812)
net.ipv4.tcp_tw_recycle = 1             # Enable fast recycling of TIME-WAIT sockets
net.ipv4.tcp_max_syn_backlog = 20000    # to keep TCP_SYNQ_HSIZE*16 <= tcp_max_syn_backlog
net.ipv4.tcp_max_orphans = 412520       # how many TCP sockets not attached to any user file handle the kernel maintains
net.ipv4.tcp_orphan_retries = 1         # How many times to retry before killing a TCP connection closed by our side
net.ipv4.tcp_fin_timeout = 20           # how long to keep sockets in FIN-WAIT-2 if we were the one closing the socket
net.ipv4.tcp_max_tw_buckets = 33001472  # maximum number of sockets in TIME-WAIT held simultaneously
net.ipv4.tcp_no_metrics_save = 1        # don't cache ssthresh from previous connection
net.ipv4.tcp_moderate_rcvbuf = 1        # auto-tune the TCP receive buffer
net.ipv4.tcp_rmem = 4096 87380 16777216 # increase Linux autotuning TCP buffer limits
net.ipv4.tcp_wmem = 4096 65536 16777216 # increase Linux autotuning TCP buffer limits
# increase TCP max buffer size
# net.core.rmem_max = 16777216 #try this if you get problems
# net.core.wmem_max = 16777216 #try this if you get problems
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 262144
net.core.wmem_default = 262144
#net.core.netdev_max_backlog = 2500 #try this if you get problems
net.core.netdev_max_backlog = 30000
net.core.somaxconn = 65000
net.ipv6.conf.all.disable_ipv6 = 1      # Disable IPv6
# You can monitor the kernel behavior with regard to the dirty
# pages by using grep -A 1 dirty /proc/vmstat
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15
fs.file-max = 16500736                  # system open file limit
# Core dump
kernel.core_pattern = /var/core_dumps/core.%e.%p.%h.%t
fs.suid_dumpable = 2
# Kernel related tunings
kernel.printk = 4 4 1 7
kernel.core_uses_pid = 1
kernel.sysrq = 0
kernel.msgmax = 65536
kernel.msgmnb = 65536
kernel.shmmax = 243314299699            # Maximum shared segment size in bytes
kernel.shmall = 66003228                # Maximum total shared memory in pages
vm.nr_hugepages = 4096                  # Reserve static huge pages
vm.swappiness = 0                       # Set vm.swappiness to 0 to minimize swapping
vm.min_free_kbytes = 2640129            # required free memory (set to 1% of physical RAM)
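As a side note, these settings can be loaded and spot-checked without a reboot; a minimal sketch (assuming the file above is the live /etc/sysctl.conf) would be:

    sysctl -p /etc/sysctl.conf
    sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem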
iobenchmark result:

Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M write: IOPS=1111, BW=1111MiB/s (1165MB/s)(3072MiB/2764msec); 0 zone resets
BS=128K write: IOPS=3812, BW=477MiB/s (500MB/s)(3072MiB/6446msec); 0 zone resets
BS=64K write: IOPS=5116, BW=320MiB/s (335MB/s)(3072MiB/9607msec); 0 zone resets
BS=32K write: IOPS=6545, BW=205MiB/s (214MB/s)(3072MiB/15018msec); 0 zone resets
BS=16K write: IOPS=8004, BW=125MiB/s (131MB/s)(3072MiB/24561msec); 0 zone resets
BS=4K write: IOPS=8661, BW=33.8MiB/s (35.5MB/s)(3072MiB/90801msec); 0 zone resets

Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M read: IOPS=1117, BW=1117MiB/s (1171MB/s)(3072MiB/2750msec)
BS=128K read: IOPS=8353, BW=1044MiB/s (1095MB/s)(3072MiB/2942msec)
BS=64K read: IOPS=11.8k, BW=739MiB/s (775MB/s)(3072MiB/4155msec)
BS=32K read: IOPS=16.3k, BW=508MiB/s (533MB/s)(3072MiB/6049msec)
BS=16K read: IOPS=23.0k, BW=375MiB/s (393MB/s)(3072MiB/8195msec)
BS=4K read: IOPS=27.4k, BW=107MiB/s (112MB/s)(3072MiB/28740msec)

Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M write: IOPS=1102, BW=1103MiB/s (1156MB/s)(3072MiB/2786msec); 0 zone resets
BS=128K write: IOPS=8581, BW=1073MiB/s (1125MB/s)(3072MiB/2864msec); 0 zone resets
BS=64K write: IOPS=10.9k, BW=681MiB/s (714MB/s)(3072MiB/4511msec); 0 zone resets
BS=32K write: IOPS=12.1k, BW=378MiB/s (396MB/s)(3072MiB/8129msec); 0 zone resets
BS=16K write: IOPS=12.7k, BW=198MiB/s (208MB/s)(3072MiB/15487msec); 0 zone resets
BS=4K write: IOPS=12.7k, BW=49.7MiB/s (52.1MB/s)(3072MiB/61848msec); 0 zone resets

Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M read: IOPS=1113, BW=1114MiB/s (1168MB/s)(3072MiB/2758msec)
BS=128K read: IOPS=8953, BW=1119MiB/s (1173MB/s)(3072MiB/2745msec)
BS=64K read: IOPS=17.9k, BW=1116MiB/s (1170MB/s)(3072MiB/2753msec)
BS=32K read: IOPS=35.1k, BW=1096MiB/s (1150MB/s)(3072MiB/2802msec)
BS=16K read: IOPS=69.4k, BW=1085MiB/s (1138MB/s)(3072MiB/2831msec)
BS=4K read: IOPS=112k, BW=438MiB/s (459MB/s)(3072MiB/7015msec)

*Everything looks good except 4K speeds:*

Seq Write  - BS=4K write: IOPS=8661, BW=33.8MiB/s (35.5MB/s)(3072MiB/90801msec); 0 zone resets
Rand Write - BS=4K write: IOPS=12.7k, BW=49.7MiB/s (52.1MB/s)(3072MiB/61848msec); 0 zone resets

What do you think?
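For what it's worth, the iobench.sh linked in the quoted message below presumably boils down to fio invocations roughly like this hypothetical sketch (the job name, the test directory, and the exact flags are assumptions, not taken from the script):

    # hypothetical approximation of the 4K random-write case; adjust --rw/--bs per test
    fio --name=randwrite-4k --directory=/mnt/cephfs/testdir \
        --rw=randwrite --bs=4k --size=1G --numjobs=3 --iodepth=32 \
        --direct=1 --ioengine=libaio --group_reporting

If those parameters hold, numjobs=3 x iodepth=32 keeps about 96 writes in flight, so 12.7k IOPS at 4K works out to roughly 96 / 12700 ≈ 7.5 ms average completion time per write. That would point at per-operation commit latency on the client-to-OSD/replication path rather than link bandwidth, which would also explain why the larger block sizes can still saturate the 10Gbit link.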
On Sat, 27 Jan 2024 at 04:08, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:

> Wow, I noticed something!
>
> To prevent RAM overflow with GPU training allocations, I'm using a 2TB
> Samsung 870 EVO for swap.
>
> As you can see below, swap usage was 18Gi while the server was idle, which
> means the ceph client may be hitting latency because of the swap usage.
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# free -h
>                total        used        free      shared  buff/cache   available
> Mem:            62Gi        34Gi        27Gi       0.0Ki       639Mi        27Gi
> Swap:          1.8Ti        18Gi       1.8Ti
>
> I decided to play around with kernel parameters to prevent ceph swap usage.
>
>> kernel.shmmax = 60654764851   # Maximum shared segment size in bytes
>> kernel.shmall = 16453658      # Maximum total shared memory in pages
>> vm.nr_hugepages = 4096        # Reserve static huge pages
>> vm.swappiness = 0             # Set vm.swappiness to 0 to minimize swapping
>> vm.min_free_kbytes = 1048576  # required free memory (set to 1% of physical RAM)
>
> I rebooted the server and after the reboot swap usage is 0 as expected.
>
> To give it a try I started iobench.sh
> (https://github.com/ozkangoksu/benchmark/blob/main/iobench.sh).
> This client has a 1G NIC only. As you can see below, other than at 4K block
> size the ceph client can saturate the NIC.
>
> root@bmw-m4:~# nicstat -MUz 1
>     Time      Int   rMbps   wMbps    rPk/s    wPk/s    rAvs    wAvs  %rUtil  %wUtil
> 01:04:48   ens1f0   936.9   92.90  91196.8  60126.3  1346.6   202.5    98.2    9.74
>
> root@bmw-m4:/mounts/ud-data/benchuser1/96f13211-c37f-42db-8d05-f3255a05129e/testdir# bash iobench.sh
> Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27395msec); 0 zone resets
> BS=128K write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27462msec); 0 zone resets
> BS=64K write: IOPS=1758, BW=110MiB/s (115MB/s)(3072MiB/27948msec); 0 zone resets
> BS=32K write: IOPS=3542, BW=111MiB/s (116MB/s)(3072MiB/27748msec); 0 zone resets
> BS=16K write: IOPS=6839, BW=107MiB/s (112MB/s)(3072MiB/28747msec); 0 zone resets
> BS=4K write: IOPS=8473, BW=33.1MiB/s (34.7MB/s)(3072MiB/92813msec); 0 zone resets
> Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27386msec)
> BS=128K read: IOPS=895, BW=112MiB/s (117MB/s)(3072MiB/27431msec)
> BS=64K read: IOPS=1788, BW=112MiB/s (117MB/s)(3072MiB/27486msec)
> BS=32K read: IOPS=3561, BW=111MiB/s (117MB/s)(3072MiB/27603msec)
> BS=16K read: IOPS=6924, BW=108MiB/s (113MB/s)(3072MiB/28392msec)
> BS=4K read: IOPS=21.3k, BW=83.3MiB/s (87.3MB/s)(3072MiB/36894msec)
> Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27406msec); 0 zone resets
> BS=128K write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27466msec); 0 zone resets
> BS=64K write: IOPS=1781, BW=111MiB/s (117MB/s)(3072MiB/27591msec); 0 zone resets
> BS=32K write: IOPS=3545, BW=111MiB/s (116MB/s)(3072MiB/27729msec); 0 zone resets
> BS=16K write: IOPS=6823, BW=107MiB/s (112MB/s)(3072MiB/28814msec); 0 zone resets
> BS=4K write: IOPS=12.7k, BW=49.8MiB/s (52.2MB/s)(3072MiB/61694msec); 0 zone resets
> Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27388msec)
> BS=128K read: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27479msec)
> BS=64K read: IOPS=1784, BW=112MiB/s (117MB/s)(3072MiB/27547msec)
> BS=32K read: IOPS=3559, BW=111MiB/s (117MB/s)(3072MiB/27614msec)
> BS=16K read: IOPS=7047, BW=110MiB/s (115MB/s)(3072MiB/27897msec)
> BS=4K read: IOPS=26.9k, BW=105MiB/s (110MB/s)(3072MiB/29199msec)
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702# cat metrics
> item                              total
> ------------------------------------------
> opened files  / total inodes      0 / 109
> pinned i_caps / total inodes      109 / 109
> opened inodes / total inodes      0 / 109
>
> item       total      avg_lat(us)  min_lat(us)  max_lat(us)  stdev(us)
> -----------------------------------------------------------------------------------
> read       2316289    13904        221          8827984      760
> write      2317824    21152        2975         9243821      2365
> metadata   170        5944         225          202505       24314
>
> item       total      avg_sz(bytes)  min_sz(bytes)  max_sz(bytes)  total_sz(bytes)
> ----------------------------------------------------------------------------------------
> read       2316289    16688          4096           1048576        38654712361
> write      2317824    19457          4096           4194304        45097156608
>
> item       total      miss       hit
> -------------------------------------------------
> d_lease    112        3          858
> caps       109        58         6963547
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1818702# free -h
>                total        used        free      shared  buff/cache   available
> Mem:            62Gi        11Gi        50Gi       3.0Mi       1.0Gi        49Gi
> Swap:          1.8Ti          0B       1.8Ti
>
> I started to feel we are getting closer :)
>
> On Sat, 27 Jan 2024 at 02:58, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
>
>> I started to investigate my clients.
>>
>> For example:
>>
>> root@ud-01:~# ceph health detail
>> HEALTH_WARN 1 clients failing to respond to cache pressure
>> [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
>>     mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to respond to cache pressure client_id: 1275577
>>
>> root@ud-01:~# ceph fs status
>> ud-data - 86 clients
>> =======
>> RANK  STATE            MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
>>  0    active  ud-data.ud-02.xcoojt  Reqs: 34 /s  2926k  2827k   155k  1157k
>>
>> ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | "clientid: \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases), request_load_avg: \(.request_load_avg), num_completed_requests: \(.num_completed_requests), num_completed_flushes: \(.num_completed_flushes)"' | sort -n -t: -k3
>>
>> clientid: *1275577*= num_caps: 12312, num_leases: 0, request_load_avg: 0, num_completed_requests: 0, num_completed_flushes: 1
>> clientid: 1275571= num_caps: 16307, num_leases: 1, request_load_avg: 2101, num_completed_requests: 0, num_completed_flushes: 3
>> clientid: 1282130= num_caps: 26337, num_leases: 3, request_load_avg: 116, num_completed_requests: 0, num_completed_flushes: 1
>> clientid: 1191789= num_caps: 32784, num_leases: 0, request_load_avg: 1846, num_completed_requests: 0, num_completed_flushes: 0
>> clientid: 1275535= num_caps: 79825, num_leases: 2, request_load_avg: 133, num_completed_requests: 8, num_completed_flushes: 8
>> clientid: 1282142= num_caps: 80581, num_leases: 6, request_load_avg: 125, num_completed_requests: 2, num_completed_flushes: 6
>> clientid: 1275532= num_caps: 87836, num_leases: 3, request_load_avg: 190, num_completed_requests: 2, num_completed_flushes: 6
>> clientid: 1275547= num_caps: 94129, num_leases: 4, request_load_avg: 149, num_completed_requests: 2, num_completed_flushes: 4
>> clientid: 1275553= num_caps: 96460, num_leases: 4, request_load_avg: 155, num_completed_requests: 2, num_completed_flushes: 8
>> clientid: 1282139= num_caps: 108882, num_leases: 25, request_load_avg: 99, num_completed_requests: 2, num_completed_flushes: 4
>> clientid: 1275538= num_caps: 437162, num_leases: 0, request_load_avg: 101, num_completed_requests: 2, num_completed_flushes: 0
>>
>> --------------------------------------
>>
>> *MY CLIENT:*
>>
>> The client is actually at idle mode and there is no reason for it to fail at all.
>> >> root@bmw-m4:~# apt list --installed |grep ceph >> ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed] >> libcephfs2/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 >> [installed,automatic] >> python3-ceph-argparse/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 >> [installed,automatic] >> python3-ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 all >> [installed,automatic] >> python3-cephfs/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 >> [installed,automatic] >> >> Let's check metrics and stats: >> >> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> cat metrics >> item total >> ------------------------------------------ >> opened files / total inodes 2 / 12312 >> pinned i_caps / total inodes 12312 / 12312 >> opened inodes / total inodes 1 / 12312 >> >> item total avg_lat(us) min_lat(us) max_lat(us) >> stdev(us) >> >> ----------------------------------------------------------------------------------- >> read 22283 44409 430 1804853 >> 15619 >> write 112702 419725 3658 8879541 >> 6008 >> metadata 353322 5712 154 917903 >> 5357 >> >> item total avg_sz(bytes) min_sz(bytes) max_sz(bytes) >> total_sz(bytes) >> >> ---------------------------------------------------------------------------------------- >> read 22283 1701940 1 4194304 >> 37924318602 >> write 112702 246211 1 4194304 >> 27748469309 >> >> item total miss hit >> ------------------------------------------------- >> d_lease 62 63627 28564698 >> caps 12312 36658 44568261 >> >> >> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> cat bdi/stats >> BdiWriteback: 0 kB >> BdiReclaimable: 800 kB >> BdiDirtyThresh: 0 kB >> DirtyThresh: 5795340 kB >> BackgroundThresh: 2894132 kB >> BdiDirtied: 27316320 kB >> BdiWritten: 27316320 kB >> BdiWriteBandwidth: 1472 kBps >> b_dirty: 0 >> b_io: 0 >> b_more_io: 0 >> b_dirty_time: 0 >> bdi_list: 1 >> state: 1 >> >> >> Last 3 days dmesg output: >> >> [Wed Jan 24 16:45:13 2024] xfsettingsd[653036]: segfault at 18 ip >> 00007fbd12f5d337 sp 00007ffd254332a0 error 4 in >> libxklavier.so.16.4.0[7fbd12f4d000+19000] >> [Wed Jan 24 16:45:13 2024] Code: 4c 89 e7 e8 0b 56 ff ff 48 89 03 48 8b >> 5c 24 30 e9 d1 fd ff ff e8 b9 5b ff ff 66 0f 1f 84 00 00 00 00 00 41 54 55 >> 48 89 f5 53 <48> 8b 42 18 48 89 d1 49 89 fc 48 89 d3 48 89 fa 48 89 ef 48 >> 8b b0 >> [Thu Jan 25 06:51:31 2024] NVRM: GPU at PCI:0000:81:00: >> GPU-02efbb18-c9e4-3a16-d615-598959520b99 >> [Thu Jan 25 06:51:31 2024] NVRM: GPU Board Serial Number: 1321421015411 >> [Thu Jan 25 06:51:31 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=683281, >> name=python, Ch 00000008 >> [Thu Jan 25 06:56:49 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=683377, >> name=python, Ch 00000018 >> [Thu Jan 25 20:14:13 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=696062, >> name=python, Ch 00000008 >> [Fri Jan 26 04:05:40 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=700166, >> name=python, Ch 00000008 >> [Fri Jan 26 05:05:12 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=700320, >> name=python, Ch 00000008 >> [Fri Jan 26 05:44:50 2024] NVRM: GPU at PCI:0000:82:00: >> GPU-3af62a2c-e7eb-a7d5-c073-22f06dc7065f >> [Fri Jan 26 05:44:50 2024] NVRM: GPU Board Serial Number: 1321421010400 >> [Fri Jan 26 05:44:50 2024] NVRM: Xid (PCI:0000:82:00): 43, pid=700757, >> name=python, Ch 00000018 >> [Fri Jan 26 05:56:02 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=701096, >> name=python, Ch 00000028 >> [Fri Jan 26 06:34:20 2024] NVRM: Xid (PCI:0000:81:00): 43, pid=701226, >> name=python, Ch 00000038 >> 
>> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> free -h >> total used free shared buff/cache >> available >> Mem: 62Gi 34Gi 27Gi 0.0Ki 639Mi >> 27Gi >> Swap: 1.8Ti 18Gi 1.8Ti >> >> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# >> cat /proc/vmstat >> nr_free_pages 7231171 >> nr_zone_inactive_anon 7924766 >> nr_zone_active_anon 525190 >> nr_zone_inactive_file 44029 >> nr_zone_active_file 55966 >> nr_zone_unevictable 13042 >> nr_zone_write_pending 3 >> nr_mlock 13042 >> nr_bounce 0 >> nr_zspages 0 >> nr_free_cma 0 >> numa_hit 6701928919 >> numa_miss 312628341 >> numa_foreign 312628341 >> numa_interleave 31538 >> numa_local 6701864751 >> numa_other 312692567 >> nr_inactive_anon 7924766 >> nr_active_anon 525190 >> nr_inactive_file 44029 >> nr_active_file 55966 >> nr_unevictable 13042 >> nr_slab_reclaimable 61076 >> nr_slab_unreclaimable 63509 >> nr_isolated_anon 0 >> nr_isolated_file 0 >> workingset_nodes 3934 >> workingset_refault_anon 30325493 >> workingset_refault_file 14593094 >> workingset_activate_anon 5376050 >> workingset_activate_file 3250679 >> workingset_restore_anon 292317 >> workingset_restore_file 1166673 >> workingset_nodereclaim 488665 >> nr_anon_pages 8451968 >> nr_mapped 35731 >> nr_file_pages 138824 >> nr_dirty 3 >> nr_writeback 0 >> nr_writeback_temp 0 >> nr_shmem 242 >> nr_shmem_hugepages 0 >> nr_shmem_pmdmapped 0 >> nr_file_hugepages 0 >> nr_file_pmdmapped 0 >> nr_anon_transparent_hugepages 3588 >> nr_vmscan_write 33746573 >> nr_vmscan_immediate_reclaim 160 >> nr_dirtied 48165341 >> nr_written 80207893 >> nr_kernel_misc_reclaimable 0 >> nr_foll_pin_acquired 174002 >> nr_foll_pin_released 174002 >> nr_kernel_stack 60032 >> nr_page_table_pages 46041 >> nr_swapcached 36166 >> nr_dirty_threshold 1448010 >> nr_dirty_background_threshold 723121 >> pgpgin 129904699 >> pgpgout 299261581 >> pswpin 30325493 >> pswpout 45158221 >> pgalloc_dma 1024 >> pgalloc_dma32 57788566 >> pgalloc_normal 6956384725 >> pgalloc_movable 0 >> allocstall_dma 0 >> allocstall_dma32 0 >> allocstall_normal 188 >> allocstall_movable 63024 >> pgskip_dma 0 >> pgskip_dma32 0 >> pgskip_normal 0 >> pgskip_movable 0 >> pgfree 7222273815 >> pgactivate 1371753960 >> pgdeactivate 18329381 >> pglazyfree 10 >> pgfault 7795723861 >> pgmajfault 4600007 >> pglazyfreed 0 >> pgrefill 18575528 >> pgreuse 81910383 >> pgsteal_kswapd 980532060 >> pgsteal_direct 38942066 >> pgdemote_kswapd 0 >> pgdemote_direct 0 >> pgscan_kswapd 1135293298 >> pgscan_direct 58883653 >> pgscan_direct_throttle 15 >> pgscan_anon 220939938 >> pgscan_file 973237013 >> pgsteal_anon 46538607 >> pgsteal_file 972935519 >> zone_reclaim_failed 0 >> pginodesteal 0 >> slabs_scanned 25879882 >> kswapd_inodesteal 2179831 >> kswapd_low_wmark_hit_quickly 152797 >> kswapd_high_wmark_hit_quickly 32025 >> pageoutrun 204447 >> pgrotated 44963935 >> drop_pagecache 0 >> drop_slab 0 >> oom_kill 0 >> numa_pte_updates 2724410955 >> numa_huge_pte_updates 1695890 >> numa_hint_faults 1739823254 >> numa_hint_faults_local 1222358972 >> numa_pages_migrated 312611639 >> pgmigrate_success 510846802 >> pgmigrate_fail 875493 >> thp_migration_success 156413 >> thp_migration_fail 2 >> thp_migration_split 0 >> compact_migrate_scanned 1274073243 >> compact_free_scanned 8430842597 >> compact_isolated 400278352 >> compact_stall 145300 >> compact_fail 128562 >> compact_success 16738 >> compact_daemon_wake 170247 >> compact_daemon_migrate_scanned 35486283 >> compact_daemon_free_scanned 369870412 >> 
htlb_buddy_alloc_success 0 >> htlb_buddy_alloc_fail 0 >> unevictable_pgs_culled 2774290 >> unevictable_pgs_scanned 0 >> unevictable_pgs_rescued 2675031 >> unevictable_pgs_mlocked 2813622 >> unevictable_pgs_munlocked 2674972 >> unevictable_pgs_cleared 84231 >> unevictable_pgs_stranded 84225 >> thp_fault_alloc 416468 >> thp_fault_fallback 19181 >> thp_fault_fallback_charge 0 >> thp_collapse_alloc 17931 >> thp_collapse_alloc_failed 76 >> thp_file_alloc 0 >> thp_file_fallback 0 >> thp_file_fallback_charge 0 >> thp_file_mapped 0 >> thp_split_page 2 >> thp_split_page_failed 0 >> thp_deferred_split_page 66 >> thp_split_pmd 22451 >> thp_split_pud 0 >> thp_zero_page_alloc 1 >> thp_zero_page_alloc_failed 0 >> thp_swpout 22332 >> thp_swpout_fallback 0 >> balloon_inflate 0 >> balloon_deflate 0 >> balloon_migrate 0 >> swap_ra 25777929 >> swap_ra_hit 25658825 >> direct_map_level2_splits 1249 >> direct_map_level3_splits 49 >> nr_unstable 0 >> >> >> >> Özkan Göksu <ozkangksu@xxxxxxxxx>, 27 Oca 2024 Cmt, 02:36 tarihinde şunu >> yazdı: >> >>> Hello Frank. >>> >>> I have 84 clients (high-end servers) with: Ubuntu 20.04.5 LTS - Kernel: >>> Linux 5.4.0-125-generic >>> >>> My cluster 17.2.6 quincy. >>> I have some client nodes with "ceph-common/stable,now 17.2.7-1focal" I >>> wonder using new version clients is the main problem? >>> Maybe I have a communication error. For example I hit this problem and I >>> can not collect client stats " >>> https://github.com/ceph/ceph/pull/52127/files" >>> >>> Best regards. >>> >>> >>> >>> Frank Schilder <frans@xxxxxx>, 26 Oca 2024 Cum, 14:53 tarihinde şunu >>> yazdı: >>> >>>> Hi, this message is one of those that are often spurious. I don't >>>> recall in which thread/PR/tracker I read it, but the story was something >>>> like that: >>>> >>>> If an MDS gets under memory pressure it will request dentry items back >>>> from *all* clients, not just the active ones or the ones holding many of >>>> them. If you have a client that's below the min-threshold for dentries (its >>>> one of the client/mds tuning options), it will not respond. This client >>>> will be flagged as not responding, which is a false positive. >>>> >>>> I believe the devs are working on a fix to get rid of these spurious >>>> warnings. There is a "bug/feature" in the MDS that does not clear this >>>> warning flag for inactive clients. Hence, the message hangs and never >>>> disappears. I usually clear it with a "echo 3 > /proc/sys/vm/drop_caches" >>>> on the client. However, except for being annoying in the dashboard, it has >>>> no performance or otherwise negative impact. >>>> >>>> Best regards, >>>> ================= >>>> Frank Schilder >>>> AIT Risø Campus >>>> Bygning 109, rum S14 >>>> >>>> ________________________________________ >>>> From: Eugen Block <eblock@xxxxxx> >>>> Sent: Friday, January 26, 2024 10:05 AM >>>> To: Özkan Göksu >>>> Cc: ceph-users@xxxxxxx >>>> Subject: Re: 1 clients failing to respond to cache >>>> pressure (quincy:17.2.6) >>>> >>>> Performance for small files is more about IOPS rather than throughput, >>>> and the IOPS in your fio tests look okay to me. What you could try is >>>> to split the PGs to get around 150 or 200 PGs per OSD. You're >>>> currently at around 60 according to the ceph osd df output. Before you >>>> do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data | >>>> head'? I don't need the whole output, just to see how many objects >>>> each PG has. 
We had a case once where that helped, but it was an older >>>> cluster and the pool was backed by HDDs and separate rocksDB on SSDs. >>>> So this might not be the solution here, but it could improve things as >>>> well. >>>> >>>> >>>> Zitat von Özkan Göksu <ozkangksu@xxxxxxxxx>: >>>> >>>> > Every user has a 1x subvolume and I only have 1 pool. >>>> > At the beginning we were using each subvolume for ldap home directory >>>> + >>>> > user data. >>>> > When a user logins any docker on any host, it was using the cluster >>>> for >>>> > home and the for user related data, we was have second directory in >>>> the >>>> > same subvolume. >>>> > Time to time users were feeling a very slow home environment and >>>> after a >>>> > month it became almost impossible to use home. VNC sessions became >>>> > unresponsive and slow etc. >>>> > >>>> > 2 weeks ago, I had to migrate home to a ZFS storage and now the >>>> overall >>>> > performance is better for only user_data without home. >>>> > But still the performance is not good enough as I expected because of >>>> the >>>> > problems related to MDS. >>>> > The usage is low but allocation is high and Cpu usage is high. You >>>> saw the >>>> > IO Op/s, it's nothing but allocation is high. >>>> > >>>> > I develop a fio benchmark script and I run the script on 4x test >>>> server at >>>> > the same time, the results are below: >>>> > Script: >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh >>>> > >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt >>>> > >>>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt >>>> > >>>> > While running benchmark, I take sample values for each type of >>>> iobench run. >>>> > >>>> > Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr >>>> > client: 60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr >>>> > client: 13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr >>>> > >>>> > Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr >>>> > client: 370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr >>>> > >>>> > Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr >>>> > client: 14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr >>>> > client: 6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr >>>> > >>>> > Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 >>>> > client: 317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr >>>> > client: 2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr >>>> > client: 4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr >>>> > client: 2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr >>>> > >>>> > It seems I only have problems with the 4K,8K,16K other sector sizes. >>>> > >>>> > >>>> > >>>> > >>>> > Eugen Block <eblock@xxxxxx>, 25 Oca 2024 Per, 19:06 tarihinde şunu >>>> yazdı: >>>> > >>>> >> I understand that your MDS shows a high CPU usage, but other than >>>> that >>>> >> what is your performance issue? Do users complain? 
Do some operations >>>> >> take longer than expected? Are OSDs saturated during those phases? >>>> >> Because the cache pressure messages don’t necessarily mean that users >>>> >> will notice. >>>> >> MDS daemons are single-threaded so that might be a bottleneck. In >>>> that >>>> >> case multi-active mds might help, which you already tried and >>>> >> experienced OOM killers. But you might have to disable the mds >>>> >> balancer as someone else mentioned. And then you could think about >>>> >> pinning, is it possible to split the CephFS into multiple >>>> >> subdirectories and pin them to different ranks? >>>> >> But first I’d still like to know what the performance issue really >>>> is. >>>> >> >>>> >> Zitat von Özkan Göksu <ozkangksu@xxxxxxxxx>: >>>> >> >>>> >> > I will try my best to explain my situation. >>>> >> > >>>> >> > I don't have a separate mds server. I have 5 identical nodes, 3 of >>>> them >>>> >> > mons, and I use the other 2 as active and standby mds. (currently >>>> I have >>>> >> > left overs from max_mds 4) >>>> >> > >>>> >> > root@ud-01:~# ceph -s >>>> >> > cluster: >>>> >> > id: e42fd4b0-313b-11ee-9a00-31da71873773 >>>> >> > health: HEALTH_WARN >>>> >> > 1 clients failing to respond to cache pressure >>>> >> > >>>> >> > services: >>>> >> > mon: 3 daemons, quorum ud-01,ud-02,ud-03 (age 9d) >>>> >> > mgr: ud-01.qycnol(active, since 8d), standbys: ud-02.tfhqfd >>>> >> > mds: 1/1 daemons up, 4 standby >>>> >> > osd: 80 osds: 80 up (since 9d), 80 in (since 5M) >>>> >> > >>>> >> > data: >>>> >> > volumes: 1/1 healthy >>>> >> > pools: 3 pools, 2305 pgs >>>> >> > objects: 106.58M objects, 25 TiB >>>> >> > usage: 45 TiB used, 101 TiB / 146 TiB avail >>>> >> > pgs: 2303 active+clean >>>> >> > 2 active+clean+scrubbing+deep >>>> >> > >>>> >> > io: >>>> >> > client: 16 MiB/s rd, 3.4 MiB/s wr, 77 op/s rd, 23 op/s wr >>>> >> > >>>> >> > ------------------------------ >>>> >> > root@ud-01:~# ceph fs status >>>> >> > ud-data - 84 clients >>>> >> > ======= >>>> >> > RANK STATE MDS ACTIVITY DNS INOS >>>> DIRS >>>> >> > CAPS >>>> >> > 0 active ud-data.ud-02.xcoojt Reqs: 40 /s 2579k 2578k >>>> 169k >>>> >> > 3048k >>>> >> > POOL TYPE USED AVAIL >>>> >> > cephfs.ud-data.meta metadata 136G 44.9T >>>> >> > cephfs.ud-data.data data 44.3T 44.9T >>>> >> > >>>> >> > ------------------------------ >>>> >> > root@ud-01:~# ceph health detail >>>> >> > HEALTH_WARN 1 clients failing to respond to cache pressure >>>> >> > [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache >>>> pressure >>>> >> > mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to >>>> respond to >>>> >> > cache pressure client_id: 1275577 >>>> >> > >>>> >> > ------------------------------ >>>> >> > When I check the failing client with session ls I see only >>>> "num_caps: >>>> >> 12298" >>>> >> > >>>> >> > ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | >>>> "clientid: >>>> >> > \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases), >>>> >> > request_load_avg: \(.request_load_avg), num_completed_requests: >>>> >> > \(.num_completed_requests), num_completed_flushes: >>>> >> > \(.num_completed_flushes)"' | sort -n -t: -k3 >>>> >> > >>>> >> > clientid: 1275577= num_caps: 12298, num_leases: 0, >>>> request_load_avg: 0, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1294542= num_caps: 13000, num_leases: 12, >>>> request_load_avg: >>>> >> 105, >>>> >> > num_completed_requests: 0, num_completed_flushes: 6 >>>> >> > clientid: 1282187= num_caps: 16869, num_leases: 1, >>>> 
request_load_avg: 0, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1275589= num_caps: 18943, num_leases: 0, >>>> request_load_avg: 52, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1282154= num_caps: 24747, num_leases: 1, >>>> request_load_avg: 57, >>>> >> > num_completed_requests: 2, num_completed_flushes: 2 >>>> >> > clientid: 1275553= num_caps: 25120, num_leases: 2, >>>> request_load_avg: 116, >>>> >> > num_completed_requests: 2, num_completed_flushes: 8 >>>> >> > clientid: 1282142= num_caps: 27185, num_leases: 6, >>>> request_load_avg: 128, >>>> >> > num_completed_requests: 0, num_completed_flushes: 8 >>>> >> > clientid: 1275535= num_caps: 40364, num_leases: 6, >>>> request_load_avg: 111, >>>> >> > num_completed_requests: 2, num_completed_flushes: 8 >>>> >> > clientid: 1282130= num_caps: 41483, num_leases: 0, >>>> request_load_avg: 135, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1275547= num_caps: 42953, num_leases: 4, >>>> request_load_avg: 119, >>>> >> > num_completed_requests: 2, num_completed_flushes: 6 >>>> >> > clientid: 1282139= num_caps: 45435, num_leases: 27, >>>> request_load_avg: 84, >>>> >> > num_completed_requests: 2, num_completed_flushes: 34 >>>> >> > clientid: 1282136= num_caps: 48374, num_leases: 8, >>>> request_load_avg: 0, >>>> >> > num_completed_requests: 1, num_completed_flushes: 1 >>>> >> > clientid: 1275532= num_caps: 48664, num_leases: 7, >>>> request_load_avg: 115, >>>> >> > num_completed_requests: 2, num_completed_flushes: 8 >>>> >> > clientid: 1191789= num_caps: 130319, num_leases: 0, >>>> request_load_avg: >>>> >> 1753, >>>> >> > num_completed_requests: 0, num_completed_flushes: 0 >>>> >> > clientid: 1275571= num_caps: 139488, num_leases: 0, >>>> request_load_avg: 2, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > clientid: 1282133= num_caps: 145487, num_leases: 0, >>>> request_load_avg: 8, >>>> >> > num_completed_requests: 1, num_completed_flushes: 1 >>>> >> > clientid: 1534496= num_caps: 1041316, num_leases: 0, >>>> request_load_avg: 0, >>>> >> > num_completed_requests: 0, num_completed_flushes: 1 >>>> >> > >>>> >> > ------------------------------ >>>> >> > When I check the dashboard/service/mds I see %120+ CPU usage on >>>> active >>>> >> MDS >>>> >> > but on the host everything is almost idle and disk waits are very >>>> low. 
>>>> >> > >>>> >> > avg-cpu: %user %nice %system %iowait %steal %idle >>>> >> > 0.61 0.00 0.38 0.41 0.00 98.60 >>>> >> > >>>> >> > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz >>>> w/s >>>> >> > wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s >>>> >> %drqm >>>> >> > d_await dareq-sz f/s f_await aqu-sz %util >>>> >> > sdc 2.00 0.01 0.00 0.00 0.50 6.00 >>>> 20.00 >>>> >> > 0.04 0.00 0.00 0.50 2.00 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 10.00 0.60 0.02 1.20 >>>> >> > sdd 3.00 0.02 0.00 0.00 0.67 8.00 >>>> 285.00 >>>> >> > 1.84 77.00 21.27 0.44 6.61 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 114.00 0.83 0.22 22.40 >>>> >> > sde 1.00 0.01 0.00 0.00 1.00 8.00 >>>> 36.00 >>>> >> > 0.08 3.00 7.69 0.64 2.33 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 18.00 0.67 0.04 1.60 >>>> >> > sdf 5.00 0.04 0.00 0.00 0.40 7.20 >>>> 40.00 >>>> >> > 0.09 3.00 6.98 0.53 2.30 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 20.00 0.70 0.04 2.00 >>>> >> > sdg 11.00 0.08 0.00 0.00 0.73 7.27 >>>> 36.00 >>>> >> > 0.09 4.00 10.00 0.50 2.44 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 18.00 0.72 0.04 3.20 >>>> >> > sdh 5.00 0.03 0.00 0.00 0.60 5.60 >>>> 46.00 >>>> >> > 0.10 2.00 4.17 0.59 2.17 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 23.00 0.83 0.05 2.80 >>>> >> > sdi 7.00 0.04 0.00 0.00 0.43 6.29 >>>> 36.00 >>>> >> > 0.07 1.00 2.70 0.47 2.11 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 18.00 0.61 0.03 2.40 >>>> >> > sdj 5.00 0.04 0.00 0.00 0.80 7.20 >>>> 42.00 >>>> >> > 0.09 1.00 2.33 0.67 2.10 0.00 0.00 0.00 >>>> >> 0.00 >>>> >> > 0.00 0.00 21.00 0.81 0.05 3.20 >>>> >> > >>>> >> > ------------------------------ >>>> >> > Other than this 5x node cluster, I also have a 3x node cluster with >>>> >> > identical hardware but it serves for a different purpose and data >>>> >> workload. >>>> >> > In this cluster I don't have any problem and MDS default settings >>>> seems >>>> >> > enough. >>>> >> > The only difference between two cluster is, 5x node cluster used >>>> directly >>>> >> > by users, 3x node cluster used heavily to read and write data via >>>> >> projects >>>> >> > not by users. So allocate and de-allocate will be better. >>>> >> > >>>> >> > I guess I just have a problematic use case on the 5x node cluster >>>> and as >>>> >> I >>>> >> > mentioned above, I might have the similar problem but I don't know >>>> how to >>>> >> > debug it. >>>> >> > >>>> >> > >>>> >> >>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/ >>>> >> > quote:"A user running VSCodium, keeping 15k caps open.. the >>>> opportunistic >>>> >> > caps recall eventually starts recalling those but the (el7 kernel) >>>> client >>>> >> > won't release them. Stopping Codium seems to be the only way to >>>> release." 
>>>> >> > >>>> >> > ------------------------------ >>>> >> > Before reading the osd df you should know that I created 2x >>>> >> > OSD/per"CT4000MX500SSD1" >>>> >> > # ceph osd df tree >>>> >> > ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP >>>> >> META >>>> >> > AVAIL %USE VAR PGS STATUS TYPE NAME >>>> >> > -1 145.54321 - 146 TiB 45 TiB 44 TiB 119 >>>> GiB 333 >>>> >> > GiB 101 TiB 30.81 1.00 - root default >>>> >> > -3 29.10864 - 29 TiB 8.9 TiB 8.8 TiB 25 >>>> GiB 66 >>>> >> > GiB 20 TiB 30.54 0.99 - host ud-01 >>>> >> > 0 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.4 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 33.04 1.07 61 up osd.0 >>>> >> > 1 ssd 1.81929 1.00000 1.8 TiB 527 GiB 521 GiB 1.5 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.28 0.92 53 up osd.1 >>>> >> > 2 ssd 1.81929 1.00000 1.8 TiB 595 GiB 589 GiB 2.3 >>>> GiB 4.0 >>>> >> > GiB 1.2 TiB 31.96 1.04 63 up osd.2 >>>> >> > 3 ssd 1.81929 1.00000 1.8 TiB 527 GiB 521 GiB 1.8 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 28.30 0.92 55 up osd.3 >>>> >> > 4 ssd 1.81929 1.00000 1.8 TiB 525 GiB 520 GiB 1.3 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.21 0.92 52 up osd.4 >>>> >> > 5 ssd 1.81929 1.00000 1.8 TiB 592 GiB 586 GiB 1.8 >>>> GiB 3.8 >>>> >> > GiB 1.2 TiB 31.76 1.03 61 up osd.5 >>>> >> > 6 ssd 1.81929 1.00000 1.8 TiB 559 GiB 553 GiB 1.8 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 30.03 0.97 57 up osd.6 >>>> >> > 7 ssd 1.81929 1.00000 1.8 TiB 602 GiB 597 GiB 836 >>>> MiB 4.4 >>>> >> > GiB 1.2 TiB 32.32 1.05 58 up osd.7 >>>> >> > 8 ssd 1.81929 1.00000 1.8 TiB 614 GiB 609 GiB 1.2 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.98 1.07 60 up osd.8 >>>> >> > 9 ssd 1.81929 1.00000 1.8 TiB 571 GiB 565 GiB 2.2 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 30.67 1.00 61 up osd.9 >>>> >> > 10 ssd 1.81929 1.00000 1.8 TiB 528 GiB 522 GiB 1.3 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.33 0.92 52 up osd.10 >>>> >> > 11 ssd 1.81929 1.00000 1.8 TiB 551 GiB 546 GiB 1.5 >>>> GiB 3.6 >>>> >> > GiB 1.3 TiB 29.57 0.96 56 up osd.11 >>>> >> > 12 ssd 1.81929 1.00000 1.8 TiB 594 GiB 588 GiB 1.8 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 31.91 1.04 61 up osd.12 >>>> >> > 13 ssd 1.81929 1.00000 1.8 TiB 561 GiB 555 GiB 1.1 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 30.10 0.98 55 up osd.13 >>>> >> > 14 ssd 1.81929 1.00000 1.8 TiB 616 GiB 609 GiB 1.9 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 33.04 1.07 64 up osd.14 >>>> >> > 15 ssd 1.81929 1.00000 1.8 TiB 525 GiB 520 GiB 1.1 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.20 0.92 51 up osd.15 >>>> >> > -5 29.10864 - 29 TiB 9.0 TiB 8.9 TiB 22 >>>> GiB 67 >>>> >> > GiB 20 TiB 30.89 1.00 - host ud-02 >>>> >> > 16 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.7 >>>> GiB 4.7 >>>> >> > GiB 1.2 TiB 33.12 1.08 63 up osd.16 >>>> >> > 17 ssd 1.81929 1.00000 1.8 TiB 582 GiB 577 GiB 1.6 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 31.26 1.01 59 up osd.17 >>>> >> > 18 ssd 1.81929 1.00000 1.8 TiB 583 GiB 578 GiB 418 >>>> MiB 4.0 >>>> >> > GiB 1.3 TiB 31.29 1.02 54 up osd.18 >>>> >> > 19 ssd 1.81929 1.00000 1.8 TiB 550 GiB 544 GiB 1.5 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 29.50 0.96 56 up osd.19 >>>> >> > 20 ssd 1.81929 1.00000 1.8 TiB 551 GiB 546 GiB 1.1 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 29.57 0.96 54 up osd.20 >>>> >> > 21 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.3 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 33.04 1.07 60 up osd.21 >>>> >> > 22 ssd 1.81929 1.00000 1.8 TiB 573 GiB 567 GiB 1.6 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 30.75 1.00 58 up osd.22 >>>> >> > 23 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.3 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 33.06 1.07 60 up osd.23 >>>> >> > 24 ssd 1.81929 1.00000 1.8 TiB 539 GiB 534 GiB 844 >>>> 
MiB 3.8 >>>> >> > GiB 1.3 TiB 28.92 0.94 51 up osd.24 >>>> >> > 25 ssd 1.81929 1.00000 1.8 TiB 583 GiB 576 GiB 2.1 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 31.27 1.02 61 up osd.25 >>>> >> > 26 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.3 >>>> GiB 4.6 >>>> >> > GiB 1.2 TiB 33.12 1.08 61 up osd.26 >>>> >> > 27 ssd 1.81929 1.00000 1.8 TiB 537 GiB 532 GiB 1.2 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.84 0.94 53 up osd.27 >>>> >> > 28 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.3 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 28.29 0.92 53 up osd.28 >>>> >> > 29 ssd 1.81929 1.00000 1.8 TiB 594 GiB 588 GiB 1.5 >>>> GiB 4.6 >>>> >> > GiB 1.2 TiB 31.91 1.04 59 up osd.29 >>>> >> > 30 ssd 1.81929 1.00000 1.8 TiB 528 GiB 523 GiB 1.4 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.35 0.92 53 up osd.30 >>>> >> > 31 ssd 1.81929 1.00000 1.8 TiB 594 GiB 589 GiB 1.6 >>>> GiB 3.8 >>>> >> > GiB 1.2 TiB 31.89 1.03 61 up osd.31 >>>> >> > -7 29.10864 - 29 TiB 8.9 TiB 8.8 TiB 23 >>>> GiB 67 >>>> >> > GiB 20 TiB 30.66 1.00 - host ud-03 >>>> >> > 32 ssd 1.81929 1.00000 1.8 TiB 593 GiB 588 GiB 1.1 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 31.84 1.03 57 up osd.32 >>>> >> > 33 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.8 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 33.13 1.08 63 up osd.33 >>>> >> > 34 ssd 1.81929 1.00000 1.8 TiB 537 GiB 532 GiB 2.0 >>>> GiB 3.8 >>>> >> > GiB 1.3 TiB 28.84 0.94 59 up osd.34 >>>> >> > 35 ssd 1.81929 1.00000 1.8 TiB 562 GiB 556 GiB 1.7 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 30.16 0.98 58 up osd.35 >>>> >> > 36 ssd 1.81929 1.00000 1.8 TiB 529 GiB 523 GiB 1.3 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.38 0.92 52 up osd.36 >>>> >> > 37 ssd 1.81929 1.00000 1.8 TiB 527 GiB 521 GiB 1.7 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 28.28 0.92 55 up osd.37 >>>> >> > 38 ssd 1.81929 1.00000 1.8 TiB 574 GiB 568 GiB 1.2 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 30.79 1.00 55 up osd.38 >>>> >> > 39 ssd 1.81929 1.00000 1.8 TiB 605 GiB 599 GiB 1.6 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 32.48 1.05 61 up osd.39 >>>> >> > 40 ssd 1.81929 1.00000 1.8 TiB 573 GiB 567 GiB 1.2 >>>> GiB 4.4 >>>> >> > GiB 1.3 TiB 30.76 1.00 56 up osd.40 >>>> >> > 41 ssd 1.81929 1.00000 1.8 TiB 526 GiB 520 GiB 1.7 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.21 0.92 54 up osd.41 >>>> >> > 42 ssd 1.81929 1.00000 1.8 TiB 613 GiB 608 GiB 1010 >>>> MiB 4.4 >>>> >> > GiB 1.2 TiB 32.91 1.07 58 up osd.42 >>>> >> > 43 ssd 1.81929 1.00000 1.8 TiB 606 GiB 600 GiB 1.7 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 32.51 1.06 61 up osd.43 >>>> >> > 44 ssd 1.81929 1.00000 1.8 TiB 583 GiB 577 GiB 1.6 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 31.29 1.02 60 up osd.44 >>>> >> > 45 ssd 1.81929 1.00000 1.8 TiB 618 GiB 613 GiB 1.4 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 33.18 1.08 62 up osd.45 >>>> >> > 46 ssd 1.81929 1.00000 1.8 TiB 550 GiB 544 GiB 1.5 >>>> GiB 4.2 >>>> >> > GiB 1.3 TiB 29.50 0.96 54 up osd.46 >>>> >> > 47 ssd 1.81929 1.00000 1.8 TiB 526 GiB 522 GiB 692 >>>> MiB 3.7 >>>> >> > GiB 1.3 TiB 28.25 0.92 50 up osd.47 >>>> >> > -9 29.10864 - 29 TiB 9.0 TiB 8.9 TiB 26 >>>> GiB 68 >>>> >> > GiB 20 TiB 31.04 1.01 - host ud-04 >>>> >> > 48 ssd 1.81929 1.00000 1.8 TiB 540 GiB 534 GiB 2.2 >>>> GiB 3.6 >>>> >> > GiB 1.3 TiB 28.96 0.94 58 up osd.48 >>>> >> > 49 ssd 1.81929 1.00000 1.8 TiB 617 GiB 611 GiB 1.4 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 33.11 1.07 61 up osd.49 >>>> >> > 50 ssd 1.81929 1.00000 1.8 TiB 618 GiB 612 GiB 1.2 >>>> GiB 4.8 >>>> >> > GiB 1.2 TiB 33.17 1.08 61 up osd.50 >>>> >> > 51 ssd 1.81929 1.00000 1.8 TiB 618 GiB 612 GiB 1.5 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 33.19 1.08 61 up osd.51 >>>> >> > 52 ssd 1.81929 1.00000 1.8 TiB 526 
GiB 521 GiB 1.4 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.25 0.92 53 up osd.52 >>>> >> > 53 ssd 1.81929 1.00000 1.8 TiB 618 GiB 611 GiB 2.4 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 33.17 1.08 66 up osd.53 >>>> >> > 54 ssd 1.81929 1.00000 1.8 TiB 550 GiB 544 GiB 1.5 >>>> GiB 4.3 >>>> >> > GiB 1.3 TiB 29.54 0.96 55 up osd.54 >>>> >> > 55 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.3 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.29 0.92 52 up osd.55 >>>> >> > 56 ssd 1.81929 1.00000 1.8 TiB 525 GiB 519 GiB 1.2 >>>> GiB 4.1 >>>> >> > GiB 1.3 TiB 28.16 0.91 52 up osd.56 >>>> >> > 57 ssd 1.81929 1.00000 1.8 TiB 615 GiB 609 GiB 2.3 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 33.03 1.07 65 up osd.57 >>>> >> > 58 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.6 >>>> GiB 3.7 >>>> >> > GiB 1.3 TiB 28.31 0.92 55 up osd.58 >>>> >> > 59 ssd 1.81929 1.00000 1.8 TiB 615 GiB 609 GiB 1.2 >>>> GiB 4.6 >>>> >> > GiB 1.2 TiB 33.01 1.07 60 up osd.59 >>>> >> > 60 ssd 1.81929 1.00000 1.8 TiB 594 GiB 588 GiB 1.2 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 31.88 1.03 59 up osd.60 >>>> >> > 61 ssd 1.81929 1.00000 1.8 TiB 616 GiB 610 GiB 1.9 >>>> GiB 4.1 >>>> >> > GiB 1.2 TiB 33.04 1.07 64 up osd.61 >>>> >> > 62 ssd 1.81929 1.00000 1.8 TiB 620 GiB 614 GiB 1.9 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 33.27 1.08 63 up osd.62 >>>> >> > 63 ssd 1.81929 1.00000 1.8 TiB 527 GiB 522 GiB 1.5 >>>> GiB 4.0 >>>> >> > GiB 1.3 TiB 28.30 0.92 53 up osd.63 >>>> >> > -11 29.10864 - 29 TiB 9.0 TiB 8.9 TiB 23 >>>> GiB 65 >>>> >> > GiB 20 TiB 30.91 1.00 - host ud-05 >>>> >> > 64 ssd 1.81929 1.00000 1.8 TiB 608 GiB 601 GiB 2.3 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.62 1.06 65 up osd.64 >>>> >> > 65 ssd 1.81929 1.00000 1.8 TiB 606 GiB 601 GiB 628 >>>> MiB 4.2 >>>> >> > GiB 1.2 TiB 32.53 1.06 57 up osd.65 >>>> >> > 66 ssd 1.81929 1.00000 1.8 TiB 583 GiB 578 GiB 1.3 >>>> GiB 4.3 >>>> >> > GiB 1.2 TiB 31.31 1.02 57 up osd.66 >>>> >> > 67 ssd 1.81929 1.00000 1.8 TiB 537 GiB 533 GiB 436 >>>> MiB 3.6 >>>> >> > GiB 1.3 TiB 28.82 0.94 50 up osd.67 >>>> >> > 68 ssd 1.81929 1.00000 1.8 TiB 541 GiB 535 GiB 2.5 >>>> GiB 3.8 >>>> >> > GiB 1.3 TiB 29.04 0.94 59 up osd.68 >>>> >> > 69 ssd 1.81929 1.00000 1.8 TiB 606 GiB 601 GiB 1.1 >>>> GiB 4.4 >>>> >> > GiB 1.2 TiB 32.55 1.06 59 up osd.69 >>>> >> > 70 ssd 1.81929 1.00000 1.8 TiB 604 GiB 598 GiB 1.8 >>>> GiB 4.1 >>>> >> > GiB 1.2 TiB 32.44 1.05 63 up osd.70 >>>> >> > 71 ssd 1.81929 1.00000 1.8 TiB 606 GiB 600 GiB 1.9 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.53 1.06 62 up osd.71 >>>> >> > 72 ssd 1.81929 1.00000 1.8 TiB 602 GiB 598 GiB 612 >>>> MiB 4.1 >>>> >> > GiB 1.2 TiB 32.33 1.05 57 up osd.72 >>>> >> > 73 ssd 1.81929 1.00000 1.8 TiB 571 GiB 565 GiB 1.8 >>>> GiB 4.5 >>>> >> > GiB 1.3 TiB 30.65 0.99 58 up osd.73 >>>> >> > 74 ssd 1.81929 1.00000 1.8 TiB 608 GiB 602 GiB 1.8 >>>> GiB 4.2 >>>> >> > GiB 1.2 TiB 32.62 1.06 61 up osd.74 >>>> >> > 75 ssd 1.81929 1.00000 1.8 TiB 536 GiB 531 GiB 1.9 >>>> GiB 3.5 >>>> >> > GiB 1.3 TiB 28.80 0.93 57 up osd.75 >>>> >> > 76 ssd 1.81929 1.00000 1.8 TiB 605 GiB 599 GiB 1.4 >>>> GiB 4.5 >>>> >> > GiB 1.2 TiB 32.48 1.05 60 up osd.76 >>>> >> > 77 ssd 1.81929 1.00000 1.8 TiB 537 GiB 532 GiB 1.2 >>>> GiB 3.9 >>>> >> > GiB 1.3 TiB 28.84 0.94 52 up osd.77 >>>> >> > 78 ssd 1.81929 1.00000 1.8 TiB 525 GiB 520 GiB 1.3 >>>> GiB 3.8 >>>> >> > GiB 1.3 TiB 28.20 0.92 52 up osd.78 >>>> >> > 79 ssd 1.81929 1.00000 1.8 TiB 536 GiB 531 GiB 1.1 >>>> GiB 3.3 >>>> >> > GiB 1.3 TiB 28.76 0.93 53 up osd.79 >>>> >> > TOTAL 146 TiB 45 TiB 44 TiB 119 >>>> GiB 333 >>>> >> > GiB 101 TiB 30.81 >>>> >> > MIN/MAX VAR: 0.91/1.08 STDDEV: 1.90 
>>>> >> > >>>> >> > >>>> >> > >>>> >> > Eugen Block <eblock@xxxxxx>, 25 Oca 2024 Per, 16:52 tarihinde şunu >>>> >> yazdı: >>>> >> > >>>> >> >> There is no definitive answer wrt mds tuning. As it is everywhere >>>> >> >> mentioned, it's about finding the right setup for your specific >>>> >> >> workload. If you can synthesize your workload (maybe scale down a >>>> bit) >>>> >> >> try optimizing it in a test cluster without interrupting your >>>> >> >> developers too much. >>>> >> >> But what you haven't explained yet is what are you experiencing >>>> as a >>>> >> >> performance issue? Do you have numbers or a detailed description? >>>> >> >> From the fs status output you didn't seem to have too much >>>> activity >>>> >> >> going on (around 140 requests per second), but that's probably >>>> not the >>>> >> >> usual traffic? What does ceph report in its client IO output? >>>> >> >> Can you paste the 'ceph osd df' output as well? >>>> >> >> Do you have dedicated MDS servers or are they colocated with other >>>> >> >> services? >>>> >> >> >>>> >> >> Zitat von Özkan Göksu <ozkangksu@xxxxxxxxx>: >>>> >> >> >>>> >> >> > Hello Eugen. >>>> >> >> > >>>> >> >> > I read all of your MDS related topics and thank you so much for >>>> your >>>> >> >> effort >>>> >> >> > on this. >>>> >> >> > There is not much information and I couldn't find a MDS tuning >>>> guide >>>> >> at >>>> >> >> > all. It seems that you are the correct person to discuss mds >>>> >> debugging >>>> >> >> and >>>> >> >> > tuning. >>>> >> >> > >>>> >> >> > Do you have any documents or may I learn what is the proper way >>>> to >>>> >> debug >>>> >> >> > MDS and clients ? >>>> >> >> > Which debug logs will guide me to understand the limitations >>>> and will >>>> >> >> help >>>> >> >> > to tune according to the data flow? >>>> >> >> > >>>> >> >> > While searching, I find this: >>>> >> >> > >>>> >> >> >>>> >> >>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/ >>>> >> >> > quote:"A user running VSCodium, keeping 15k caps open.. the >>>> >> opportunistic >>>> >> >> > caps recall eventually starts recalling those but the (el7 >>>> kernel) >>>> >> client >>>> >> >> > won't release them. Stopping Codium seems to be the only way to >>>> >> release." >>>> >> >> > >>>> >> >> > Because of this I think I also need to play around with the >>>> client >>>> >> side >>>> >> >> too. >>>> >> >> > >>>> >> >> > My main goal is increasing the speed and reducing the latency >>>> and I >>>> >> >> wonder >>>> >> >> > if these ideas are correct or not: >>>> >> >> > - Maybe I need to increase client side cache size because via >>>> each >>>> >> >> client, >>>> >> >> > multiple users request a lot of objects and clearly the >>>> >> >> > client_cache_size=16 default is not enough. >>>> >> >> > - Maybe I need to increase client side maximum cache limit for >>>> >> >> > object "client_oc_max_objects=1000 to 10000" and data >>>> >> >> "client_oc_size=200mi >>>> >> >> > to 400mi" >>>> >> >> > - The client cache cleaning threshold is not aggressive enough >>>> to keep >>>> >> >> the >>>> >> >> > free cache size in the desired range. I need to make it >>>> aggressive but >>>> >> >> this >>>> >> >> > should not reduce speed and increase latency. >>>> >> >> > >>>> >> >> > mds_cache_memory_limit=4gi to 16gi >>>> >> >> > client_oc_max_objects=1000 to 10000 >>>> >> >> > client_oc_size=200mi to 400mi >>>> >> >> > client_permissions=false #to reduce latency. 
>>>> >> >> > client_cache_size=16 to 128
>>>> >> >> >
>>>> >> >> > What do you think?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx