Dear Yan,

thank you for taking care of this. I removed all snapshots and stopped snapshot creation. Please keep me posted.
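For completeness, since snapshots are the prime suspect: CephFS snapshots are plain directories under the hidden .snap directory of the snapshotted path, so stopping creation means pausing the rolling-snapshot job, and the cleanup itself is only a series of rmdir calls. A rough sketch; the mount point and snapshot name below are placeholders, not our actual ones:

    # list snapshots taken at the file system root
    ls /mnt/cephfs/.snap

    # rolling snapshots are created like this ...
    mkdir /mnt/cephfs/.snap/scheduled-2019-05-20

    # ... and removed like this
    rmdir /mnt/cephfs/.snap/scheduled-2019-05-20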
Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Yan, Zheng <ukernel@xxxxxxxxx>
Sent: 20 May 2019 13:34:07
To: Frank Schilder
Cc: Stefan Kooman; ceph-users@xxxxxxxxxxxxxx
Subject: Re: mimic: MDS standby-replay causing blocked ops (MDS bug?)

On Sat, May 18, 2019 at 5:47 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear Yan and Stefan,
>
> it happened again and there were only a few ops in the queue. I pulled the ops list and the cache. Please find a zip file here: "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l". It's a bit more than 100 MB.
>

The MDS cache dump shows that a snapshot is involved. Please avoid using snapshots until we fix the bug.

Regards
Yan, Zheng
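For reference, the ops list and the cache dump discussed above come from the admin socket of the active MDS. A minimal sketch, run on the MDS host; the daemon name ceph-08 is taken from the status output quoted further down and should be adjusted accordingly:

    # list the current (including blocked) requests on the active MDS
    ceph daemon mds.ceph-08 dump_ops_in_flight

    # dump the MDS cache to a file on the MDS host (can be large)
    ceph daemon mds.ceph-08 dump cache /tmp/mds-cache.dump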
> The active MDS failed over to the standby after or during the dump cache operation. Is this expected? As a result, the cluster is healthy and I can't do further diagnostics. In case you need more information, we have to wait until next time.
>
> Some further observations:
>
> There was no load on the system. I am starting to suspect that this is not a load-induced event. It is also not caused by excessive atime updates; the FS is mounted with relatime. Could it have to do with the large level-2 network (ca. 550 client servers in the same broadcast domain)? I include our kernel tuning profile below, just in case. The cluster networks (back and front) are isolated VLANs, no gateways, no routing.
>
> We run rolling snapshots on the file system. I didn't observe any problems with this, but am wondering if this might be related. We currently have 30 snapshots in total. Here is the output of status and pool ls:
>
> [root@ceph-01 ~]# ceph status    # before the MDS failed over
>   cluster:
>     id:     ###
>     health: HEALTH_WARN
>             1 MDSs report slow requests
>
>   services:
>     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
>     mgr: ceph-01(active), standbys: ceph-02, ceph-03
>     mds: con-fs-1/1/1 up {0=ceph-08=up:active}, 1 up:standby
>     osd: 192 osds: 192 up, 192 in
>
>   data:
>     pools:   5 pools, 750 pgs
>     objects: 6.35 M objects, 5.2 TiB
>     usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
>     pgs:     750 active+clean
>
> [root@ceph-01 ~]# ceph status    # after cache dump and the MDS failed over
>   cluster:
>     id:     ###
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
>     mgr: ceph-01(active), standbys: ceph-02, ceph-03
>     mds: con-fs-1/1/1 up {0=ceph-12=up:active}, 1 up:standby
>     osd: 192 osds: 192 up, 192 in
>
>   data:
>     pools:   5 pools, 750 pgs
>     objects: 6.33 M objects, 5.2 TiB
>     usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
>     pgs:     749 active+clean
>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>
> [root@ceph-01 ~]# ceph osd pool ls detail    # after the MDS failed over
> pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 80 pgp_num 80 last_change 486 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
>         removed_snaps [1~5]
> pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 300 pgp_num 300 last_change 1759 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 274877906944000 stripe_width 24576 compression_mode aggressive application rbd
>         removed_snaps [1~3]
> pool 3 'sr-rbd-one-stretch' replicated size 4 min_size 2 crush_rule 2 object_hash rjenkins pg_num 20 pgp_num 20 last_change 500 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 0 compression_mode aggressive application rbd
>         removed_snaps [1~7]
> pool 4 'con-fs-meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 428 flags hashpspool,nodelete max_bytes 1099511627776 stripe_width 0 application cephfs
> pool 5 'con-fs-data' erasure size 10 min_size 8 crush_rule 6 object_hash rjenkins pg_num 300 pgp_num 300 last_change 2561 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 32768 compression_mode aggressive application cephfs
>         removed_snaps [2~3d,41~2a,6d~2a,99~c,a6~1e,c6~18,df~3,e3~1,e5~3,e9~1,eb~3,ef~1,f1~1,f3~1,f5~3,f9~1,fb~3,ff~1,101~1,103~1,105~1,107~1,109~1,10b~1,10d~1,10f~1,111~1]
>
> The relevant pools are con-fs-meta and con-fs-data.
>
> Best regards,
> Frank
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> [root@ceph-08 ~]# cat /etc/tuned/ceph/tuned.conf
> [main]
> summary=Settings for ceph cluster. Derived from throughput-performance.
> include=throughput-performance
>
> [vm]
> transparent_hugepages=never
>
> [sysctl]
> # See also:
> # - https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> # - https://www.kernel.org/doc/Documentation/sysctl/net.txt
> # - https://cromwell-intl.com/open-source/performance-tuning/tcp.html
> # - https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/
> # - https://www.spinics.net/lists/ceph-devel/msg21721.html
>
> # Set available PIDs and open files to maximum possible.
> kernel.pid_max=4194304
> fs.file-max=26234859
>
> # Swap options, reduce swappiness.
> vm.zone_reclaim_mode=0
> #vm.dirty_ratio = 20
> vm.dirty_bytes = 629145600
> vm.dirty_background_bytes = 314572800
> vm.swappiness=10
> vm.min_free_kbytes=8388608
>
> # Increase ARP cache size to accommodate the large level-2 client network.
> net.ipv4.neigh.default.gc_thresh1 = 1024
> net.ipv4.neigh.default.gc_thresh2 = 2048
> net.ipv4.neigh.default.gc_thresh3 = 4096
> # net.ipv4.neigh.default.gc_interval = 3600
> # net.ipv4.neigh.default.gc_stale_time = 3600
>
> # Increase autotuning TCP buffer limits
> # 10G fiber/64MB buffers (67108864)
> net.core.rmem_max = 67108864
> net.core.wmem_max = 67108864
> net.core.rmem_default = 67108864
> net.core.wmem_default = 67108864
> net.core.optmem_max = 40960
> net.ipv4.tcp_rmem = 22500 218450 67108864
> net.ipv4.tcp_wmem = 22500 81920 67108864
>
> ## Increase number of incoming connections. The value can be raised to handle bursts of requests, default is 128.
> net.core.somaxconn = 2048
>
> ## Increase number of incoming connections backlog, default is 1000
> net.core.netdev_max_backlog = 50000
>
> ## Maximum number of remembered connection requests, default is 128
> net.ipv4.tcp_max_syn_backlog = 30000
>
> ## Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks, default is 8192
> # Not needed in isolated network. Default (131072) should be fine.
> #net.ipv4.tcp_max_tw_buckets = 2000000
>
> # Recycle and reuse TIME_WAIT sockets faster, default is 0 for both
> # Does not exist: net.ipv4.tcp_tw_recycle = 1
> # net.ipv4.tcp_tw_reuse = 1
>
> ## Decrease TIME_WAIT seconds, default is 30 seconds
> net.ipv4.tcp_fin_timeout = 10
>
> ## Tells the system whether it should start at the default window size only for TCP connections
> ## that have been idle for too long, default is 1
> net.ipv4.tcp_slow_start_after_idle = 0
>
> # If your servers talk UDP, also raise these limits, default is 4096
> net.ipv4.udp_rmem_min = 8192
> net.ipv4.udp_wmem_min = 8192
>
> ## Disable source redirects
> ## Default is 1
> # net.ipv4.conf.all.send_redirects = 0
> # net.ipv4.conf.all.accept_redirects = 0
>
> ## Disable source routing, default is 0
> # net.ipv4.conf.all.accept_source_route = 0
>
> ####################################################################
> ## Default values collected on different servers prior to tuning. ##
> ####################################################################
>
> # MON hosts:
> #vm.min_free_kbytes = 90112
> #net.ipv4.tcp_mem = 381111 508151 762222
> #net.ipv4.tcp_rmem = 4096 87380 6291456
> #net.ipv4.tcp_wmem = 4096 16384 4194304
>
> # OSD hosts:
> #vm.min_free_kbytes = 90112
> #net.ipv4.tcp_mem = 767376 1023171 1534752
> #net.ipv4.tcp_rmem = 4096 87380 6291456
> #net.ipv4.tcp_wmem = 4096 16384 4194304
>
> # MDS hosts:
> #net.ipv4.tcp_mem = 1539954 2053272 3079908

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
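One note on the tuned profile quoted above: a custom profile of this kind is installed as /etc/tuned/<profile-name>/tuned.conf and activated with tuned-adm. A minimal sketch, assuming the profile name "ceph" implied by the path in the mail:

    # activate the custom profile (applies the [vm] and [sysctl] settings)
    tuned-adm profile ceph

    # verify which profile is active
    tuned-adm active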