Re: mimic: MDS standby-replay causing blocked ops (MDS bug?)

On Sat, May 18, 2019 at 5:47 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear Yan and Stefan,
>
> It happened again and there were only a few ops in the queue. I pulled the ops list and the cache. Please find a zip file here: "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l". It's a bit more than 100 MB.
>
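> (For reference, the ops list and cache dump were taken on the active MDS
> host via the admin socket; presumably something like the following, assuming
> the daemon name mds.ceph-08 and an arbitrary dump path:)
>
> [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump_blocked_ops
> [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump_ops_in_flight
> [root@ceph-08 ~]# ceph daemon mds.ceph-08 dump cache /tmp/mds-cache.dump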

The MDS cache dump shows a snapshot-related issue. Please avoid using
snapshots until we fix the bug.
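
(For reference, existing snapshots can be listed and removed from any client
that has the file system mounted, via the hidden .snap directory; a sketch,
assuming a hypothetical mount point /mnt/cephfs and snapshot name:)

  ls /mnt/cephfs/.snap
  rmdir /mnt/cephfs/.snap/<snapshot-name>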

Regards
Yan, Zheng

> The active MDS failed over to the standby after or during the dump cache operation. Is this expected? As a result, the cluster is healthy and I can't do further diagnostics. In case you need more information, we have to wait until next time.
>
> Some further observations:
>
> There was no load on the system. I am starting to suspect that this is not a load-induced event. It is also not caused by excessive atime updates; the FS is mounted with relatime. Could it have to do with the large level-2 network (ca. 550 client servers in the same broadcast domain)? I include our kernel tuning profile below, just in case. The cluster networks (back and front) are isolated VLANs, no gateways, no routing.
>
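> (A quick way to gauge neighbor-table pressure on such a flat network is to
> compare the current ARP entry count against the gc_thresh limits from the
> tuning profile below; a sketch:)
>
> [root@ceph-01 ~]# ip -4 neigh show | wc -l
> [root@ceph-01 ~]# sysctl net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
>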
> We run rolling snapshots on the file system. I didn't observe any problems with this, but am wondering if it might be related. We currently have 30 snapshots in total. Here is the output of ceph status and ceph osd pool ls detail:
>
> [root@ceph-01 ~]# ceph status # before the MDS failed over
>   cluster:
>     id: ###
>     health: HEALTH_WARN
>             1 MDSs report slow requests
>
>   services:
>     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
>     mgr: ceph-01(active), standbys: ceph-02, ceph-03
>     mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
>     osd: 192 osds: 192 up, 192 in
>
>   data:
>     pools:   5 pools, 750 pgs
>     objects: 6.35 M objects, 5.2 TiB
>     usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
>     pgs:     750 active+clean
>
> [root@ceph-01 ~]# ceph status # after cache dump and the MDS failed over
>   cluster:
>     id: ###
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
>     mgr: ceph-01(active), standbys: ceph-02, ceph-03
>     mds: con-fs-1/1/1 up  {0=ceph-12=up:active}, 1 up:standby
>     osd: 192 osds: 192 up, 192 in
>
>   data:
>     pools:   5 pools, 750 pgs
>     objects: 6.33 M objects, 5.2 TiB
>     usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
>     pgs:     749 active+clean
>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>
> [root@ceph-01 ~]# ceph osd pool ls detail # after the MDS failed over
> pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 80 pgp_num 80 last_change 486 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
>         removed_snaps [1~5]
> pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 300 pgp_num 300 last_change 1759 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 274877906944000 stripe_width 24576 compression_mode aggressive application rbd
>         removed_snaps [1~3]
> pool 3 'sr-rbd-one-stretch' replicated size 4 min_size 2 crush_rule 2 object_hash rjenkins pg_num 20 pgp_num 20 last_change 500 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 0 compression_mode aggressive application rbd
>         removed_snaps [1~7]
> pool 4 'con-fs-meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 428 flags hashpspool,nodelete max_bytes 1099511627776 stripe_width 0 application cephfs
> pool 5 'con-fs-data' erasure size 10 min_size 8 crush_rule 6 object_hash rjenkins pg_num 300 pgp_num 300 last_change 2561 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 32768 compression_mode aggressive application cephfs
>         removed_snaps [2~3d,41~2a,6d~2a,99~c,a6~1e,c6~18,df~3,e3~1,e5~3,e9~1,eb~3,ef~1,f1~1,f3~1,f5~3,f9~1,fb~3,ff~1,101~1,103~1,105~1,107~1,109~1,10b~1,10d~1,10f~1,111~1]
>
> The relevant pools are con-fs-meta and con-fs-data.
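>
> (The pool-to-file-system mapping can be double-checked with ceph fs ls, e.g.:)
>
> [root@ceph-01 ~]# ceph fs ls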
>
> Best regards,
> Frank
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> [root@ceph-08 ~]# cat /etc/tuned/ceph/tuned.conf
> [main]
> summary=Settings for ceph cluster. Derived from throughput-performance.
> include=throughput-performance
>
> [vm]
> transparent_hugepages=never
>
> [sysctl]
> # See also:
> # - https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> # - https://www.kernel.org/doc/Documentation/sysctl/net.txt
> # - https://cromwell-intl.com/open-source/performance-tuning/tcp.html
> # - https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/
> # - https://www.spinics.net/lists/ceph-devel/msg21721.html
>
> # Set available PIDs and open files to maximum possible.
> kernel.pid_max=4194304
> fs.file-max=26234859
>
> # Swap options, reduce swappiness.
> vm.zone_reclaim_mode=0
> #vm.dirty_ratio = 20
> vm.dirty_bytes = 629145600
> vm.dirty_background_bytes = 314572800
> vm.swappiness=10
> vm.min_free_kbytes=8388608
>
> # Increase ARP cache size to accommodate large level-2 client network.
> net.ipv4.neigh.default.gc_thresh1 = 1024
> net.ipv4.neigh.default.gc_thresh2 = 2048
> net.ipv4.neigh.default.gc_thresh3 = 4096
> # net.ipv4.neigh.default.gc_interval = 3600
> # net.ipv4.neigh.default.gc_stale_time = 3600
>
> # Increase autotuning TCP buffer limits
> # 10G fiber/64MB buffers (67108864)
> net.core.rmem_max = 67108864
> net.core.wmem_max = 67108864
> net.core.rmem_default = 67108864
> net.core.wmem_default = 67108864
> net.core.optmem_max = 40960
> net.ipv4.tcp_rmem = 22500       218450 67108864
> net.ipv4.tcp_wmem = 22500  81920 67108864
>
> ## Increase the number of incoming connections; the value can be raised to handle bursts of requests. Default is 128.
> net.core.somaxconn = 2048
>
> ## Increase number of incoming connections backlog, default is 1000
> net.core.netdev_max_backlog = 50000
>
> ## Maximum number of remembered connection requests, default is 128
> net.ipv4.tcp_max_syn_backlog = 30000
>
> ## Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks, default is 8192
> # Not needed in isolated network. Default (131072) should be fine.
> #net.ipv4.tcp_max_tw_buckets = 2000000
>
> # Recycle and Reuse TIME_WAIT sockets faster, default is 0 for both
> # Does not exist: net.ipv4.tcp_tw_recycle = 1
> # net.ipv4.tcp_tw_reuse = 1
>
> ## Decrease TIME_WAIT seconds, default is 30 seconds
> net.ipv4.tcp_fin_timeout = 10
>
> ## Tells the system whether it should start at the default window size only for TCP connections
> ## that have been idle for too long, default is 1
> net.ipv4.tcp_slow_start_after_idle = 0
>
> # If your servers talk UDP, also raise these limits, default is 4096
> net.ipv4.udp_rmem_min = 8192
> net.ipv4.udp_wmem_min = 8192
>
> ## Disable source redirects
> ## Default is 1
> # net.ipv4.conf.all.send_redirects = 0
> # net.ipv4.conf.all.accept_redirects = 0
>
> ## Disable source routing, default is 0
> # net.ipv4.conf.all.accept_source_route = 0
>
> ####################################################################
> ## Default values collected on different servers prior to tuning. ##
> ####################################################################
>
> # MON hosts:
> #vm.min_free_kbytes = 90112
> #net.ipv4.tcp_mem = 381111      508151  762222
> #net.ipv4.tcp_rmem = 4096       87380   6291456
> #net.ipv4.tcp_wmem = 4096       16384   4194304
>
> # OSD hosts:
> #vm.min_free_kbytes = 90112
> #net.ipv4.tcp_mem = 767376      1023171 1534752
> #net.ipv4.tcp_rmem = 4096       87380   6291456
> #net.ipv4.tcp_wmem = 4096       16384   4194304
>
> # MDS hosts:
> #net.ipv4.tcp_mem = 1539954     2053272 3079908
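>
> (The profile is applied on each host with tuned-adm; a sketch, assuming the
> profile directory name "ceph" as above:)
>
> [root@ceph-08 ~]# tuned-adm profile ceph
> [root@ceph-08 ~]# tuned-adm active
>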
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



