Re: mimic: MDS standby-replay causing blocked ops (MDS bug?)

Dear Yan and Stefan,

it happened again and there were only a few ops in the queue. I pulled the ops list and the cache. Please find a zip file here: https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l . It's a bit more than 100 MB.
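For completeness, this is roughly how I pulled the data via the admin socket on the active MDS (the output file names are just examples):

[root@ceph-08 ~]# ceph daemon mds.ceph-08 dump_ops_in_flight > /tmp/mds-ops.json
[root@ceph-08 ~]# ceph daemon mds.ceph-08 dump cache /tmp/mds-cache.dump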

The active MDS failed over to the standby during or right after the dump cache operation. Is this expected? As a result, the cluster is healthy again and I can't do any further diagnostics. In case you need more information, we will have to wait until it happens again.
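If the failover was caused by the cache dump stalling the MDS long enough to miss its beacons, I suppose it could be avoided by temporarily raising the beacon grace period before the next dump. This is just a guess, I haven't verified that this is what happened:

[root@ceph-01 ~]# ceph config set global mds_beacon_grace 60   # guess; default is 15 s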

Some further observations:

There was no load on the system, so I am starting to suspect that this is not a load-induced event. It is also not caused by excessive atime updates; the FS is mounted with relatime. Could it have to do with the large level-2 network (ca. 550 client servers in the same broadcast domain)? I include our kernel tuning profile below, just in case. The cluster networks (back and front) are isolated VLANs, no gateways, no routing.
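In case the level-2 theory is worth pursuing, this is how I would check whether the neighbour table is anywhere near its limits (a sketch):

[root@ceph-08 ~]# ip -4 neigh show | wc -l           # compare against gc_thresh3 (4096 in our profile)
[root@ceph-08 ~]# dmesg | grep -i 'table overflow'   # the kernel logs this when the table is full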

We run rolling snapshots on the file system. I didn't observe any problems with this, but am wondering if it might be related. We currently have 30 snapshots in total. The rotation uses ordinary CephFS .snap directories; the script essentially does the following (mount point and names are examples):
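[root@client ~]# mkdir /mnt/cephfs/.snap/daily-$(date +%F)   # create a new snapshot
[root@client ~]# rmdir /mnt/cephfs/.snap/daily-2019-01-01    # expire the oldest one

Here is the output of status and pool ls: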

[root@ceph-01 ~]# ceph status # before the MDS failed over
  cluster:
    id: ###
    health: HEALTH_WARN
            1 MDSs report slow requests
 
  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-02, ceph-03
    mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
    osd: 192 osds: 192 up, 192 in
 
  data:
    pools:   5 pools, 750 pgs
    objects: 6.35 M objects, 5.2 TiB
    usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
    pgs:     750 active+clean
 
[root@ceph-01 ~]# ceph status # after cache dump and the MDS failed over
  cluster:
    id: ###
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-02, ceph-03
    mds: con-fs-1/1/1 up  {0=ceph-12=up:active}, 1 up:standby
    osd: 192 osds: 192 up, 192 in
 
  data:
    pools:   5 pools, 750 pgs
    objects: 6.33 M objects, 5.2 TiB
    usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
    pgs:     749 active+clean
             1   active+clean+scrubbing+deep
 
  io:
    client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr

[root@ceph-01 ~]# ceph osd pool ls detail # after the MDS failed over
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 80 pgp_num 80 last_change 486 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
	removed_snaps [1~5]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 300 pgp_num 300 last_change 1759 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 274877906944000 stripe_width 24576 compression_mode aggressive application rbd
	removed_snaps [1~3]
pool 3 'sr-rbd-one-stretch' replicated size 4 min_size 2 crush_rule 2 object_hash rjenkins pg_num 20 pgp_num 20 last_change 500 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 0 compression_mode aggressive application rbd
	removed_snaps [1~7]
pool 4 'con-fs-meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 428 flags hashpspool,nodelete max_bytes 1099511627776 stripe_width 0 application cephfs
pool 5 'con-fs-data' erasure size 10 min_size 8 crush_rule 6 object_hash rjenkins pg_num 300 pgp_num 300 last_change 2561 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 219902325555200 stripe_width 32768 compression_mode aggressive application cephfs
	removed_snaps [2~3d,41~2a,6d~2a,99~c,a6~1e,c6~18,df~3,e3~1,e5~3,e9~1,eb~3,ef~1,f1~1,f3~1,f5~3,f9~1,fb~3,ff~1,101~1,103~1,105~1,107~1,109~1,10b~1,10d~1,10f~1,111~1]

The relevant pools are con-fs-meta and con-fs-data.
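For reference, the pool assignment can be confirmed with the fs status command:

[root@ceph-01 ~]# ceph fs status   # shows con-fs-meta as the metadata pool and con-fs-data as the data pool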

Best regards,
Frank

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[root@ceph-08 ~]# cat /etc/tuned/ceph/tuned.conf 
[main]
summary=Settings for ceph cluster. Derived from throughput-performance.
include=throughput-performance

[vm]
transparent_hugepages=never

[sysctl]
# See also:
# - https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
# - https://www.kernel.org/doc/Documentation/sysctl/net.txt
# - https://cromwell-intl.com/open-source/performance-tuning/tcp.html
# - https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/
# - https://www.spinics.net/lists/ceph-devel/msg21721.html

# Set available PIDs and open files to maximum possible.
kernel.pid_max=4194304
fs.file-max=26234859

# Swap options, reduce swappiness.
vm.zone_reclaim_mode=0
#vm.dirty_ratio = 20
vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800
vm.swappiness=10
vm.min_free_kbytes=8388608

# Increase ARP cache size to accommodate large level-2 client network.
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
# net.ipv4.neigh.default.gc_interval = 3600
# net.ipv4.neigh.default.gc_stale_time = 3600

# Increase autotuning TCP buffer limits
# 10G fiber/64MB buffers (67108864)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 22500 218450 67108864
net.ipv4.tcp_wmem = 22500 81920 67108864

## Increase the number of incoming connections. The value can be raised to handle bursts of requests; default is 128
net.core.somaxconn = 2048

## Increase the incoming connection backlog, default is 1000
net.core.netdev_max_backlog = 50000

## Maximum number of remembered connection requests, default is 128
net.ipv4.tcp_max_syn_backlog = 30000

## Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks, default is 8192
# Not needed in isolated network. Default (131072) should be fine.
#net.ipv4.tcp_max_tw_buckets = 2000000

# Recycle and Reuse TIME_WAIT sockets faster, default is 0 for both
# Does not exist: net.ipv4.tcp_tw_recycle = 1
# net.ipv4.tcp_tw_reuse = 1

## Decrease the FIN timeout, default is 60 seconds
net.ipv4.tcp_fin_timeout = 10

## Tells the system whether it should start at the default window size only for TCP connections
## that have been idle for too long, default is 1
net.ipv4.tcp_slow_start_after_idle = 0

## If your servers talk UDP, also raise these limits, default is 4096
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

## Disable source redirects
## Default is 1
# net.ipv4.conf.all.send_redirects = 0
# net.ipv4.conf.all.accept_redirects = 0

## Disable source routing, default is 0
# net.ipv4.conf.all.accept_source_route = 0

####################################################################
## Default values collected on different servers prior to tuning. ##
####################################################################

# MON hosts:
#vm.min_free_kbytes = 90112
#net.ipv4.tcp_mem = 381111	508151	762222
#net.ipv4.tcp_rmem = 4096	87380	6291456
#net.ipv4.tcp_wmem = 4096	16384	4194304

# OSD hosts:
#vm.min_free_kbytes = 90112
#net.ipv4.tcp_mem = 767376	1023171	1534752
#net.ipv4.tcp_rmem = 4096	87380	6291456
#net.ipv4.tcp_wmem = 4096	16384	4194304

# MDS hosts:
#net.ipv4.tcp_mem = 1539954	2053272	3079908
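The profile above is activated on all nodes with tuned-adm in the usual way:

[root@ceph-08 ~]# tuned-adm profile ceph
[root@ceph-08 ~]# tuned-adm active   # should report: Current active profile: ceph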