Permanent MDS restarting under load

Hello.

We have CephFS deployed on a Ceph cluster (0.94.5).

The MDS restarts constantly under a high-IOPS workload (e.g. rsyncing lots of small mailboxes from another storage system to CephFS using the ceph-fuse client). First, cluster health goes to HEALTH_WARN with the following message:

===
mds0: Behind on trimming (321/30)
===
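
For monitoring, this warning can be parsed to see how far behind the MDS is. The following is only a minimal Python sketch: the function name is hypothetical, and it assumes the two numbers are the current journal segment count and the configured limit.

```python
import re

# Matches the health detail shown above, e.g. "mds0: Behind on trimming (321/30)".
_TRIM_RE = re.compile(r"(mds\S*): Behind on trimming \((\d+)/(\d+)\)")


def parse_trim_warning(line):
    """Return (mds_name, current, limit, ratio), or None if the line does not match.

    Assumption: the first number is the current segment count and the
    second is the configured maximum.
    """
    m = _TRIM_RE.search(line)
    if m is None:
        return None
    current, limit = int(m.group(2)), int(m.group(3))
    ratio = current / limit if limit else float("inf")
    return m.group(1), current, limit, ratio


if __name__ == "__main__":
    print(parse_trim_warning("mds0: Behind on trimming (321/30)"))
```

With the warning above, the ratio comes out at about 10.7, i.e. roughly ten times over the limit.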

Also, slow requests start to appear:

===
2 requests are blocked > 32 sec
===

Then, after a while, one of the MDSes fails with the following log:

===
Nov 10 16:07:41 baikal bash[10122]: 2015-11-10 16:07:41.915540 7f2484f13700 -1 MDSIOContextBase: blacklisted! Restarting...
Nov 10 16:07:41 baikal bash[10122]: starting mds.baikal at :/0
Nov 10 16:07:42 baikal bash[10122]: 2015-11-10 16:07:42.003189 7f82b477e7c0 -1 mds.-1.0 log_to_monitors {default=true}
===

My guess is that writing lots of small files bloats the MDS log faster than the MDS can trim it. Presumably that is why the MDS is marked as failed and replaced by the standby MDS. We tried setting mds_log_max_events to 30, but that caused the MDS to fail very quickly with the following stack trace:

===
Stacktrace: https://gist.github.com/4c8a89682e81b0049f3e
===

Is this behaviour normal, or is there a way to rate-limit client requests? Are there perhaps additional knobs for tuning CephFS to handle such a workload?
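
To make clear what I mean by rate-limiting: lacking a server-side option, the copy job itself could be paced. Below is a rough Python sketch of a token-bucket limiter. It is not a Ceph feature; the class name and parameters are made up for illustration, and it throttles only the single client that uses it.

```python
import time


class TokenBucket:
    """Client-side limiter: allows at most `rate` operations per second on average."""

    def __init__(self, rate, burst, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum number of stored tokens
        self.tokens = float(burst)   # start with a full bucket
        self.clock = clock           # injectable for testing
        self.sleep = sleep           # injectable for testing
        self.last = clock()

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            now = self.clock()
            # Refill according to the time elapsed since the last check.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Sleep just long enough for the missing fraction of a token.
            self.sleep((1.0 - self.tokens) / self.rate)
```

A copy loop would call acquire() before creating each file, so that the metadata operations it sends to the MDS stay under the chosen rate.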

Cluster details follow.

CentOS 7.1, Ceph 0.94.5.

Cluster maps:

===
     osdmap e5894: 20 osds: 20 up, 20 in
      pgmap v8959901: 1024 pgs, 12 pools, 5156 GB data, 23074 kobjects
            20101 GB used, 30468 GB / 50570 GB avail
                1024 active+clean
===

CephFS list:

===
name: myfs, metadata pool: mds_meta_storage, data pools: [mds_xattrs_storage fs_samba fs_pbx fs_misc fs_web fs_mail fs_ott ]
===

The metadata pool (mds_meta_storage) and the first data pool (mds_xattrs_storage) are both located on PCI-E SSDs:

===
 -9  0.44800 root pcie-ssd
 -7  0.22400     host data-pcie-ssd
  7  0.22400         osd.7                  up  1.00000          1.00000
 -8  0.22400     host baikal-pcie-ssd
  6  0.22400         osd.6                  up  1.00000          1.00000

pool 20 'mds_meta_storage' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 4333 flags hashpspool stripe_width 0
pool 21 'mds_xattrs_storage' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 4337 flags hashpspool crash_replay_interval 45 stripe_width 0

mds_meta_storage       20     37422k         0          169G       234714
mds_xattrs_storage     21          0         0          169G     11271588

rule pcie-ssd {
        ruleset 2
        type replicated
        min_size 1
        max_size 2
        step take pcie-ssd
        step chooseleaf firstn 0 type host
        step emit
}
===

There is 1 active MDS and 1 standby MDS:

===
mdsmap e9035: 1/1/1 up {0=data=up:active}, 1 up:standby
===

We also have 10 OSDs on HDDs for the additional data pools:

===
 -6 37.00000 root sata-hdd-misc
 -4 18.50000     host data-sata-hdd-misc
  1  3.70000         osd.1                  up  1.00000          1.00000
  3  3.70000         osd.3                  up  1.00000          1.00000
  4  3.70000         osd.4                  up  1.00000          1.00000
  5  3.70000         osd.5                  up  1.00000          1.00000
 10  3.70000         osd.10                 up  1.00000          1.00000
 -5 18.50000     host baikal-sata-hdd-misc
  0  3.70000         osd.0                  up  1.00000          1.00000
 11  3.70000         osd.11                 up  1.00000          1.00000
 12  3.70000         osd.12                 up  1.00000          1.00000
 13  3.70000         osd.13                 up  1.00000          1.00000
 14  3.70000         osd.14                 up  1.00000          1.00000

fs_samba               22      2162G      4.28         3814G      1168619
fs_pbx                 23      1551G      3.07         3814G      3908813
fs_misc                24       436G      0.86         3814G       112114
fs_web                 25     58642M      0.11         3814G       378946
fs_mail                26       442G      0.88         3814G      6414073
fs_ott                 27          0         0         3814G            0

rule sata-hdd-misc {
        ruleset 4
        type replicated
        min_size 1
        max_size 4
        step take sata-hdd-misc
        step choose firstn 2 type host
        step chooseleaf firstn 2 type osd
        step emit
}
===

Pool affinity for CephFS directories is set via setfattr. For example:

===
# file: mail
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=fs_mail"
===
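
For scripted checks of which pool a directory maps to, the layout string above can be split into fields. This is just a small Python sketch with a hypothetical helper name; it assumes the value is a space-separated list of key=value pairs, as in the example.

```python
def parse_dir_layout(value):
    """Split a ceph.dir.layout string into a dict; numeric fields become ints."""
    layout = {}
    for field in value.split():
        key, _, val = field.partition("=")
        layout[key] = int(val) if val.isdigit() else val
    return layout


if __name__ == "__main__":
    sample = "stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=fs_mail"
    print(parse_dir_layout(sample)["pool"])
```

On a mounted CephFS the string itself could presumably be read with os.getxattr(path, "ceph.dir.layout") and decoded from bytes before parsing; I have not included that here because it only works on a directory with an explicit layout.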