Dear All,

We have a new cluster based on v12.2.1.

After three days of copying 300TB of data into cephfs, we have started
getting the following health errors:

# ceph health
HEALTH_WARN 9 clients failing to advance oldest client/flush tid;
1 MDSs report slow requests; 1 MDSs behind on trimming

ceph-mds.ceph1.log shows entries like:

2017-10-09 08:42:30.935955 7feeaf263700 0 log_channel(cluster) log [WRN] :
client.5023 does not advance its oldest_client_tid (5760998), 100000
completed requests recorded in session

Performance has been very good; the parallel rsync was running at
1.1 to 2 GB/s, allowing us to copy the 300TB of data in 72 hours.

[root@ceph1 ceph]# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    730T      330T         400T         54.80
POOLS:
    NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
    ecpool        1     316T     62.24          153T     89269703
    mds_nvme      2     188G      8.18          706G       368806

The cluster has 10 nodes, each with 10x 8TB drives. We are using EC 8+2
with no upper tier, i.e. allow_ec_overwrites is set to true. Four nodes
have NVMe drives, used for the 3x replicated MDS metadata pool.

We have a single MDS server. We snapshot cephfs every 10 minutes, then
delete all snapshots older than 24 hours, apart from the midnight
snapshots.

We use the ceph-fuse client on all OSD nodes; the parallel rsync runs
directly on them.

Hardware is dual Xeon E5-2620 v4 per node, with 64GB RAM and 10Gb
Ethernet; the OS is SL 7.4.

Any ideas?

thanks,

Jake

--
Jake Grimmett
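
P.S. In case it is relevant, the snapshot rotation is just a small cron
job. A rough sketch of that kind of script is below; the mount point
(/cephfs) and the YYYYmmdd-HHMM snapshot naming are illustrative
assumptions rather than our exact production setup.

#!/bin/bash
# Take a CephFS snapshot (run from cron every 10 minutes) and prune
# snapshots older than 24 hours, keeping the midnight ones.
# NOTE: /cephfs and the YYYYmmdd-HHMM naming are assumptions for this sketch.

FSROOT=/cephfs                    # ceph-fuse mount point (assumed)
SNAPDIR="$FSROOT/.snap"
NOW=$(date +%Y%m%d-%H%M)

# CephFS creates a snapshot when a directory is created inside .snap
mkdir "$SNAPDIR/$NOW"

# Prune snapshots older than 24 hours, except the midnight (HHMM=0000) ones
cutoff=$(( $(date +%s) - 86400 ))
for snap in "$SNAPDIR"/*; do
    name=$(basename "$snap")
    t=${name#*-}                                   # HHMM part of the name
    [ "$t" = "0000" ] && continue                  # keep midnight snapshots
    when=$(date -d "${name%-*} ${t:0:2}:${t:2:2}" +%s 2>/dev/null) || continue
    if [ "$when" -lt "$cutoff" ]; then
        rmdir "$snap"             # removing the directory deletes the snapshot
    fi
done

The convenient part is that CephFS snapshots are created and removed
simply with mkdir/rmdir inside .snap, so the script needs no extra
tooling.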