On Mon, Oct 9, 2017 at 9:21 AM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
> Dear All,
>
> We have a new cluster based on v12.2.1.
>
> After three days of copying 300TB of data into cephfs,
> we have started getting the following health errors:
>
> # ceph health
> HEALTH_WARN 9 clients failing to advance oldest client/flush tid;
> 1 MDSs report slow requests; 1 MDSs behind on trimming
>
> ceph-mds.ceph1.log shows entries like:
>
> 2017-10-09 08:42:30.935955 7feeaf263700  0 log_channel(cluster) log
> [WRN] : client.5023 does not advance its oldest_client_tid (5760998),
> 100000 completed requests recorded in session

This is something to be quite wary of: because the client is not
advancing its oldest tid to acknowledge its completed requests, the MDS
is unable to drop its in-memory record of those requests, so it will
consume an increasing amount of memory over time and write ever-larger
session tables to disk. Eventually the MDS will become unable to write
its session table at all, which is a pretty bad position to be in.

If it were my cluster, I would be inclined to schedule a nightly
unmount/remount of the clients, to keep the system safe while you
investigate the issue.

> Performance has been very good; parallel rsync was running at 1.1 -
> 2GB/s, allowing us to copy 300TB of data in 72 hours.
>
> [root@ceph1 ceph]# ceph df
> GLOBAL:
>     SIZE     AVAIL     RAW USED     %RAW USED
>     730T     330T      400T         54.80
> POOLS:
>     NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
>     ecpool       1      316T     62.24     153T          89269703
>     mds_nvme     2      188G     8.18      706G          368806
>
> The cluster has 10 nodes, each with 10x 8TB drives.
> We are using EC 8+2, no upper tier, i.e. allow_ec_overwrites true.
> Four nodes have NVMe drives, used for 3x replicated MDS metadata.
>
> We have a single MDS server, snapshot cephfs every 10 minutes, then
> delete all snapshots older than 24 hours, apart from midnight snapshots.

The use of snapshots is where I'd start investigating: if you stop
making snapshots and mount a fresh client, does that client still show
the issue after it has done a bunch of requests?

You can check how a client is doing from the "ceph tell mds.<id>
session ls" output: if its "completed requests" value keeps going up
indefinitely, you're hitting the buggy behaviour. (Hopefully you got
the message about snapshots being experimental when you enabled the
feature.)

> We use the ceph-fuse client on all OSD nodes. The parallel rsync is run
> directly on them. Hardware consists of dual Xeon E5-2620 v4, 64GB RAM
> and 10Gb ethernet; the OS is SL 7.4.

Just to check, are the ceph-fuse packages on the clients also 12.2.1?

John

> Any ideas?
>
> thanks,
>
> Jake
>
> --
> Jake Grimmett

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
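
A minimal sketch of the check John describes above: poll "ceph tell
mds.<id> session ls" and watch each client's completed-requests counter
over time. The MDS id "0", the --format=json flag and the per-session
field names used here are assumptions rather than details from the
thread, so treat this as illustrative only and adjust it to the actual
cluster.

    #!/usr/bin/env python
    # Minimal sketch only: poll the MDS session list and print each
    # client's completed-requests counter, per the suggestion above.
    # Assumptions: the active MDS is "mds.0", the command emits JSON
    # when asked via --format=json, and each session entry carries a
    # "num_completed_requests" or "completed_requests" field (field
    # names can differ between releases).
    import json
    import subprocess
    import time

    MDS_ID = "0"   # assumed MDS name/rank; change to match your cluster

    def completed_requests():
        out = subprocess.check_output(
            ["ceph", "tell", "mds." + MDS_ID, "session", "ls",
             "--format=json"])
        counts = {}
        for session in json.loads(out):
            counts[session.get("id")] = session.get(
                "num_completed_requests",
                session.get("completed_requests"))
        return counts

    # Print the counters once a minute; a client whose value only ever
    # grows is one failing to advance its oldest_client_tid.
    while True:
        print(time.strftime("%H:%M:%S"), completed_requests())
        time.sleep(60)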