Re: clients failing to advance oldest client/flush tid

Hi John,

Many thanks for getting back to me.

Yes, I did see the "experimental" label on snapshots...

After reading other posts, I got the impression that cephfs snapshots
might be OK, provided you used a single active MDS and the latest
ceph-fuse client, both of which we have.

Anyhow, as you predicted, the flush errors led to our MDS server
crashing, and crashing badly; the MDS now refuses to restart, giving
journal replay errors in its log like this:

/root/ceph/ceph-12.2.1/src/mds/journal.cc: In function 'virtual void
EOpen::replay(MDSRank*)' thread 7f7950d4a700 time 2017-10-09 17:14:54.115094
/root/ceph/ceph-12.2.1/src/mds/journal.cc: 2214: FAILED assert(in)

Thankfully this cluster is only used to mirror scratch data, so nothing
of great value has been lost. I can just wipe everything... :)

However, given the fantastic performance we were getting, and the
economy of erasure coding, *if* cephfs snapshots were bullet-proof I
could easily see ourselves, and other sites, using ceph for large data sets.

This is frustratingly close to perfect, so can I ask how far away
reliable snapshots are? Is this something a bug-fix release of Luminous
could address, or do we wait for Mimic?

In the meantime, can anything else be done to reduce the failure rate?

i.e. would it be significantly safer to make a single daily snapshot,
and only keep 7 of these?
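
For reference, the sort of daily rotation I have in mind is below; the
mount point and snapshot names are just placeholders for illustration:

# run once a day from cron on a single client
SNAPDIR=/cephfs/.snap                      # example ceph-fuse mount point
mkdir "$SNAPDIR/daily-$(date +%F)"         # taking a snapshot = mkdir in .snap
# keep only the 7 newest daily snapshots
ls -1d "$SNAPDIR"/daily-* | sort | head -n -7 | while read old; do
    rmdir "$old"                           # removing a snapshot = rmdir in .snap
done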

Does snapshot reliability decrease if there is a large delta in the
number of files, or a large amount of data, in each snapshot?

Any other tricks you can suggest are most welcome...
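
For now I will follow your earlier advice and cron a nightly ceph-fuse
remount on each client; something along these lines, where /cephfs is
just our example mount point:

# nightly cron job on each client node, until the underlying bug is fixed
# (the umount will simply fail if an rsync still has the mount busy)
umount /cephfs && ceph-fuse /cephfs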

again, many thanks for your time,

Jake


On 09/10/17 16:37, John Spray wrote:
> On Mon, Oct 9, 2017 at 9:21 AM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>> Dear All,
>>
>> We have a new cluster based on v12.2.1
>>
>> After three days of copying 300TB data into cephfs,
>> we have started getting the following Health errors:
>>
>> # ceph health
>> HEALTH_WARN 9 clients failing to advance oldest client/flush tid;
>> 1 MDSs report slow requests; 1 MDSs behind on trimming
>>
>> ceph-mds.ceph1.log shows entries like:
>>
>> 2017-10-09 08:42:30.935955 7feeaf263700  0 log_channel(cluster) log
>> [WRN] : client.5023 does not advance its oldest_client_tid (5760998),
>> 100000 completed requests recorded in session
> 
> This is something to be quite wary of -- because something is going
> wrong with the client completing its requests, the MDS is unable to
> drop its in-memory record of the client requests, and it will be
> consuming an increasing amount of memory over time, and trying to
> write ever-larger sessions to disk.  Eventually, the MDS will become
> unable to write its session table, which is a pretty bad position to
> be in.
> 
> If it was my cluster, I would be inclined to schedule a nightly
> unmount/remount of the client, to keep the system safe while you're
> investigating the issue.
> 
> 
> 
>> Performance has been very good; parallel rsync was running at 1.1 to
>> 2GB/s, allowing us to copy 300TB of data in 72 hours.
>>
>> [root@ceph1 ceph]# ceph df
>> GLOBAL:
>>     SIZE     AVAIL     RAW USED     %RAW USED
>>     730T      330T         400T         54.80
>> POOLS:
>>     NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
>>     ecpool       1      316T     62.24          153T     89269703
>>     mds_nvme     2      188G      8.18          706G       368806
>>
>>
>> The cluster has 10 nodes, each with 10x 8TB drives.
>> We are using EC8+2, no upper tier, i.e. allow_ec_overwrites true.
>> Four nodes have nvme drives, used for 3x replicated MDS metadata.
>>
>> We have a single MDS server, snapshot cephfs every 10 minutes, then
>> delete all snapshots older than 24 hours, apart from midnight snapshots.
> 
> The use of snapshots would be where I'd start investigating: if you
> stop making snapshots, and mount a fresh client, does that client
> still have the issue when it does a bunch of requests?  You can check
> how the client is doing with the "ceph tell mds.<id> session ls"
> output: if the "completed requests" value keeps going up indefinitely,
> you're having the buggy behaviour.
> 
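For anyone else following this thread, something like the loop below
should show whether that counter keeps climbing. I am not certain of the
exact field name in the 12.2.1 JSON output, so I just grep for it:

# watch the per-session "completed requests" counters on the active MDS
# (mds id "ceph1" is ours; adjust to your cluster)
while true; do
    ceph tell mds.ceph1 session ls | python -m json.tool | grep -i completed
    sleep 60
done
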
> (Hopefully you got the message about snapshots being experimental when
> you enabled the feature.)
> 
>> We use ceph-fuse client on all OSD nodes. The parallel rsync is run
>> directly on them.  Hardware consists of dual Xeon E5-2620v4, with 64GB
>> ram, 10Gb eth, OS is SL 7.4.
> 
> Just to check, the ceph-fuse packages on the clients are also 12.2.1?
> 
> John
> 
>>
>> Any ideas?
>>
>> thanks,
>>
>> Jake
>>
>> --
>>  Jake Grimmett
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


