Re: clients failing to advance oldest client/flush tid

On Mon, Oct 9, 2017 at 5:52 PM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
> Hi John,
>
> Many thanks for getting back to me.
>
> Yes, I did see the "experimental" label on snapshots...
>
> After reading other posts, I got the impression that cephfs snapshots
> might be OK, provided you used a single active MDS and the latest
> ceph-fuse client, both of which we have.
>
> Anyhow, as you predicted, the flush errors led to our MDS server
> crashing, and crashing badly; the MDS now refuses to restart, logging
> journal replay errors like this:
>
> /root/ceph/ceph-12.2.1/src/mds/journal.cc: In function 'virtual void
> EOpen::replay(MDSRank*)' thread 7f7950d4a700 time 2017-10-09 17:14:54.115094
> /root/ceph/ceph-12.2.1/src/mds/journal.cc: 2214: FAILED assert(in)
>
> Thankfully this cluster is only used to mirror scratch data, so nothing
> of great value has been lost. I can just wipe everything... :)
>
> However, given the fantastic performance we were getting, and the
> economy of erasure coding, *if* cephfs snapshots were bullet-proof, I
> could easily see ourselves, and other places, using ceph for large
> data sets.

I think we might be slightly jumping the gun here -- it's not
necessarily the case that the issue you're seeing was snapshot-related,
unless you were able to do the suggested testing to see whether a
non-snapshotting workload hits the same problems.
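
For what it's worth, a rough way to run that test (the mount point,
monitor address and MDS id below are just placeholders) would be
something like:

  # mount a fresh ceph-fuse client somewhere separate
  ceph-fuse -m <mon-host>:6789 /mnt/cephfs-test
  # drive a metadata-heavy workload without taking any snapshots
  rsync -a /some/scratch/tree/ /mnt/cephfs-test/no-snap-test/
  # then check whether that session's completed requests keep growing
  ceph tell mds.<id> session ls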

> This is frustratingly close to perfect, so can I ask if reliable
> snapshots are very far away? Will fixes be patched into Luminous, or
> (sorry) do we wait for Mimic?

Historically we haven't backported fixes for experimental things, but
I'd leave it to the discretion of people working on cephfs right now
-- there's no rule against it.

Otherwise, yes, I'd expect to wait for Mimic.  However, the divergence
between master and Luminous is quite small at present, so any issues
you can pin down in Luminous will help move us towards the stability
you'd like to see in Mimic.

> In the meantime, can anything else be done to reduce the failure rate?

Again, I'm not sure we've actually attributed the issue to anything in
particular yet -- but it should be possible to work it out if you can
reproduce it.
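
If you do get a reproducer, bumping up logging on the MDS and on one
client is a reasonable starting point -- roughly as follows (the log
levels and admin socket path are illustrative, and the socket name
varies per mount):

  # on the active MDS
  ceph daemon mds.<id> config set debug_mds 10
  # on one ceph-fuse client, via its admin socket
  ceph daemon /var/run/ceph/ceph-client.admin.<pid>.asok config set debug_client 10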

Cheers,
John

> i.e. would it be significantly safer to make a single daily snapshot,
> and only keep 7 of these?
>
> Does snapshot reliability decrease if there is a large delta in the
> number of files, or a large amount of data, in each snapshot?
>
> any other tricks that you can suggest are most welcome...
>
> again, many thanks for your time,
>
> Jake
>
>
> On 09/10/17 16:37, John Spray wrote:
>> On Mon, Oct 9, 2017 at 9:21 AM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>>> Dear All,
>>>
>>> We have a new cluster based on v12.2.1
>>>
>>> After three days of copying 300TB of data into cephfs,
>>> we have started getting the following Health errors:
>>>
>>> # ceph health
>>> HEALTH_WARN 9 clients failing to advance oldest client/flush tid;
>>> 1 MDSs report slow requests; 1 MDSs behind on trimming
>>>
>>> ceph-mds.ceph1.log shows entries like:
>>>
>>> 2017-10-09 08:42:30.935955 7feeaf263700  0 log_channel(cluster) log
>>> [WRN] : client.5023 does not advance its oldest_client_tid (5760998),
>>> 100000 completed requests recorded in session
>>
>> This is something to be quite wary of -- because the client is not
>> advancing its oldest_client_tid, the MDS is unable to drop its
>> in-memory record of the client's completed requests, so it will
>> consume an increasing amount of memory over time and try to write
>> ever-larger sessions to disk.  Eventually, the MDS will become
>> unable to write its session table, which is a pretty bad position to
>> be in.
>>
>> If it was my cluster, I would be inclined to schedule a nightly
>> unmount/remount of the client, to keep the system safe while you're
>> investigating the issue.
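>>
>> As a rough sketch (the mount point and timing below are only
>> examples, and it assumes nothing is actively using the mount at that
>> hour), a root cron entry along these lines would do it:
>>
>>   0 3 * * * fusermount -u /mnt/cephfs && ceph-fuse /mnt/cephfs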
>>
>>
>>
>>> Performance has been very good; parallel rsync was running at
>>> 1.1-2 GB/s, allowing us to copy 300TB of data in 72 hours.
>>>
>>> [root@ceph1 ceph]# ceph df
>>> GLOBAL:
>>>     SIZE     AVAIL     RAW USED     %RAW USED
>>>     730T      330T         400T         54.80
>>> POOLS:
>>>     NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
>>>     ecpool       1      316T     62.24          153T     89269703
>>>     mds_nvme     2      188G      8.18          706G       368806
>>>
>>>
>>> The cluster has 10 nodes, each with 10x 8TB drives.
>>> We are using EC8+2, no upper tier, i.e. allow_ec_overwrites true.
>>> Four nodes have nvme drives, used for 3x replicated MDS metadata.
>>>
>>> We have a single MDS server; we snapshot cephfs every 10 minutes,
>>> then delete all snapshots older than 24 hours, apart from the
>>> midnight snapshots.
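>>>
>>> For reference, these are plain cephfs snapshots, i.e. directories
>>> created and removed under the filesystem's .snap directory. Roughly
>>> (the mount point and naming scheme here are only illustrative):
>>>
>>>   mkdir /cephfs/.snap/$(date +%Y%m%d-%H%M)    # take a snapshot
>>>   rmdir /cephfs/.snap/<old-snapshot-name>     # drop an old snapshot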
>>
>> The use of snapshots would be where I'd start investigating: if you
>> stop making snapshots, and mount a fresh client, does that client
>> still have the issue when it does a bunch of requests?  You can check
>> how the client is doing with the "ceph tell mds.<id> session ls"
>> output: if the "completed requests" value keeps going up indefinitely,
>> you're having the buggy behaviour.
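>>
>> For example (the MDS id is a placeholder):
>>
>>   ceph tell mds.<id> session ls
>>
>> then watch the completed-requests figure for the new client's session
>> over a few minutes of load: stable or bounded is fine, a value that
>> only ever grows is the bug.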
>>
>> (Hopefully you got the message about snapshots being experimental when
>> you enabled the feature.)
>>
>>> We use the ceph-fuse client on all OSD nodes; the parallel rsync is
>>> run directly on them.  Hardware consists of dual Xeon E5-2620v4 with
>>> 64GB RAM and 10Gb ethernet; the OS is SL 7.4.
>>
>> Just to check, the ceph-fuse packages on the clients are also 12.2.1?
>>
>> John
>>
>>>
>>> Any ideas?
>>>
>>> thanks,
>>>
>>> Jake
>>>
>>> --
>>>  Jake Grimmett
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


