1 MDSs behind on trimming (was Re: clients failing to advance oldest client/flush tid)

John Spray <jspray@xxxxxxxxxx> · Tue, 10 Oct 2017 10:39:29 +0100

On Tue, Oct 10, 2017 at 3:48 AM, Nigel Williams
<nigel.williams@xxxxxxxxxxx> wrote:
> On 9 October 2017 at 19:21, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>> HEALTH_WARN 9 clients failing to advance oldest client/flush tid;
>> 1 MDSs report slow requests; 1 MDSs behind on trimming

(This is the less worrying of the original thread's messages, so I've
edited the subject line)

> On a proof-of-concept 12.2.1 cluster (few random files added, 30 OSDs,
> default Ceph settings) I can get the above error by doing this from a
> client:
>
> bonnie++ -s 0 -n 1000 -u 0
>
> This makes 1 million files in a single directory (we wanted to see
> what might break).
>
> This takes a few hours to run but seems to finish without incident.
> Over that time we get this in the logs:

We do sometimes see this in systems that have the metadata pool either
on kinda-slow drives, or on drives that are shared with a very busy
data pool.  If either is the case (i.e. if your OSDs are very busy)
then the warning is probably nothing to worry about (it does make me
wonder if we should make the default journal length longer though).

You can make the system more tolerant of slow metadata writeback by
adjusting mds_log_max_segments upwards (for example, doubling from the
default 30 is not a big deal).
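
As a rough sketch of that doubling, assuming you set it in the [mds]
section of ceph.conf on the MDS hosts (60 is just twice the default 30,
not a tuned recommendation):

```ini
[mds]
# Assumed example value: double the default of 30 journal segments,
# so the MDS tolerates slower metadata writeback before warning.
mds log max segments = 60
```

The MDS only reads the file at startup, so this needs a restart to take
effect. On a running daemon the same setting can be changed through the
admin socket, e.g. "ceph daemon mds.<name> config set
mds_log_max_segments 60", where <name> is a placeholder for your MDS's
name.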

If your OSDs are *not* very busy, and you're still seeing this
warning, then you are hitting a bug and it's worth investigating.

John

>
> root@c0mon-101:/var/log/ceph# zcat ceph-mon.c0mon-101.log.6.gz|fgrep MDS_TRIM
> 2017-10-04 11:14:18.489943 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 11:14:22.523117 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on
> trimming)
> 2017-10-04 11:14:26.589797 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 11:14:34.614567 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on
> trimming)
> 2017-10-04 20:38:22.812032 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 20:41:14.700521 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on
> trimming)
> root@c0mon-101:/var/log/ceph#
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com