Re: CephFS log jam prevention

> You can try doubling (several times if necessary) the MDS configs
> `mds_log_max_segments` and `mds_log_max_expiring` to make it more
> aggressively trim its journal. (That may not help since your OSD
> requests are slow.)


This may be obvious, but where does this mds_log actually live, and what are the likely bottlenecks that let it get this far behind?

Is it stored on the OSDs, i.e. in the metadata pool (which I had previously moved onto SSDs rather than leaving it colocated on the HDDs that the data pool lives on)? Just curious what the bottleneck would be for the MDS trying to trim this log.
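For context, here is roughly what I have been poking at to watch the journal (a sketch from memory, so forgive any syntax slips; "mdsdb" comes from the health output quoted below, while "cephfs_metadata" is just a placeholder for whatever the metadata pool is actually named):

    # journal/segment counters straight from the active MDS admin socket
    ceph daemon mds.mdsdb perf dump mds_log

    # the trim-related settings currently in effect on that MDS
    ceph daemon mds.mdsdb config get mds_log_max_segments
    ceph daemon mds.mdsdb config get mds_log_max_expiring

    # the journal itself is stored as objects in the metadata pool
    # (rank 0's journal objects are, as I understand it, named 200.*)
    rados -p cephfs_metadata ls | grep '^200\.' | head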

I see the MDS process on the active MDS using a fair amount of CPU, but no real disk traffic coming from the MDS side.

So I'm just trying to work out what is limiting trimming enough to leave it this far behind (I already increased both mds_log_max_segments and mds_log_max_expiring by 4x each).
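Concretely, that bump looked roughly like this (a sketch; 120/80 are simply 4x what I understand the defaults to be, not tuned values):

    # runtime change on the active MDS
    ceph tell mds.mdsdb injectargs '--mds_log_max_segments=120 --mds_log_max_expiring=80'

    # and persisted in ceph.conf under [mds] so a restart/failover keeps it:
    #   mds_log_max_segments = 120
    #   mds_log_max_expiring = 80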

Thanks,

Reed

> On Dec 5, 2017, at 4:02 PM, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> 
> On Tue, Dec 5, 2017 at 8:07 AM, Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
>> Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD
>> backed CephFS pool.
>> 
>> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running
>> mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and
>> clients.
> 
> You should try a newer kernel client if possible since the MDS is
> having trouble trimming its cache.
> 
>> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing
>> to respond to cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub
>> flag(s) set; application not enabled on 1 pool(s); 242 slow requests are
>> blocked > 32 sec; 769378 stuck requests are blocked > 4096 sec
>> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
>>    mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by
>> clients, 1 stray files
>> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
>> pressure
>>    mdsdb(mds.0): Many clients (37) failing to respond to cache pressure
>> client_count: 37
>> MDS_TRIM 1 MDSs behind on trimming
>>    mdsdb(mds.0): Behind on trimming (36252/30) max_segments: 30,
>> num_segments: 36252
> 
> See also: http://tracker.ceph.com/issues/21975
> 
> You can try doubling (several times if necessary) the MDS configs
> `mds_log_max_segments` and `mds_log_max_expiring` to make it more
> aggressively trim its journal. (That may not help since your OSD
> requests are slow.)
> 
> -- 
> Patrick Donnelly

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


