Re: Cephfs MDS slow requests

Deepak Naidu <dnaidu@xxxxxxxxxx> · Thu, 15 Mar 2018 21:07:32 +0000

David, a few inputs based on my working experience with CephFS. They might or might not be relevant to the current issue in your cluster.

Create the metadata pool on NVMe. Folks may claim it is not needed, but I have seen worse performance with it on HDD, even though the metadata size is very small.
In CephFS, ensure the MDS node has enough RAM allocated for the MDS cache (this will not improve performance drastically, but it helps to some extent). On a side note, the MDS has some bugs related to oversubscribed memory usage, regardless of the cache settings, if you have more than 64GB RAM. Take a look:
http://tracker.ceph.com/issues/21402
http://tracker.ceph.com/issues/22599
https://bugzilla.redhat.com/show_bug.cgi?id=1531679

CephFS is not great for small files (in KBs) but works great with large files (MBs or GBs). So a filer-like (NFS/SMB) use case needs administrative attention.
The next thing to check is whether you have a large number of inodes/files in CephFS. Ensure tunables such as dirfrag and active/active MDS are implemented on the Luminous version you use on filestore, or ask users not to store multiple millions of small files in one directory (it's a debatable scenario; I'm not sure how much control you have over your customer's use case).
Always use kernel mounts. ceph-fuse is super slow (3-5 times slower than kernel mounts); you may already know this.
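
For the metadata pool suggestion above, one way to do it on a release with CRUSH device class support is a class-restricted replicated rule. Below is a rough sketch only. It assumes the metadata pool is named cephfs_metadata, the NVMe OSDs carry the device class nvme, and the CRUSH root is default; check `ceph fs ls` and `ceph osd crush class ls` on your own cluster. The rule name meta-nvme and the helper names are made up for illustration. It only prints the commands unless dry_run is switched off, because changing a pool's CRUSH rule moves data.

```python
import subprocess


def metadata_pool_commands(pool="cephfs_metadata", rule="meta-nvme",
                           device_class="nvme", root="default",
                           failure_domain="host"):
    """Build the ceph CLI calls that pin a pool to one device class.

    All default values here are assumptions, not taken from the thread.
    """
    return [
        # Replicated rule that only selects OSDs of the given device class.
        ["ceph", "osd", "crush", "rule", "create-replicated",
         rule, root, failure_domain, device_class],
        # Point the pool at the new rule (this triggers data movement).
        ["ceph", "osd", "pool", "set", pool, "crush_rule", rule],
    ]


def apply(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


apply(metadata_pool_commands())  # dry run: prints the two commands
```

Review the printed commands first; call apply(..., dry_run=False) only once the pool name and device class match what the cluster reports.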

--
Deepak 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David C
Sent: Wednesday, March 14, 2018 10:46 AM
To: John Spray <jspray@xxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Cephfs MDS slow requests

Thanks, John. I'm pretty sure the root of my slow OSD issues is filestore subfolder splitting.

On Wed, Mar 14, 2018 at 2:17 PM, John Spray <jspray@xxxxxxxxxx> wrote:

On Tue, Mar 13, 2018 at 7:17 PM, David C <dcsysengineer@xxxxxxxxx> wrote:

> Hi All

>

> I have a Samba server that is exporting directories from a Cephfs Kernel

> mount. Performance has been pretty good for the last year but users have

> recently been complaining of short "freezes". These seem to coincide with

> MDS related slow requests in the monitor ceph.log such as:

>

>> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 :

>> cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13

>> 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019

>> 3:7cea5bac:::10001a88b8f.00000000:head v 14085'846936) currently commit_sent

>> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 :

>> cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13

>> 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020

>> 2:43bdcc3f:::10001e91a91.00000000:head v 14085'21394) currently commit_sent

>> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 :

>> cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13

>> 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077

>> 2:6df955ef:::10001e93542.000000c4:head v 14085'21296) currently commit_sent

>> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 :

>> cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13

>> 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019

>> 2:a9a56101:::10001e93542.000000c8:head v 14085'20437) currently commit_sent

>> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 :

>> cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13

>> 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055

>> 2:57096bbf:::10001e93542.000000d8:head v 14085'21147) currently commit_sent
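
To get a quick per-OSD view of warnings like the ones quoted above, a small script along these lines might help. It is only a sketch: the regular expression is written against the log format shown in this thread, with each entry joined onto a single line, and the names SLOW_RE and summarize_slow_requests are made up for illustration.

```python
import re
from collections import defaultdict

# Matches one "slow request" warning as it appears in ceph.log above.
SLOW_RE = re.compile(
    r"(?P<osd>osd\.\d+) \S+ \d+ : cluster \[WRN\] slow request "
    r"(?P<age>[\d.]+) seconds old, .*?"
    r"(?P<op>osd_\w+)\(\S+ (?P<pg>[0-9a-f]+\.[0-9a-f]+) .*"
    r"currently (?P<state>\w+)"
)


def summarize_slow_requests(lines):
    """Group slow-request warnings by reporting OSD."""
    per_osd = defaultdict(
        lambda: {"count": 0, "max_age": 0.0, "pgs": set(), "states": set()}
    )
    for line in lines:
        match = SLOW_RE.search(line)
        if match is None:
            continue  # not a slow-request warning
        entry = per_osd[match["osd"]]
        entry["count"] += 1
        entry["max_age"] = max(entry["max_age"], float(match["age"]))
        entry["pgs"].add(match["pg"])
        entry["states"].add(match["state"])
    return dict(per_osd)


# Two entries copied from the log excerpt above, joined onto one line each.
SAMPLE = [
    "2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 : "
    "cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13 "
    "13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019 "
    "3:7cea5bac:::10001a88b8f.00000000:head v 14085'846936) currently commit_sent",
    "2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 : "
    "cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13 "
    "14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077 "
    "2:6df955ef:::10001e93542.000000c4:head v 14085'21296) currently commit_sent",
]

for osd, info in sorted(summarize_slow_requests(SAMPLE).items()):
    print(osd, info["count"], info["max_age"], sorted(info["pgs"]))
```

If the same few OSDs or PGs keep showing up across a day's log, that narrows down where to look with iostat or the OSD admin socket.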

>

>

> --

>

> Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule

> 1" and "setfilelockrule 2":

>

>> 2018-03-13 14:23:00.446905 7fde43e73700  4 mds.0.server

>> handle_client_request client_request(client.9174621:141162337

>> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120,

>> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155,

>> caller_gid=1131{}) v2

>> 2018-03-13 14:23:00.447050 7fde43e73700  4 mds.0.server

>> handle_client_request client_request(client.9174621:141162338

>> setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0,

>> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,

>> caller_gid=0{}) v2

>> 2018-03-13 14:23:00.447258 7fde43e73700  4 mds.0.server

>> handle_client_request client_request(client.9174621:141162339

>> setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0,

>> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,

>> caller_gid=0{}) v2

>> 2018-03-13 14:23:00.447393 7fde43e73700  4 mds.0.server

>> handle_client_request client_request(client.9174621:141162340

>> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124,

>> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155,

>> caller_gid=1131{}) v2
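
To see how the setfilelockrule requests are spread across inodes and clients, the MDS log entries quoted above can be tallied with something like the following. This is a sketch under assumptions: the pattern is derived only from the entries shown in this thread (joined onto one line each), and LOCK_RE and count_lock_requests are illustrative names, not part of any Ceph tool.

```python
import re
from collections import Counter

# Matches a setfilelockrule client_request as printed by the MDS at debug 4.
LOCK_RE = re.compile(
    r"client_request\((?P<client>client\.\d+):\d+ "
    r"setfilelockrule (?P<rule>\d+), type (?P<type>\d+), "
    r"owner (?P<owner>\d+), pid (?P<pid>\d+), start (?P<start>\d+), "
    r"length (?P<length>\d+), wait (?P<wait>\d+) #(?P<ino>0x[0-9a-f]+)"
)


def count_lock_requests(lines):
    """Count lock requests per (inode, client, rule) triple."""
    counts = Counter()
    for line in lines:
        match = LOCK_RE.search(line)
        if match:
            counts[(match["ino"], match["client"], match["rule"])] += 1
    return counts


# Two entries copied from the MDS log excerpt above, one line each.
SAMPLE = [
    "2018-03-13 14:23:00.446905 7fde43e73700  4 mds.0.server "
    "handle_client_request client_request(client.9174621:141162337 "
    "setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120, "
    "length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 "
    "caller_uid=1155, caller_gid=1131{}) v2",
    "2018-03-13 14:23:00.447050 7fde43e73700  4 mds.0.server "
    "handle_client_request client_request(client.9174621:141162338 "
    "setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0, "
    "length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 "
    "caller_uid=0, caller_gid=0{}) v2",
]

for key, n in count_lock_requests(SAMPLE).most_common():
    print(key, n)
```

A handful of inodes dominating the counts would point at specific files being locked repeatedly by the Samba client.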

The MDS reporting slow requests when file locking is in use is a bug; the

ticket is:

http://tracker.ceph.com/issues/22428

Probably only indirectly related to the stuck OSD requests: perhaps

the application itself is having trouble promptly releasing locks

because it is hung up on flushing its data to slow OSDs.

John

>

> --

>

> I don't have a particularly good monitoring set up on this cluster yet, but

> a cursory look at a few things such as iostat doesn't seem to suggest OSDs

> are being hammered.

>

> Some questions:

>

> 1) Can anyone recommend a way of diagnosing this issue?

> 2) Are the multiple "setfilelockrule" per inode to be expected? I assume

> this is something to do with the Samba oplocks.

> 3) What's the recommended highest MDS debug setting before performance

> starts to be adversely affected (I'm aware log files will get huge)?

> 4) What's the best way of matching inodes in the MDS log to the file names

> in cephfs?
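
On question 4, one approach that should work with a mounted file system: the number after "#" in the MDS log (and the part before the dot in object names such as 10001e93542.000000c4) is the inode number in hex, and the client normally reports the same number as st_ino. Below is a minimal sketch that converts the hex value and walks a directory tree looking for it, roughly what `find <mount> -inum <decimal>` does. The function names are made up for illustration, and a full walk can be very slow on a large tree.

```python
import os


def inode_from_object_name(obj_name):
    """'10001e93542.000000c4' -> inode number (hex part before the dot)."""
    ino_hex, _, _ = obj_name.partition(".")
    return int(ino_hex, 16)


def inode_from_mds_log(token):
    """'#0x10001e8dc37' -> inode number."""
    return int(token.lstrip("#"), 16)


def find_paths_by_inode(root, ino):
    """Yield every path under root whose inode number equals ino."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ino == ino:
                    yield path
            except OSError:
                continue  # entry vanished or is unreadable; skip it


# Decimal value to pass to `find -inum`, or to find_paths_by_inode().
print(inode_from_mds_log("#0x10001e8dc37"))
```

For example, list(find_paths_by_inode("/mnt/cephfs/some/share", inode_from_mds_log("#0x10001e8dc37"))) would search one share, where the mount path is a placeholder for your own.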

>

> Hardware/Versions:

>

> Luminous 12.1.1

> Cephfs client 3.10.0-514.2.2.el7.x86_64

> Samba 4.4.4

> 4 node cluster, each node 1xIntel 3700 NVME, 12x SATA, 40Gbps networking

>

> Thanks in advance!

>

> Cheers,

> David

>

>

>

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>
