I would like to understand why I get the "ceph mds slow requests" / "failing to respond to cache pressure" / "failing to respond to capability release" warnings

I am wondering what the problem is with these warnings. I do not think it 
is really related to capability release, because those messages only show 
up at a later time.

I have two processes affected by this: an rsync on a ceph-fuse mount and 
an rsync on an nfs-ganesha mount, running at the same time. The ceph-fuse 
mount only gets the cache pressure notification at a later time.

I think everything starts to go wrong after the MDS reports the failed 
xlock. This message is being logged:

2020-06-13 03:38:36.981 7fb5edd82700  0 log_channel(cluster) log [WRN] : 
slow request 244.377406 seconds old, received at 2020-06-13 
03:34:32.604412: client_request(client.4019800:20354 setattr 
mtime=2020-04-22 12:58:47.000000 atime=2020-06-13 03:34:32.000000 
#0x100001b9177 2020-06-13 03:34:32.604119 caller_uid=500, 
caller_gid=500{500,1,2,3,4,6,10,}) currently failed to xlock, waiting
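
(For reference, the same blocked ops, including the "failed to xlock, 
waiting" flag point, should also be visible on the active MDS itself. 
Something along these lines, where mds.<name> is a placeholder for the 
actual daemon name:)

# show which MDS is reporting the slow requests
ceph health detail

# dump the ops currently in flight on the active MDS (run on the MDS
# host, via the admin socket); blocked ops list their flag_point here
ceph daemon mds.<name> dump_ops_in_flight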

From the charts here you can see that caps are climbing and inodes are 
dropping:
https://snapshot.raintank.io/dashboard/snapshot/4ij6AF1JoDzdZNI6WzCyewkn7OqZdJbG?orgId=2
(I am not entirely sure the units used are correct, and the ino/caps 
chart should have a dual y-axis.)
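
(If the chart units are in doubt: I believe the same counters can also be 
read from the MDS admin socket, in the "mds_mem" section of the perf 
counters; mds.<name> is again a placeholder:)

# per-MDS performance counters; the "mds_mem" section should carry the
# inode ("ino") and capability ("cap") counts plotted above
ceph daemon mds.<name> perf dump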

It looks like I can trigger this problem by starting multiple concurrent 
rsyncs on the nfs-ganesha mount. Every time, the same xlock seems to be 
listed in the MDS log.

2020-06-13 17:32:24.920 7fb5edd82700  0 log_channel(cluster) log [WRN] : 
slow request 240.294505 seconds old, received at 2020-06-13 
17:28:24.626608: client_request(client.4021284:2468 setattr 
mtime=2020-04-22 12:58:47.000000 atime=2020-06-13 17:28:24.000000 
#0x100001b9177 2020-06-13 17:28:24.626527 caller_uid=500, 
caller_gid=500{500,1,2,3,4,6,10,}) currently failed to xlock, waiting
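
(To check which session the client id in the warning belongs to, e.g. 
whether client.4021284 is one of the nfs-ganesha instances or the 
ceph-fuse mount, the session list should help; mds.<name> is a 
placeholder:)

# list client sessions with their id, entity/hostname and num_caps
ceph daemon mds.<name> session ls

# the same through the mon, without needing the admin socket
ceph tell mds.<name> session ls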


Question: I have seen this workaround/solution [1] being offered a lot, 
but I do not understand why I am hitting the xlock in the first place. 
When does one get an xlock?

I think that if I can prevent the xlock, I do not need to set the osd op 
queue options.


[1]
https://www.mail-archive.com/ceph-users@xxxxxxx/msg04421.html
With the work-around:
osd op queue = wpq
osd op queue cut off = high
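
(If I do end up applying it, my understanding is that these go into the 
[osd] section of ceph.conf, or centrally via the config store, and that 
the OSDs have to be restarted before osd_op_queue actually changes:)

# set centrally; osd_op_queue is read at OSD start-up, so the OSDs still
# need a restart for the new queue implementation to take effect
ceph config set osd osd_op_queue wpq
ceph config set osd osd_op_queue_cut_off high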





