ceph luminous + multi mds: slow request, behind on trimming, failed to authpin local pins

Hi,


we have upgraded our cluster to luminous 12.2.2 and wanted to use a second MDS for HA purposes. The upgrade itself went well, and setting up the second MDS from the former standby-replay configuration worked, too.
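For completeness, the second active rank was brought up roughly along these lines (a sketch, not a verbatim transcript; "cephfs" stands in for our actual filesystem name):

# Luminous still has a per-filesystem flag gating multiple active MDS daemons;
# set it if it is not already enabled
ceph fs set cephfs allow_multimds true

# raise the number of active ranks; the former standby-replay daemon gets promoted to rank 1
ceph fs set cephfs max_mds 2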


But under load both MDS daemons get stuck and need to be restarted. It starts with slow requests:


2017-12-06 20:26:25.756475 7fddc4424700  0 log_channel(cluster) log [WRN] : slow request 122.370227 seconds old, received at 2017-12-06 20:24:23.386136: client_request(client.15057265:2898 getattr pAsLsXsFs #0x100009de0f2 2017-12-06 20:24:23.244096 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting


0x100009de0f2 is the inode id of the directory we mount as root on most clients. Running daemonperf for both MDS daemons shows a rising number of journal segments, accompanied by the corresponding "behind on trimming" warnings in the ceph log. We also see other slow requests:

2017-12-06 20:26:25.756488 7fddc4424700  0 log_channel(cluster) log [WRN] : slow request 180.346068 seconds old, received at 2017-12-06 20:23:25.410295: client_request(client.15163105:549847914 getattr pAs #0x100009de0f2/sge-tmp 2017-12-06 20:23:25.406481 caller_uid=1426, caller_gid=1008{}) currently failed to authpin local pins

This is a client accessing a subdirectory of the mount point.
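For reference, the growing journal segment count can be watched like this (a sketch; mds.ceph-storage-04 is one of our daemons, adjust the name as needed):

# live per-second counters on the MDS host; the journal segment count is in the mds_log group
ceph daemonperf mds.ceph-storage-04

# the same counters as a one-off dump via the admin socket
ceph daemon mds.ceph-storage-04 perf dump mds_log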


On the client side (various Ubuntu kernels using the kernel-based CephFS client) this leads to CPU lockups if the problem is not resolved fast enough. The clients then need a hard reboot to recover.


We have mitigated the problem by disabling the second MDS (roughly as sketched below, after the configuration). The MDS-related configuration is:


[mds.ceph-storage-04]
mds_replay_interval = 10
mds_cache_memory_limit = 10737418240

[mds]
mds_beacon_grace = 60
mds_beacon_interval = 4
mds_session_timeout = 120
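
Dropping back to a single active MDS was done roughly as follows (again a sketch, "cephfs" being a placeholder for the real filesystem name):

# reduce the number of active ranks back to one
ceph fs set cephfs max_mds 1

# on Luminous the now-surplus rank 1 has to be stopped explicitly
ceph mds deactivate cephfs:1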


The data pool is on replicated HDD storage, the metadata pool on replicated NVMe storage. The MDS daemons are colocated with OSDs (12 HDD OSDs + 2 NVMe OSDs, 128 GB RAM).


The questions are:

- what is the minimum kernel version required on clients for multi-MDS setups?

- is the problem described above a known problem, e.g. a result of http://tracker.ceph.com/issues/21975?


Regards,

Burkhard Linke





