On Thu, Dec 7, 2017 at 3:40 PM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> we have upgraded our cluster to luminous 12.2.2 and wanted to use a second
> MDS for HA purposes. The upgrade itself went well, and setting up the
> second MDS from the former standby-replay configuration worked, too.
>
> But under load both MDS got stuck and needed to be restarted. It starts
> with slow requests:
>
> 2017-12-06 20:26:25.756475 7fddc4424700 0 log_channel(cluster) log [WRN] : slow request 122.370227 seconds old, received at 2017-12-06 20:24:23.386136: client_request(client.15057265:2898 getattr pAsLsXsFs #0x100009de0f2 2017-12-06 20:24:23.244096 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting
>
> 0x100009de0f2 is the inode id of the directory we mount as root on most
> clients. Running daemonperf for both MDS shows a rising number of journal
> segments, accompanied by the corresponding warnings in the ceph log. We
> also see other slow requests:
>
> 2017-12-06 20:26:25.756488 7fddc4424700 0 log_channel(cluster) log [WRN] : slow request 180.346068 seconds old, received at 2017-12-06 20:23:25.410295: client_request(client.15163105:549847914 getattr pAs #0x100009de0f2/sge-tmp 2017-12-06 20:23:25.406481 caller_uid=1426, caller_gid=1008{}) currently failed to authpin local pins
>
> This is a client accessing a subdirectory of the mount point.
>
> On the client side (various Ubuntu kernels using the kernel-based cephfs
> client) this leads to CPU lockups if the problem is not fixed quickly
> enough. The clients need a hard reboot to recover.
>
> We have mitigated the problem by disabling the second MDS. The MDS-related
> configuration is:
>
> [mds.ceph-storage-04]
> mds_replay_interval = 10
> mds_cache_memory_limit = 10737418240
>
> [mds]
> mds_beacon_grace = 60
> mds_beacon_interval = 4
> mds_session_timeout = 120
>
> The data pool is on replicated HDD storage, the metadata pool on
> replicated NVMe storage. The MDS daemons are colocated with OSDs (12 HDD
> OSDs + 2 NVMe OSDs, 128 GB RAM).
>
> The questions are:
>
> - what is the minimum kernel version on clients required for multi-MDS
> setups?

4.13.

> - is the problem described above a known problem, e.g. a result of
> http://tracker.ceph.com/issues/21975 ?

No. Were there other warnings, such as "Client X failing to respond to
capability release"? If there were, they are likely the cause of the
problem. (Sketches for checking this and for falling back to a single
active MDS follow below.)

> Regards,
>
> Burkhard Linke
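
A rough sketch of how to check both points. The daemon name
mds.ceph-storage-04 is taken from the config above; adjust the names to
your cluster, this has not been verified against 12.2.2:

  # On each client: the kernel cephfs client needs >= 4.13 for multi-MDS.
  uname -r

  # On a monitor host: look for clients that are not releasing caps
  # (the MDS_CLIENT_LATE_RELEASE health warning on luminous).
  ceph health detail | grep -i 'failing to respond to capability release'

  # On the MDS host: list the client sessions; newer kernel clients report
  # their kernel version in the session metadata.
  ceph daemon mds.ceph-storage-04 session ls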
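
And the mitigation Burkhard already applied (dropping back to a single
active MDS) would look roughly like this on luminous; "cephfs" stands in
here for the actual filesystem name:

  # Reduce the number of active MDS ranks to one ...
  ceph fs set cephfs max_mds 1

  # ... and explicitly stop rank 1 so it returns to standby (this extra
  # step is still required on luminous).
  ceph mds deactivate 1

  # Verify that only one rank remains active afterwards.
  ceph fs status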