Re: CephFS: client hangs

I know this may sound simple.

Have you tried raising the PG-per-OSD limit? I'm fairly sure I have seen people in the past with the same kind of issue, and it turned out to be I/O being blocked by the limit without anything being actively logged.

mon_max_pg_per_osd = 400 

Add it to ceph.conf and then restart all the services, or inject the setting into the running daemons.
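
Roughly like this (from memory, so please double-check the syntax against your version):

    # ceph.conf, [global] section
    mon_max_pg_per_osd = 400

    # or inject it into the running mons without a restart
    ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'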

On Mon, Feb 18, 2019 at 10:55 PM Hennen, Christian <christian.hennen@xxxxxxxxxxxx> wrote:

Dear Community,

 

we are running a Ceph Luminous cluster with CephFS (Bluestore OSDs). During setup, we made the mistake of configuring the OSDs on RAID volumes. Initially our cluster consisted of 3 nodes, each housing 1 OSD, and we are currently in the process of remediating this. After a loss of metadata (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025612.html) caused by resetting the journal (journal entries were not being flushed fast enough), we managed to bring the cluster back up and started adding 2 additional nodes (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027563.html).

 

After adding the two additional nodes, we increased the number of placement groups, not only to accommodate the new nodes, but also to prepare for reinstallation of the misconfigured nodes. Since then, the number of placement groups per OSD has of course been too high. Despite this, cluster health remained fine over the last few months.
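
(For completeness, the increase was done with the usual pool commands, roughly as below; <pool> and <count> stand for our data/metadata pools and the new PG count:)

    ceph osd pool set <pool> pg_num <count>
    ceph osd pool set <pool> pgp_num <count>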

 

However, we are currently observing massive problems: Whenever we try to access any folder via CephFS, e.g. by listing its contents, there is no response. Clients are getting blacklisted, but there is no warning. ceph -s shows everything is OK, except for the number of PGs being too high. If I grep for "assert" or "error" in any of the logs, nothing comes up. Also, it is not possible to reduce the number of active MDS to 1: after issuing 'ceph fs set fs_data max_mds 1', nothing happens.
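
In case it helps, these are roughly the checks I have been poking at (commands from memory; fs_data is our file system name, <id> stands for the local MDS daemon name):

    ceph fs status fs_data              # MDS ranks and their states
    ceph fs get fs_data | grep max_mds  # check whether max_mds actually changed
    ceph osd blacklist ls               # currently blacklisted client addresses
    ceph daemon mds.<id> session ls     # client sessions as seen by the MDS (run on the MDS host)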

 

Cluster details are available here: https://gitlab.uni-trier.de/snippets/77

 

The MDS log (https://gitlab.uni-trier.de/snippets/79?expanded=true&viewer=simple) contains none of the usual "nicely exporting to" messages, but instead entries like this:

2019-02-15 08:44:52.464926 7fdb13474700  7 mds.0.server try_open_auth_dirfrag: not auth for [dir 0x100011ce7c6 /home/r-admin/ [2,head] rep@1.1 dir_auth=1 state=0 f(v4 m2019-02-14 13:19:41.300993 80=48+32) n(v11339 rc2019-02-14 13:19:41.300993 b10116465260 10869=10202+667) hs=7+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x564343eed100], fw to mds.1

 

The update from 12.2.8 to 12.2.11 that I ran last week didn't help.

 

Does anybody have an idea or a hint as to where I could look next? Any help would be greatly appreciated!

 

Kind regards

Christian Hennen

 

Project Manager Infrastructural Services
ZIMK University of Trier

Germany

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
