Dear Community,

we are running a Ceph Luminous cluster with CephFS (Bluestore OSDs). During setup, we made the mistake of configuring the OSDs on RAID volumes. Initially our cluster consisted of 3 nodes, each housing 1 OSD. We are currently in the process of remediating this.

After a loss of metadata (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025612.html) due to resetting the journal (journal entries were not being flushed fast enough), we managed to bring the cluster back up and started adding 2 additional nodes (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027563.html). After adding the two additional nodes, we increased the number of placement groups, not only to accommodate the new nodes but also to prepare for the reinstallation of the misconfigured ones. Since then, the number of placement groups per OSD has of course been too high. Despite this, cluster health remained fine over the last few months.

However, we are currently observing massive problems: whenever we try to access any folder via CephFS, e.g. by listing its contents, there is no response. Clients are getting blacklisted, but there is no warning. 'ceph -s' shows everything is OK, except for the number of PGs being too high. If I grep for "assert" or "error" in any of the logs, nothing comes up. Also, it is not possible to reduce the number of active MDS daemons to 1: after issuing 'ceph fs set fs_data max_mds 1', nothing happens.
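For reference, these are the commands I have been using to poke at the problem. My reading of the Luminous docs is that lowering max_mds alone does not stop a rank and the surplus rank must be deactivated explicitly, so maybe I am missing that step (fs_data is our filesystem; the MDS daemon name below is a placeholder):

```shell
# Check whether any clients are currently on the OSD blacklist
ceph osd blacklist ls

# Under Luminous, lowering max_mds alone does not stop rank 1;
# if I read the docs correctly, the rank must be deactivated explicitly:
ceph fs set fs_data max_mds 1
ceph mds deactivate fs_data:1

# Watch the ranks / state of the filesystem
ceph fs status fs_data

# On the MDS host, inspect stuck operations and subtree assignment
# via the admin socket (replace <name> with the actual daemon name):
ceph daemon mds.<name> dump_ops_in_flight
ceph daemon mds.<name> get subtrees
```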
Cluster details are available here: https://gitlab.uni-trier.de/snippets/77

The MDS log (https://gitlab.uni-trier.de/snippets/79?expanded=true&viewer=simple) contains no "nicely exporting to" messages as usual, but instead entries like this:

2019-02-15 08:44:52.464926 7fdb13474700 7 mds.0.server try_open_auth_dirfrag: not auth for [dir 0x100011ce7c6 /home/r-admin/ [2,head] rep@1.1 dir_auth=1 state=0 f(v4 m2019-02-14 13:19:41.300993 80=48+32) n(v11339 rc2019-02-14 13:19:41.300993 b10116465260 10869=10202+667) hs=7+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x564343eed100], fw to mds.1

The update from 12.2.8 to 12.2.11 that I ran last week didn't help either.

Does anybody have an idea or a hint where I could look next? Any help would be greatly appreciated!

Kind regards

Christian Hennen
Project Manager Infrastructural Services
Germany
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com