Hello Gregory,

in the meantime, I managed to break it further :( I tried getting rid
of active+remapped pgs and got some undersized instead.. not sure
whether this can be related.. anyways here's the status:

ceph -s
    cluster ff21618e-5aea-4cfe-83b6-a0d2d5b4052a
     health HEALTH_WARN
            3 pgs degraded
            2 pgs stale
            3 pgs stuck degraded
            1 pgs stuck inactive
            2 pgs stuck stale
            242 pgs stuck unclean
            3 pgs stuck undersized
            3 pgs undersized
            recovery 65/3374343 objects degraded (0.002%)
            recovery 186187/3374343 objects misplaced (5.518%)
            mds0: Behind on trimming (155/30)
     monmap e3: 3 mons at {remrprv1a=10.0.0.1:6789/0,remrprv1b=10.0.0.2:6789/0,remrprv1c=10.0.0.3:6789/0}
            election epoch 522, quorum 0,1,2 remrprv1a,remrprv1b,remrprv1c
     mdsmap e342: 1/1/1 up {0=remrprv1c=up:active}, 2 up:standby
     osdmap e4385: 21 osds: 21 up, 21 in; 238 remapped pgs
      pgmap v18679192: 1856 pgs, 7 pools, 4223 GB data, 1103 kobjects
            12947 GB used, 22591 GB / 35538 GB avail
            65/3374343 objects degraded (0.002%)
            186187/3374343 objects misplaced (5.518%)
                1612 active+clean
                 238 active+remapped
                   3 active+undersized+degraded
                   2 stale+active+clean
                   1 creating
  client io 0 B/s rd, 40830 B/s wr, 17 op/s

> What's the full output of "ceph -s"? Have you looked at the MDS admin
> socket at all -- what state does it say it's in?

[root@remrprv1c ceph]# ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.3052096:83 getattr Fs #10000000288 2016-02-03 10:10:46.361591 RETRY=1)",
            "initiated_at": "2016-02-03 10:23:25.791790",
            "age": 3963.093615,
            "duration": 9.519091,
            "type_data": [
                "failed to rdlock, waiting",
                "client.3052096:83",
                "client_request",
                {
                    "client": "client.3052096",
                    "tid": 83
                },
                [
                    {
                        "time": "2016-02-03 10:23:25.791790",
                        "event": "initiated"
                    },
                    {
                        "time": "2016-02-03 10:23:35.310881",
                        "event": "failed to rdlock, waiting"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}

Seems there's some lock stuck here.. Killing the stuck client (it's
postgres trying to access a cephfs file) doesn't help..

> -Greg
>
> > My question here is:
> >
> > 1) is there some known issue with hammer 0.94.5 or kernel 4.1.15
> > which could lead to cephfs hangs?
> >
> > 2) what can I do to debug what is the cause of this hang?
> >
> > 3) is there a way to recover this without hard resetting
> > the node with the hung cephfs mount?
> >
> > If I can provide more information, please let me know.
> >
> > I'd really appreciate any help
> >
> > with best regards
> >
> > nik

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
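PS: in case it helps, the stuck PGs can be drilled into with the stock
CLI (just a sketch; the PG id in the last command is a placeholder,
substitute a real one from the dump_stuck output):

    ceph health detail               # per-PG breakdown of the HEALTH_WARN items
    ceph pg dump_stuck unclean       # the 242 stuck unclean PGs
    ceph pg dump_stuck undersized    # the 3 undersized ones
    ceph pg 1.2f3 query              # full peering/recovery state of one PG (placeholder id)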
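Also, if I read the "Behind on trimming (155/30)" warning right, it
means the MDS journal holds 155 segments against a trim target of 30
(mds_log_max_segments). Checking the current value and raising it at
runtime gives the MDS some headroom to catch up -- note injectargs is
not persisted across restarts, and 200 below is just an example value:

    # current trim target on the active MDS
    ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok config get mds_log_max_segments
    # raise it temporarily so trimming can catch up
    ceph tell mds.0 injectargs '--mds-log-max-segments 200'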
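And to tie the blocked op to a client: "session ls" on the MDS admin
socket should list the session for client.3052096 seen above, and on
the hung node the kernel client exposes its pending requests via
debugfs (assuming debugfs is mounted at /sys/kernel/debug):

    # on the MDS node: list client sessions
    ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok session ls
    # on the hung client node (kernel mount):
    cat /sys/kernel/debug/ceph/*/mdsc   # requests stuck waiting on the MDS
    cat /sys/kernel/debug/ceph/*/osdc   # requests stuck waiting on OSDs

Newer releases also have a "session evict <id>" command on the MDS
admin socket; I'm not sure it's available in hammer 0.94.5.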
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com