On Wed, Feb 3, 2016 at 2:32 AM, Nikola Ciprich <nikola.ciprich@xxxxxxxxxxx> wrote:
> Hello Gregory,
>
> in the meantime, I managed to break it further :(
>
> I tried getting rid of active+remapped pgs and got some undersized
> instead.. not sure whether this can be related..
>
> anyways, here's the status:
>
> ceph -s
>     cluster ff21618e-5aea-4cfe-83b6-a0d2d5b4052a
>      health HEALTH_WARN
>             3 pgs degraded
>             2 pgs stale
>             3 pgs stuck degraded
>             1 pgs stuck inactive
>             2 pgs stuck stale
>             242 pgs stuck unclean
>             3 pgs stuck undersized
>             3 pgs undersized
>             recovery 65/3374343 objects degraded (0.002%)
>             recovery 186187/3374343 objects misplaced (5.518%)
>             mds0: Behind on trimming (155/30)
>      monmap e3: 3 mons at {remrprv1a=10.0.0.1:6789/0,remrprv1b=10.0.0.2:6789/0,remrprv1c=10.0.0.3:6789/0}
>             election epoch 522, quorum 0,1,2 remrprv1a,remrprv1b,remrprv1c
>      mdsmap e342: 1/1/1 up {0=remrprv1c=up:active}, 2 up:standby
>      osdmap e4385: 21 osds: 21 up, 21 in; 238 remapped pgs
>       pgmap v18679192: 1856 pgs, 7 pools, 4223 GB data, 1103 kobjects
>             12947 GB used, 22591 GB / 35538 GB avail
>             65/3374343 objects degraded (0.002%)
>             186187/3374343 objects misplaced (5.518%)
>                 1612 active+clean
>                  238 active+remapped
>                    3 active+undersized+degraded
>                    2 stale+active+clean
>                    1 creating
>   client io 0 B/s rd, 40830 B/s wr, 17 op/s

Yeah, these inactive PGs are basically guaranteed to be the cause of the
problem. There are lots of threads about getting PGs healthy again; you
should dig around the archives and the documentation troubleshooting
page(s). :)
-Greg
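To give a rough idea of what that troubleshooting usually looks like
(illustrative only -- <pgid> is a placeholder, use the ids your cluster
actually reports):

   ceph health detail            # lists the individual PGs behind each HEALTH_WARN line
   ceph pg dump_stuck inactive   # the stuck-inactive PG (likely the one shown as "creating")
   ceph pg dump_stuck unclean    # the remapped / undersized+degraded ones
   ceph pg <pgid> query          # shows which OSDs the PG maps to and why peering
                                 # or creation is blocked
   ceph osd tree                 # sanity-check that CRUSH can actually place enough
                                 # replicas for every pool

Once the inactive/creating PG is active again, the CephFS requests blocked
behind it should be able to make progress.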
>
>> What's the full output of "ceph -s"? Have you looked at the MDS admin
>> socket at all -- what state does it say it's in?
>
> [root@remrprv1c ceph]# ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok dump_ops_in_flight
> {
>     "ops": [
>         {
>             "description": "client_request(client.3052096:83 getattr Fs #10000000288 2016-02-03 10:10:46.361591 RETRY=1)",
>             "initiated_at": "2016-02-03 10:23:25.791790",
>             "age": 3963.093615,
>             "duration": 9.519091,
>             "type_data": [
>                 "failed to rdlock, waiting",
>                 "client.3052096:83",
>                 "client_request",
>                 {
>                     "client": "client.3052096",
>                     "tid": 83
>                 },
>                 [
>                     {
>                         "time": "2016-02-03 10:23:25.791790",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2016-02-03 10:23:35.310881",
>                         "event": "failed to rdlock, waiting"
>                     }
>                 ]
>             ]
>         }
>     ],
>     "num_ops": 1
> }
>
> seems there's some lock stuck here..
>
> Killing the stuck client (it's postgres trying to access a cephfs file)
> doesn't help..
>
>> -Greg
>>
>> > My questions here are:
>> >
>> > 1) is there some known issue with hammer 0.94.5 or kernel 4.1.15
>> > which could lead to cephfs hangs?
>> >
>> > 2) what can I do to debug what is the cause of this hang?
>> >
>> > 3) is there a way to recover this without hard resetting the
>> > node with the hung cephfs mount?
>> >
>> > If I could provide more information, please let me know.
>> >
>> > I'd really appreciate any help
>> >
>> > with best regards
>> >
>> > nik
>
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.: +420 591 166 214
> fax: +420 596 621 273
> mobil: +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: servis@xxxxxxxxxxx
> -------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com