On Wed, Feb 3, 2016 at 2:32 AM, Nikola Ciprich <nikola.ciprich@xxxxxxxxxxx> wrote:
> Hello Gregory,
>
> in the meantime, I managed to break it further :(
>
> I tried getting rid of active+remapped pgs and got some undersized
> instead.. not sure whether this can be related..
>
> anyways, here's the status:
>
> ceph -s
>     cluster ff21618e-5aea-4cfe-83b6-a0d2d5b4052a
>      health HEALTH_WARN
>             3 pgs degraded
>             2 pgs stale
>             3 pgs stuck degraded
>             1 pgs stuck inactive
>             2 pgs stuck stale
>             242 pgs stuck unclean
>             3 pgs stuck undersized
>             3 pgs undersized
>             recovery 65/3374343 objects degraded (0.002%)
>             recovery 186187/3374343 objects misplaced (5.518%)
>             mds0: Behind on trimming (155/30)
>      monmap e3: 3 mons at {remrprv1a=10.0.0.1:6789/0,remrprv1b=10.0.0.2:6789/0,remrprv1c=10.0.0.3:6789/0}
>             election epoch 522, quorum 0,1,2 remrprv1a,remrprv1b,remrprv1c
>      mdsmap e342: 1/1/1 up {0=remrprv1c=up:active}, 2 up:standby
>      osdmap e4385: 21 osds: 21 up, 21 in; 238 remapped pgs
>       pgmap v18679192: 1856 pgs, 7 pools, 4223 GB data, 1103 kobjects
>             12947 GB used, 22591 GB / 35538 GB avail
>             65/3374343 objects degraded (0.002%)
>             186187/3374343 objects misplaced (5.518%)
>                 1612 active+clean
>                  238 active+remapped
>                    3 active+undersized+degraded
>                    2 stale+active+clean
>                    1 creating
>   client io 0 B/s rd, 40830 B/s wr, 17 op/s

Yeah, these inactive PGs are basically guaranteed to be the cause of the
problem. There are lots of threads about getting PGs healthy again; you
should dig around the archives and the documentation troubleshooting
page(s). :)
-Greg
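To give a rough idea of what that troubleshooting usually looks like
(illustrative only -- <pgid> is a placeholder, use the ids your cluster
actually reports):

   ceph health detail            # lists the individual PGs behind each HEALTH_WARN line
   ceph pg dump_stuck inactive   # the stuck-inactive PG (likely the one shown as "creating")
   ceph pg dump_stuck unclean    # the remapped / undersized+degraded ones
   ceph pg <pgid> query          # shows which OSDs the PG maps to and why peering
                                 # or creation is blocked
   ceph osd tree                 # sanity-check that CRUSH can actually place enough
                                 # replicas for every pool

Once the inactive/creating PG is active again, the CephFS requests blocked
behind it should be able to make progress.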
>
>> What's the full output of "ceph -s"? Have you looked at the MDS admin
>> socket at all -- what state does it say it's in?
>
> [root@remrprv1c ceph]# ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok dump_ops_in_flight
> {
>     "ops": [
>         {
>             "description": "client_request(client.3052096:83 getattr Fs #10000000288 2016-02-03 10:10:46.361591 RETRY=1)",
>             "initiated_at": "2016-02-03 10:23:25.791790",
>             "age": 3963.093615,
>             "duration": 9.519091,
>             "type_data": [
>                 "failed to rdlock, waiting",
>                 "client.3052096:83",
>                 "client_request",
>                 {
>                     "client": "client.3052096",
>                     "tid": 83
>                 },
>                 [
>                     {
>                         "time": "2016-02-03 10:23:25.791790",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2016-02-03 10:23:35.310881",
>                         "event": "failed to rdlock, waiting"
>                     }
>                 ]
>             ]
>         }
>     ],
>     "num_ops": 1
> }
>
> seems there's some lock stuck here..
>
> Killing the stuck client (it's postgres trying to access a cephfs file)
> doesn't help..
>
>> -Greg
>>
>> > My questions here are:
>> >
>> > 1) is there some known issue with hammer 0.94.5 or kernel 4.1.15
>> > which could lead to cephfs hangs?
>> >
>> > 2) what can I do to debug what is the cause of this hang?
>> >
>> > 3) is there a way to recover this without hard resetting the
>> > node with the hung cephfs mount?
>> >
>> > If I could provide more information, please let me know.
>> >
>> > I'd really appreciate any help
>> >
>> > with best regards
>> >
>> > nik
>
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.: +420 591 166 214
> fax: +420 596 621 273
> mobil: +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: servis@xxxxxxxxxxx
> -------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com