Re: emperor -> firefly 0.80.7 upgrade problem

Incomplete usually means the pgs do not have any complete copies.  Did
you previously have more osds?
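If it helps, listing the stuck pgs and then querying one of them should show
why it is incomplete; the recovery_state section of the query output lists the
osds the pg still wants to probe.  (The pg id below is only a placeholder,
substitute one from the dump_stuck output.)

# ceph pg dump_stuck inactive
# ceph pg 4.1a7 query
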
-Sam

On Tue, Nov 4, 2014 at 7:37 AM, Chad Seys <cwseys@xxxxxxxxxxxxxxxx> wrote:
> On Monday, November 03, 2014 17:34:06 you wrote:
>> If you have osds that are close to full, you may be hitting 9626.  I
>> pushed a branch based on v0.80.7 with the fix, wip-v0.80.7-9626.
>> -Sam
>
> Thanks Sam.  I may have been hitting that as well; I certainly hit too_full
> conditions often.  I am able to squeeze PGs off of the too_full OSD by
> reweighting it, and eventually all PGs get to where they want to be.  It is a
> bit silly that I have to do this manually, though.  Could Ceph order the PG
> movements better?  (Is this what your bug fix does, in effect?)
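>
> (For reference, the manual shuffling is roughly the following; the osd id and
> weight below are only examples.  ceph health detail names any near-full osds
> once they cross the warning threshold, and lowering the override weight a
> little lets backfill drain some PGs off that osd:)
>
> # ceph health detail | grep -i full
> # ceph osd reweight 12 0.9
>
> (ceph osd reweight-by-utilization should do roughly the same thing in bulk.)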
>
>
> So, at the moment no PGs are moving around the cluster, but not all of them
> are active+clean.  Also, there is one OSD which has blocked requests.  The OSD
> seems idle, and restarting it just results in a younger blocked request.
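>
> (In case it is useful, the blocked op itself can be dumped through the osd
> admin socket; the socket path below assumes the default location:)
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok dump_ops_in_flight
> # ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok dump_historic_ops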
>
> ~# ceph -s
>     cluster 7797e50e-f4b3-42f6-8454-2e2b19fa41d6
>      health HEALTH_WARN 35 pgs down; 208 pgs incomplete; 210 pgs stuck
> inactive; 210 pgs stuck unclean; 1 requests are blocked > 32 sec
>      monmap e3: 3 mons at
> {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0},
> election epoch 2996, quorum 0,1,2 mon01,mon02,mon03
>      osdmap e115306: 24 osds: 24 up, 24 in
>       pgmap v6630195: 8704 pgs, 7 pools, 6344 GB data, 1587 kobjects
>             12747 GB used, 7848 GB / 20596 GB avail
>                    2 inactive
>                 8494 active+clean
>                  173 incomplete
>                   35 down+incomplete
>
> # ceph health detail
> ...
> 1 ops are blocked > 8388.61 sec
> 1 ops are blocked > 8388.61 sec on osd.15
> 1 osds have slow requests
>
> from the log of the osd with the blocked request (osd.15):
> 2014-11-04 08:57:26.851583 7f7686331700  0 log [WRN] : 1 slow requests, 1
> included below; oldest blocked for > 3840.430247 secs
> 2014-11-04 08:57:26.851593 7f7686331700  0 log [WRN] : slow request
> 3840.430247 seconds old, received at 2014-11-04 07:53:26.421301:
> osd_op(client.11334078.1:592 rb.0.206609.238e1f29.0000000752e8 [read 512~512]
> 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg
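>
> (The object in that op can be mapped back to its pg to check whether it lands
> on one of the incomplete pgs; the pool name below is only a placeholder for
> whichever pool has id 4:)
>
> # ceph osd map rbd rb.0.206609.238e1f29.0000000752e8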
>
>
> Other requests (like PG scrubs) complete on this OSD without taking a long
> time.
> Also, this was one of the OSDs which I completely drained, removed from ceph,
> reformatted, and created again using ceph-deploy.  So it was created entirely
> by firefly 0.80.7 code.
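>
> (The drain/recreate procedure I followed was roughly the standard one; the
> osd id, hostname and device below are only examples:)
>
> # ceph osd out 15
>   ... wait for backfill to finish, then stop the osd daemon ...
> # ceph osd crush remove osd.15
> # ceph auth del osd.15
> # ceph osd rm 15
> # ceph-deploy osd create node15:/dev/sdb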
>
>
> As Greg requested, output of ceph scrub:
>
> 2014-11-04 09:25:58.761602 7f6c0e20b700  0 mon.mon01@0(leader) e3
> handle_command mon_command({"prefix": "scrub"} v 0) v1
> 2014-11-04 09:26:21.320043 7f6c0ea0c700  1 mon.mon01@0(leader).paxos(paxos
> updating c 11563072..11563575) accept timeout, calling fresh election
> 2014-11-04 09:26:31.264873 7f6c0ea0c700  0
> mon.mon01@0(probing).data_health(2996) update_stats avail 38% total 6948572
> used 3891232 avail 2681328
> 2014-11-04 09:26:33.529403 7f6c0e20b700  0 log [INF] : mon.mon01 calling new
> monitor election
> 2014-11-04 09:26:33.538286 7f6c0e20b700  1 mon.mon01@0(electing).elector(2996)
> init, last seen epoch 2996
> 2014-11-04 09:26:38.809212 7f6c0ea0c700  0 log [INF] : mon.mon01@0 won leader
> election with quorum 0,2
> 2014-11-04 09:26:40.215095 7f6c0e20b700  0 log [INF] : monmap e3: 3 mons at
> {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}
> 2014-11-04 09:26:40.215754 7f6c0e20b700  0 log [INF] : pgmap v6630201: 8704
> pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete;
> 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
> 2014-11-04 09:26:40.215913 7f6c0e20b700  0 log [INF] : mdsmap e1: 0/0/1 up
> 2014-11-04 09:26:40.216621 7f6c0e20b700  0 log [INF] : osdmap e115306: 24
> osds: 24 up, 24 in
> 2014-11-04 09:26:41.227010 7f6c0e20b700  0 log [INF] : pgmap v6630202: 8704
> pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete;
> 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
> 2014-11-04 09:26:41.367373 7f6c0e20b700  1 mon.mon01@0(leader).osd e115307
> e115307: 24 osds: 24 up, 24 in
> 2014-11-04 09:26:41.437706 7f6c0e20b700  0 log [INF] : osdmap e115307: 24
> osds: 24 up, 24 in
> 2014-11-04 09:26:41.471558 7f6c0e20b700  0 log [INF] : pgmap v6630203: 8704
> pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete;
> 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
> 2014-11-04 09:26:41.497318 7f6c0e20b700  1 mon.mon01@0(leader).osd e115308
> e115308: 24 osds: 24 up, 24 in
> 2014-11-04 09:26:41.533965 7f6c0e20b700  0 log [INF] : osdmap e115308: 24
> osds: 24 up, 24 in
> 2014-11-04 09:26:41.553161 7f6c0e20b700  0 log [INF] : pgmap v6630204: 8704
> pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete;
> 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
> 2014-11-04 09:26:42.701720 7f6c0e20b700  1 mon.mon01@0(leader).osd e115309
> e115309: 24 osds: 24 up, 24 in
> 2014-11-04 09:26:42.953977 7f6c0e20b700  0 log [INF] : osdmap e115309: 24
> osds: 24 up, 24 in
> 2014-11-04 09:26:45.776411 7f6c0e20b700  0 log [INF] : pgmap v6630205: 8704
> pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete;
> 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
> 2014-11-04 09:26:46.767534 7f6c0e20b700  1 mon.mon01@0(leader).osd e115310
> e115310: 24 osds: 24 up, 24 in
> 2014-11-04 09:26:46.817764 7f6c0e20b700  0 log [INF] : osdmap e115310: 24
> osds: 24 up, 24 in
> 2014-11-04 09:26:47.593483 7f6c0e20b700  0 log [INF] : pgmap v6630206: 8704
> pgs: 2 inactive, 8489 active+clean, 1 peering, 173 incomplete, 4
> remapped, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB
> avail
> 2014-11-04 09:26:48.170586 7f6c0e20b700  0 log [INF] : pgmap v6630207: 8704
> pgs: 2 inactive, 8489 active+clean, 1 peering, 173 incomplete, 4
> remapped, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB
> avail
> 2014-11-04 09:26:48.381781 7f6c0e20b700  1 mon.mon01@0(leader).osd e115311
> e115311: 24 osds: 24 up, 24 in
> 2014-11-04 09:26:48.484570 7f6c0e20b700  0 log [INF] : osdmap e115311: 24
> osds: 24 up, 24 in
> 2014-11-04 09:26:48.857188 7f6c0e20b700  1 mon.mon01@0(leader).log v4718896
> check_sub sending message to client.11353722 128.104.164.197:0/1007270
> with 1 entries (version 4718896)
> 2014-11-04 09:26:50.565461 7f6c0e20b700  0 log [INF] : pgmap v6630208: 8704
> pgs: 8491 active+clean, 1 peering, 173 incomplete, 4 remapped, 35
> down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
> 2014-11-04 09:26:51.432688 7f6c0e20b700  1 mon.mon01@0(leader).log v4718897
> check_sub sending message to client.11353722 128.104.164.197:0/1007270 with 3
> entries (version 4718897)
> 2014-11-04 09:26:51.476778 7f6c0e20b700  1 mon.mon01@0(leader).osd e115312
> e115312: 24 osds: 24 up, 24 in
> [... not sure how much to include ...]
>
>
> Looks like that cleared up two inactive PGs...
>
> # ceph -s
>     cluster 7797e50e-f4b3-42f6-8454-2e2b19fa41d6
>      health HEALTH_WARN 35 pgs down; 208 pgs incomplete; 208 pgs stuck
> inactive; 208 pgs stuck unclean; 1 requests are blocked > 32 sec
>      monmap e3: 3 mons at
> {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0},
> election epoch 3000, quorum 0,1,2 mon01,mon02,mon03
>      osdmap e115315: 24 osds: 24 up, 24 in
>       pgmap v6630222: 8704 pgs, 7 pools, 6344 GB data, 1587 kobjects
>             12747 GB used, 7848 GB / 20596 GB avail
>                 8496 active+clean
>                  173 incomplete
>                   35 down+incomplete
>
> Thanks for your help,
> Chad.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



