On Monday, November 03, 2014 17:34:06 you wrote:
> If you have osds that are close to full, you may be hitting 9626. I
> pushed a branch based on v0.80.7 with the fix, wip-v0.80.7-9626.
> -Sam

Thanks Sam, I may have been hitting that as well. I certainly hit too_full
conditions often. I am able to squeeze PGs off of the too_full OSD by
reweighting, and eventually all PGs get to where they want to be. It is kind
of silly that I have to do this manually, though. Could Ceph order the PG
movements better? (Is this what your bug fix does, in effect?)

So, at the moment there are no PGs moving around the cluster, but not all of
them are active+clean. Also, there is one OSD which has blocked requests. The
OSD seems idle, and restarting it just results in a younger blocked request.

~# ceph -s
    cluster 7797e50e-f4b3-42f6-8454-2e2b19fa41d6
     health HEALTH_WARN 35 pgs down; 208 pgs incomplete; 210 pgs stuck inactive; 210 pgs stuck unclean; 1 requests are blocked > 32 sec
     monmap e3: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}, election epoch 2996, quorum 0,1,2 mon01,mon02,mon03
     osdmap e115306: 24 osds: 24 up, 24 in
      pgmap v6630195: 8704 pgs, 7 pools, 6344 GB data, 1587 kobjects
            12747 GB used, 7848 GB / 20596 GB avail
                   2 inactive
                8494 active+clean
                 173 incomplete
                  35 down+incomplete

# ceph health detail
...
1 ops are blocked > 8388.61 sec
1 ops are blocked > 8388.61 sec on osd.15
1 osds have slow requests

From the log of the OSD with the blocked request (osd.15):

2014-11-04 08:57:26.851583 7f7686331700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 3840.430247 secs
2014-11-04 08:57:26.851593 7f7686331700 0 log [WRN] : slow request 3840.430247 seconds old, received at 2014-11-04 07:53:26.421301: osd_op(client.11334078.1:592 rb.0.206609.238e1f29.0000000752e8 [read 512~512] 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg

Other requests (like PG scrubs) are completing on this OSD without taking a
long time. Also, this was one of the OSDs which I completely drained, removed
from Ceph, reformatted, and created again using ceph-deploy, so it was
created entirely by firefly 0.80.7 code.
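For reference, the manual shuffle I described above looks roughly like this
(the OSD id and weight values here are only placeholders, not the actual ones
from my cluster):

~# ceph health detail | grep toofull    # see which OSDs are refusing backfill
~# ceph osd reweight 7 0.85             # nudge PGs off the too-full OSD
   ... wait for backfill to drain it ...
~# ceph osd reweight 7 1.0              # restore the weight afterwards

(I believe ceph osd reweight-by-utilization automates roughly this.) If it
would help, I can also dump the stuck op from osd.15's admin socket,
something like (assuming the default socket path):

~# ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok dump_ops_in_flight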
As Greg requested, output of ceph scrub:

2014-11-04 09:25:58.761602 7f6c0e20b700 0 mon.mon01@0(leader) e3 handle_command mon_command({"prefix": "scrub"} v 0) v1
2014-11-04 09:26:21.320043 7f6c0ea0c700 1 mon.mon01@0(leader).paxos(paxos updating c 11563072..11563575) accept timeout, calling fresh election
2014-11-04 09:26:31.264873 7f6c0ea0c700 0 mon.mon01@0(probing).data_health(2996) update_stats avail 38% total 6948572 used 3891232 avail 2681328
2014-11-04 09:26:33.529403 7f6c0e20b700 0 log [INF] : mon.mon01 calling new monitor election
2014-11-04 09:26:33.538286 7f6c0e20b700 1 mon.mon01@0(electing).elector(2996) init, last seen epoch 2996
2014-11-04 09:26:38.809212 7f6c0ea0c700 0 log [INF] : mon.mon01@0 won leader election with quorum 0,2
2014-11-04 09:26:40.215095 7f6c0e20b700 0 log [INF] : monmap e3: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}
2014-11-04 09:26:40.215754 7f6c0e20b700 0 log [INF] : pgmap v6630201: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:40.215913 7f6c0e20b700 0 log [INF] : mdsmap e1: 0/0/1 up
2014-11-04 09:26:40.216621 7f6c0e20b700 0 log [INF] : osdmap e115306: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.227010 7f6c0e20b700 0 log [INF] : pgmap v6630202: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:41.367373 7f6c0e20b700 1 mon.mon01@0(leader).osd e115307 e115307: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.437706 7f6c0e20b700 0 log [INF] : osdmap e115307: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.471558 7f6c0e20b700 0 log [INF] : pgmap v6630203: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:41.497318 7f6c0e20b700 1 mon.mon01@0(leader).osd e115308 e115308: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.533965 7f6c0e20b700 0 log [INF] : osdmap e115308: 24 osds: 24 up, 24 in
2014-11-04 09:26:41.553161 7f6c0e20b700 0 log [INF] : pgmap v6630204: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:42.701720 7f6c0e20b700 1 mon.mon01@0(leader).osd e115309 e115309: 24 osds: 24 up, 24 in
2014-11-04 09:26:42.953977 7f6c0e20b700 0 log [INF] : osdmap e115309: 24 osds: 24 up, 24 in
2014-11-04 09:26:45.776411 7f6c0e20b700 0 log [INF] : pgmap v6630205: 8704 pgs: 2 inactive, 8494 active+clean, 173 incomplete, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:46.767534 7f6c0e20b700 1 mon.mon01@0(leader).osd e115310 e115310: 24 osds: 24 up, 24 in
2014-11-04 09:26:46.817764 7f6c0e20b700 0 log [INF] : osdmap e115310: 24 osds: 24 up, 24 in
2014-11-04 09:26:47.593483 7f6c0e20b700 0 log [INF] : pgmap v6630206: 8704 pgs: 2 inactive, 8489 active+clean, 1 peering, 173 incomplete, 4 remapped, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:48.170586 7f6c0e20b700 0 log [INF] : pgmap v6630207: 8704 pgs: 2 inactive, 8489 active+clean, 1 peering, 173 incomplete, 4 remapped, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:48.381781 7f6c0e20b700 1 mon.mon01@0(leader).osd e115311 e115311: 24 osds: 24 up, 24 in
2014-11-04 09:26:48.484570 7f6c0e20b700 0 log [INF] : osdmap e115311: 24 osds: 24 up, 24 in
2014-11-04 09:26:48.857188 7f6c0e20b700 1 mon.mon01@0(leader).log v4718896 check_sub sending message to client.11353722 128.104.164.197:0/1007270 with 1 entries (version 4718896)
2014-11-04 09:26:50.565461 7f6c0e20b700 0 log [INF] : pgmap v6630208: 8704 pgs: 8491 active+clean, 1 peering, 173 incomplete, 4 remapped, 35 down+incomplete; 6344 GB data, 12747 GB used, 7848 GB / 20596 GB avail
2014-11-04 09:26:51.432688 7f6c0e20b700 1 mon.mon01@0(leader).log v4718897 check_sub sending message to client.11353722 128.104.164.197:0/1007270 with 3 entries (version 4718897)
2014-11-04 09:26:51.476778 7f6c0e20b700 1 mon.mon01@0(leader).osd e115312 e115312: 24 osds: 24 up, 24 in

[... not sure how much to include ...]

Looks like that cleared up two inactive PGs...

# ceph -s
    cluster 7797e50e-f4b3-42f6-8454-2e2b19fa41d6
     health HEALTH_WARN 35 pgs down; 208 pgs incomplete; 208 pgs stuck inactive; 208 pgs stuck unclean; 1 requests are blocked > 32 sec
     monmap e3: 3 mons at {mon01=128.104.164.197:6789/0,mon02=128.104.164.198:6789/0,mon03=144.92.180.139:6789/0}, election epoch 3000, quorum 0,1,2 mon01,mon02,mon03
     osdmap e115315: 24 osds: 24 up, 24 in
      pgmap v6630222: 8704 pgs, 7 pools, 6344 GB data, 1587 kobjects
            12747 GB used, 7848 GB / 20596 GB avail
                8496 active+clean
                 173 incomplete
                  35 down+incomplete

Thanks for your help,
Chad.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com