So the "backfill_tooful" was an old state; it disappeared after I reweighted. Yesterday, I even set up the Ceph system's tunables to optimal, added one more osd, let it rebalance, and then after rebalancing, I ran a "ceph osd reweight-by-utilization 105". After several hours, though, CEPH stabilized (that is no more recovery), but the final state is worse than before. So here are my questions (I also included the results of "ceph -s" right after these questions):
1) Why are 153 PGs in "active+remapped" but not going anywhere? Shouldn't they be in something like "active+remapped+wait_backfill" instead?
2) Why are 10 PGs in "active+remapped+backfilling" when there is no actual backfill activity occurring in Ceph? Shouldn't they instead be in "active+remapped+wait_backfill+backfill_toofull"?
3) Why is there a backfill_toofull at all when my OSDs are well under 95% full -- in fact, they are all under 81% full, as reported by "df -h"? (One theory I have is that the "toofull" percentage is based NOT on the actual physical space on the OSD but on the *reweighted* physical space. Is this theory accurate?)
4) When I did a "ceph pg dump", I saw that all 153 pages that are in active+remapped have only 1 OSD in the "up" state but 2 OSDs in the "acting" state. I'm confused as to the difference between "up" and "acting" -- does this scenario mean that if I lose 1 OSD that in the "up" state, I lose data for that page? Or does the "acting" mean that the page data is still on 2 OSDs, so I can afford to lose 1 OSD.
--> ceph -s produces:
================
[root@ia2 ceph]# ceph -s
cluster 14f78538-6085-43f9-ac80-e886ca4de119
health HEALTH_WARN 10 pgs backfill; 5 pgs backfill_toofull; 10 pgs backfilling; 173 pgs stuck unclean; recovery 44940/5858368 objects degraded (0.767%)
monmap e9: 3 mons at {ia1=192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0}, election epoch 500, quorum 0,1,2 ia1,ia2,ia3
osdmap e9700: 23 osds: 23 up, 23 in
pgmap v2003396: 1500 pgs, 1 pools, 11225 GB data, 2841 kobjects
22452 GB used, 23014 GB / 45467 GB avail
44940/5858368 objects degraded (0.767%)
1327 active+clean
5 active+remapped+wait_backfill
5 active+remapped+wait_backfill+backfill_toofull
153 active+remapped
10 active+remapped+backfilling
client io 4369 kB/s rd, 64377 B/s wr, 26 op/s
==========
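In case more detail is useful, these are commands I can run and post output from (the PG id in the query command is only a placeholder, and the admin-socket path assumes the default location for osd.0):

ceph health detail                # which PGs are stuck unclean / backfill_toofull
ceph pg dump_stuck unclean        # up and acting sets for the stuck PGs
ceph pg 4.3f query                # detailed state of one PG (4.3f is a placeholder id)
ceph osd tree                     # CRUSH weights and reweight values per OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep full_ratio   # full / backfill_full thresholds on one OSD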
On Sun, Feb 23, 2014 at 8:09 PM, Gautam Saxena <gsaxena@xxxxxxxxxxx> wrote:
I have 19 PGs that are stuck unclean (see the result of "ceph -s" below). This occurred after I executed "ceph osd reweight-by-utilization 108" to resolve problems with "backfill_toofull" messages, which I believe appeared because my OSD sizes vary significantly (from a low of 600 GB to a high of 3 TB). How can I get Ceph to move these PGs out of stuck unclean? (And why is this occurring anyway?) My best guess at a fix (though I don't know why it would work) is that I need to run "ceph osd crush tunables optimal". However, my kernel version (on a fully up-to-date CentOS 6.5) is 2.6.32, which is well below the minimum required version of 3.6 stated in the documentation (http://ceph.com/docs/master/rados/operations/crush-map/) -- so if I must run "ceph osd crush tunables optimal" to fix this problem, I presume I must upgrade my kernel first, right? Any thoughts, or am I chasing the wrong solution? (I want to avoid a kernel upgrade unless it's needed.)
=====================
[root@ia2 ceph4]# ceph -s
cluster 14f78538-6085-43f9-ac80-e886ca4de119
health HEALTH_WARN 19 pgs backfilling; 19 pgs stuck unclean; recovery 42959/5511127 objects degraded (0.779%)
monmap e9: 3 mons at {ia1=192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0}, election epoch 496, quorum 0,1,2 ia1,ia2,ia3
osdmap e7931: 23 osds: 23 up, 23 in
pgmap v1904820: 1500 pgs, 1 pools, 10531 GB data, 2670 kobjects
18708 GB used, 26758 GB / 45467 GB avail
42959/5511127 objects degraded (0.779%)
1481 active+clean
19 active+remapped+backfilling
client io 1457 B/s wr, 0 op/s
[root@ia2 ceph4]# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
[root@ia2 ceph4]# uname -r
2.6.32-431.3.1.el6.x86_64
====
Gautam Saxena
President & CEO
Integrated Analysis Inc.
Making Sense of Data.™
Biomarker Discovery Software | Bioinformatics Services | Data Warehouse Consulting | Data Migration Consulting
(301) 760-3077 office
(240) 479-4272 direct
(301) 560-3463 fax