So the "backfill_tooful" was an old state; it disappeared after I reweighted. Yesterday, I even set up the Ceph system's tunables to optimal, added one more osd, let it rebalance, and then after rebalancing, I ran a "ceph osd reweight-by-utilization 105". After several hours, though, CEPH stabilized (that is no more recovery), but the final state is worse than before. So here are my questions (I also included the results of "ceph -s" right after these questions):
1) Why are 153 PGs in "active+remapped" but not going anywhere? Shouldn't they be in something like "active+remapped+wait_backfill" instead?
2) Why are 10 PGs in "active+remapped+backfilling" when there is no actual backfill activity occurring in Ceph? Shouldn't they instead be in "active+remapped+wait_backfill+backfill_toofull"?
3) Why is there a backfill_toofull at all when my OSDs are well under 95% full -- in fact, they are all under 81% full, as reported by "df -h"? (One theory I have is that the "toofull" percentage is based NOT on the actual physical space on the OSD but on the *reweighted* physical space. Is this theory accurate?)
4) When I did a "ceph pg dump", I saw that all 153 pages that are in active+remapped have only 1 OSD in the "up" state but 2 OSDs in the "acting" state. I'm confused as to the difference between "up" and "acting" -- does this scenario mean that if I lose 1 OSD that in the "up" state, I lose data for that page? Or does the "acting" mean that the page data is still on 2 OSDs, so I can afford to lose 1 OSD.
--> ceph -s produces:
================
[root@ia2 ceph]# ceph -s
cluster 14f78538-6085-43f9-ac80-e886ca4de119
health HEALTH_WARN 10 pgs backfill; 5 pgs backfill_toofull; 10 pgs backfilling; 173 pgs stuck unclean; recovery 44940/5858368 objects degraded (0.767%)
monmap e9: 3 mons at {ia1=192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0}, election epoch 500, quorum 0,1,2 ia1,ia2,ia3
osdmap e9700: 23 osds: 23 up, 23 in
pgmap v2003396: 1500 pgs, 1 pools, 11225 GB data, 2841 kobjects
22452 GB used, 23014 GB / 45467 GB avail
44940/5858368 objects degraded (0.767%)
1327 active+clean
5 active+remapped+wait_backfill
5 active+remapped+wait_backfill+backfill_toofull
153 active+remapped
10 active+remapped+backfilling
client io 4369 kB/s rd, 64377 B/s wr, 26 op/s
==========
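In case more detail is useful, these are commands I can run and post output from (the PG id in the query command is only a placeholder, and the admin-socket path assumes the default location for osd.0):

ceph health detail                # which PGs are stuck unclean / backfill_toofull
ceph pg dump_stuck unclean        # up and acting sets for the stuck PGs
ceph pg 4.3f query                # detailed state of one PG (4.3f is a placeholder id)
ceph osd tree                     # CRUSH weights and reweight values per OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep full_ratio   # full / backfill_full thresholds on one OSD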
On Sun, Feb 23, 2014 at 8:09 PM, Gautam Saxena <gsaxena@xxxxxxxxxxx> wrote:
I have 19 PGs that are stuck unclean (see the result of "ceph -s" below). This occurred after I executed "ceph osd reweight-by-utilization 108" to resolve problems with "backfill_toofull" messages, which I believe appeared because my OSD sizes vary significantly (from a low of 600 GB to a high of 3 TB). How can I get Ceph to move these PGs out of stuck unclean? (And why is this occurring anyway?) My best guess at a fix (though I don't know why it would work) is that I need to run "ceph osd crush tunables optimal". However, my kernel version (on a fully up-to-date CentOS 6.5) is 2.6.32, which is well below the minimum required version of 3.6 stated in the documentation (http://ceph.com/docs/master/rados/operations/crush-map/) -- so if I must run "ceph osd crush tunables optimal" to fix this problem, I presume I must upgrade my kernel first, right? Any thoughts, or am I chasing the wrong solution? (I want to avoid a kernel upgrade unless it's needed.)
=====================
[root@ia2 ceph4]# ceph -s
cluster 14f78538-6085-43f9-ac80-e886ca4de119
health HEALTH_WARN 19 pgs backfilling; 19 pgs stuck unclean; recovery 42959/5511127 objects degraded (0.779%)
monmap e9: 3 mons at {ia1=192.168.1.11:6789/0,ia2=192.168.1.12:6789/0,ia3=192.168.1.13:6789/0}, election epoch 496, quorum 0,1,2 ia1,ia2,ia3
osdmap e7931: 23 osds: 23 up, 23 in
pgmap v1904820: 1500 pgs, 1 pools, 10531 GB data, 2670 kobjects
18708 GB used, 26758 GB / 45467 GB avail
42959/5511127 objects degraded (0.779%)
1481 active+clean
19 active+remapped+backfilling
client io 1457 B/s wr, 0 op/s
[root@ia2 ceph4]# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
[root@ia2 ceph4]# uname -r
2.6.32-431.3.1.el6.x86_64
====
Gautam Saxena
President & CEO
Integrated Analysis Inc.
Making Sense of Data.™
Biomarker Discovery Software | Bioinformatics Services | Data Warehouse Consulting | Data Migration Consulting
(301) 760-3077 office
(240) 479-4272 direct
(301) 560-3463 fax