On Mon, Apr 6, 2015 at 7:48 AM, Patrick McGarry <pmcgarry@xxxxxxxxxx> wrote:
> Moving this to ceph-users, where it needs to be for eyeballs and responses. :)
>
>
> On Mon, Apr 6, 2015 at 1:34 AM, Paul Evans <paul@xxxxxxxxxxxx> wrote:
>> Hello Ceph Community & thanks to anyone with advice on this
>> interesting situation...
>> ========================================================
>> The Problem: we have 2 pgs out of 6144 that are still stuck in an
>> active+remapped state, and we would like to know if there is a
>> targeted, specific way to fix this issue (other than just forcing
>> data to re-sort in the cluster in a generic re-shuffle).
>>
>> Background: We initially created a cluster of 6 Ceph nodes with an
>> EC pool & profile where k=6 and m=2, but missed the configuration
>> item "ruleset-failure-domain=host" in the EC profile (thinking it
>> defaulted to =osd). While the cluster allowed us to create the pool
>> and store data in it, the fact that we had an EC data spread
>> designed for 8 targets (k+m=8) but only 6 targets (our 6 nodes)
>> eventually caught up to us, and we ended up with a number of pgs
>> missing chunks of data. Fortunately, the data remained 'relatively
>> protected', because Ceph remapped the missing chunks to alternate
>> hosts, but (of course) that left the pgs in an active+remapped state
>> with no way to solve the puzzle.
>> The fix? Easy enough: add two more nodes, which we did, and *almost*
>> all the pgs re-distributed their data appropriately. Except for 4 pgs.

This looks like it's just the standard risk of using a pseudo-random
algorithm: you need to "randomly" map 8 pieces into 8 slots. Sometimes
the CRUSH calculation will return the same 7 slots so many times in a
row that it simply fails to get all 8 of them inside the time bounds
that are currently set. If you look through the list archives you'll
see we've discussed this a few times, especially Loïc in the context
of erasure coding. See
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
for the fix. But I think that doc is wrong, and you can change the
CRUSH rule in use without creating a new pool — right, Loïc?
-Greg
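A minimal sketch of that fix, edited into the live rule as suggested
above: raise the rule's CRUSH retry budget with set_choose_tries and
re-inject the map. The file names are arbitrary, and rule id 1 is an
assumption; check 'ceph osd crush rule dump' for the real id.

    # Grab and decompile the cluster's current CRUSH map
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # In crush.txt, inside the EC pool's rule (before its 'step take'
    # line), add or raise the retry steps:
    #     step set_chooseleaf_tries 5
    #     step set_choose_tries 100

    # Recompile, then verify CRUSH can now fill all 8 (k+m) slots
    crushtool -c crush.txt -o crush.new
    crushtool -i crush.new --test --show-bad-mappings --rule 1 --num-rep 8

    # If no bad mappings are reported, inject the new map
    ceph osd setcrushmap -i crush.new

With the larger tries budget in place, the remapped pgs should be able
to map all 8 shards and go active+clean on their own.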
>> Why 4? We're not sure what was unique about those four, but we were
>> able to reduce the problem pgs to just 2 (as stated in our Problem
>> section) by doing the following:
>>
>> - executed 'ceph pg repair xx.xxx', but nothing happened. After an hour...
>> - executed 'ceph pg dump_stuck' and noted that 2 of the 4 pgs had a
>>   primary OSD of 29.
>> - executed 'ceph osd set noout' and 'sudo restart ceph-osd id=29'
>> - observed that the osd restart caused a minor shuffle of data, and
>>   actually left us with the same 4 pgs remapped PLUS 25 pgs stuck
>>   'peering' (and, btw, not active).
>> - After a couple of hours of waiting to see if the peering issues
>>   would resolve (they didn't), we moved an 'extra' OSD out of the
>>   root holding the EC pool, which kicked off a significant shuffle
>>   of data and ended up with everything good again and only 2 pgs
>>   active+remapped.

Which two?

>> Ironically, even though we were attempting to fix the two pgs that
>> had OSD 29 as their primary by way of our osd restart attempt, only
>> one of them repaired itself... leaving one pg still having osd.29 as
>> its primary.
>>
>> Where Things Stand Now: we have 2 pgs that are missing an
>> appropriate OSD, and are currently remapped. Here is the (shortened)
>> output of the pg queries:
>>
>> pg_stat  objects  state            v          reported     up                                up_primary  acting                    acting_primary
>> 10.28a   1488     active+remapped  8145'3499  49904:63527  [64,73,0,32,3,59,2147483647,61]   64          [64,73,0,32,3,59,31,61]   64
>> 10.439   1455     active+remapped  8145'3423  49904:62378  [29,75,63,64,78,7,2147483647,60]  29          [29,75,63,64,78,7,8,60]   29
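Note that 2147483647 in the 'up' sets is 0x7fffffff, i.e.
CRUSH_ITEM_NONE: CRUSH gave up before finding an OSD for that shard
slot, which is the "gives up too soon" case described above. The
acting set fills the hole with a stand-in OSD (31 and 8 here), which
is why these pgs sit in active+remapped. The current mapping for each
pg can be re-checked at any time with:

    ceph pg map 10.28a
    ceph pg map 10.439

which prints the up and acting sets against the current osdmap epoch.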
>> Our question is relatively simple (we think): how does one get a pg
>> that is built using an EC profile to fill in a missing OSD in its
>> 'up' definition? Neither 'ceph pg repair' nor 'ceph osd repair'
>> resolved the situation for us, and just randomly forcing re-shuffles
>> of data seems haphazard at best.
>>
>> So... does anyone have a more targeted suggestion? If so - thanks!
>>
>>
>> Paul Evans
>>
>> Principal Architect - Daystrom Technology Group
>>
>>
>> ----------------------------
>>
>>
>> Two more notes:
>>
>> 1) we have many of these 'fault' messages in the logs and don't know
>> if they are related in some way (172.16.x.x is the cluster back-end
>> network):
>>
>> 2015-04-05 20:09:33.362107 7f6ac06ce700 0 -- 172.16.1.5:6839/8638 >>
>> 172.16.1.7:6810/2370 pipe(0x6749a580 sd=116 :6839 s=2 pgs=77 cs=1
>> l=0 c=0x2f4851e0).fault with nothing to send, going to standby
>>
>> 2) here is the ceph osd tree and ceph -s output:
>>
>> ceph@lab-n1:~$ ceph -s
>>     cluster 68bc69c1-1382-4c30-9bf8-480e32cc5b92
>>      health HEALTH_WARN 2 pgs stuck unclean; nodeep-scrub flag(s) set
>>      monmap e1: 3 mons at {lab-n1=10.0.50.211:6789/0,lab-n2=10.0.50.212:6789/0,nc48-n3=10.0.50.213:6789/0},
>>             election epoch 236, quorum 0,1,2 lab-n1,lab-n2,lab-n3
>>      osdmap e49905: 94 osds: 94 up, 94 in
>>             flags nodeep-scrub
>>       pgmap v1523516: 6144 pgs, 2 pools, 32949 GB data, 4130 kobjects
>>             85133 GB used, 258 TB / 341 TB avail
>>                 6142 active+clean
>>                    2 active+remapped
>>
>>
>> ceph@nc48-n1:~$ ceph osd tree
>> # id    weight  type name       up/down reweight
>> -1      320.3   root default
>> -2      40.04           host lab-n1
>> 0       3.64                    osd.0   up      1
>> 6       3.64                    osd.6   up      1
>> 12      3.64                    osd.12  up      1
>> 18      3.64                    osd.18  up      1
>> 24      3.64                    osd.24  up      1
>> 30      3.64                    osd.30  up      1
>> 36      3.64                    osd.36  up      1
>> 42      3.64                    osd.42  up      1
>> 48      3.64                    osd.48  up      1
>> 54      3.64                    osd.54  up      1
>> 60      3.64                    osd.60  up      1
>> -3      40.04           host lab-n2
>> 1       3.64                    osd.1   up      1
>> 7       3.64                    osd.7   up      1
>> 13      3.64                    osd.13  up      1
>> 19      3.64                    osd.19  up      1
>> 25      3.64                    osd.25  up      1
>> 31      3.64                    osd.31  up      1
>> 37      3.64                    osd.37  up      1
>> 43      3.64                    osd.43  up      1
>> 49      3.64                    osd.49  up      1
>> 55      3.64                    osd.55  up      1
>> 61      3.64                    osd.61  up      1
>> -4      40.04           host lab-n3
>> 2       3.64                    osd.2   up      1
>> 8       3.64                    osd.8   up      1
>> 14      3.64                    osd.14  up      1
>> 20      3.64                    osd.20  up      1
>> 26      3.64                    osd.26  up      1
>> 32      3.64                    osd.32  up      1
>> 38      3.64                    osd.38  up      1
>> 44      3.64                    osd.44  up      1
>> 50      3.64                    osd.50  up      1
>> 56      3.64                    osd.56  up      1
>> 62      3.64                    osd.62  up      1
>> -5      40.04           host lab-n4
>> 3       3.64                    osd.3   up      1
>> 9       3.64                    osd.9   up      1
>> 15      3.64                    osd.15  up      1
>> 21      3.64                    osd.21  up      1
>> 27      3.64                    osd.27  up      1
>> 33      3.64                    osd.33  up      1
>> 39      3.64                    osd.39  up      1
>> 45      3.64                    osd.45  up      1
>> 51      3.64                    osd.51  up      1
>> 57      3.64                    osd.57  up      1
>> 63      3.64                    osd.63  up      1
>> -6      40.04           host lab-n5
>> 4       3.64                    osd.4   up      1
>> 10      3.64                    osd.10  up      1
>> 16      3.64                    osd.16  up      1
>> 22      3.64                    osd.22  up      1
>> 28      3.64                    osd.28  up      1
>> 34      3.64                    osd.34  up      1
>> 40      3.64                    osd.40  up      1
>> 46      3.64                    osd.46  up      1
>> 52      3.64                    osd.52  up      1
>> 58      3.64                    osd.58  up      1
>> 64      3.64                    osd.64  up      1
>> -7      40.04           host lab-n6
>> 5       3.64                    osd.5   up      1
>> 11      3.64                    osd.11  up      1
>> 17      3.64                    osd.17  up      1
>> 23      3.64                    osd.23  up      1
>> 29      3.64                    osd.29  up      1
>> 35      3.64                    osd.35  up      1
>> 41      3.64                    osd.41  up      1
>> 47      3.64                    osd.47  up      1
>> 53      3.64                    osd.53  up      1
>> 59      3.64                    osd.59  up      1
>> 65      3.64                    osd.65  up      1
>> -15     40.04           host lab-n7
>> 72      3.64                    osd.72  up      1
>> 74      3.64                    osd.74  up      1
>> 76      3.64                    osd.76  up      1
>> 78      3.64                    osd.78  up      1
>> 80      3.64                    osd.80  up      1
>> 82      3.64                    osd.82  up      1
>> 84      3.64                    osd.84  up      1
>> 86      3.64                    osd.86  up      1
>> 88      3.64                    osd.88  up      1
>> 90      3.64                    osd.90  up      1
>> 92      3.64                    osd.92  up      1
>> -16     40.04           host lab-n8
>> 73      3.64                    osd.73  up      1
>> 75      3.64                    osd.75  up      1
>> 77      3.64                    osd.77  up      1
>> 79      3.64                    osd.79  up      1
>> 81      3.64                    osd.81  up      1
>> 83      3.64                    osd.83  up      1
>> 85      3.64                    osd.85  up      1
>> 87      3.64                    osd.87  up      1
>> 89      3.64                    osd.89  up      1
>> 91      3.64                    osd.91  up      1
>> 93      3.64                    osd.93  up      1
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com || http://community.redhat.com
> @scuttlemonkey || @ceph