Re: [Ceph-community] Interesting problem: 2 pgs stuck in EC pool with missing OSDs

On Mon, Apr 6, 2015 at 7:48 AM, Patrick McGarry <pmcgarry@xxxxxxxxxx> wrote:
> moving this to ceph-user where it needs to be for eyeballs and responses. :)
>
>
> On Mon, Apr 6, 2015 at 1:34 AM, Paul Evans <paul@xxxxxxxxxxxx> wrote:
>> Hello Ceph Community & thanks to anyone with advice on this interesting
>> situation...
>> ========================================================
>> The Problem: we have 2 pgs out of 6144 that are still stuck in an
>> active+remapped state, and we would like to know if there is a targeted &
>> specific way to fix this issue (other than just forcing data to re-sort in
>> the cluster in a generic re-shuffle).
>>
>> Background: We initially created a cluster of 6 ceph nodes with an EC pool &
>> profile where k=6 and m=2, but missed the configuration item
>> "ruleset-failure-domain=host" in the EC_profile (thinking it defaulted to
>> =osd). While the ceph cluster allowed us to create the pool and store data
>> in it, the fact that we had an EC data spread designed for 8 targets
>> (k+m=8) but only 6 targets (our 6 nodes) eventually caught up to us
>> and we ended up with a number of pgs missing chunks of data. Fortunately,
>> the data remained 'relatively protected' because ceph remapped the missing
>> chunks to alternate hosts, but (of course) that left the pgs in an
>> active+remapped state and no way to solve the puzzle.
>> The fix? Easy enough: add two more nodes, which we did, and *almost* all the
>> pgs re-distributed the data appropriately. Except for 4 pgs.
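
For context on the profile mistake above: the failure domain is baked into
the pool's CRUSH rule when the pool is created from the erasure-code
profile, and the default is host, not osd. A minimal sketch of setting it
explicitly up front (the profile and pool names here are illustrative, and
the pg counts are placeholders):

  # On a 6-host cluster with k+m=8, the failure domain must be osd so that
  # more than one chunk may land on a host; the default of host needs at
  # least k+m hosts.
  ceph osd erasure-code-profile set ec62 k=6 m=2 ruleset-failure-domain=osd
  ceph osd erasure-code-profile get ec62   # confirm k, m and failure domain
  ceph osd pool create ecpool <pg_num> <pgp_num> erasure ec62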


This looks like it's just the standard risk of using a pseudo-random
algorithm: you need to "randomly" map 8 pieces into 8 slots. Sometimes
the CRUSH calculation keeps returning the same 7 slots so many times in a
row that it simply fails to fill all 8 of them within the retry limits
that are currently set.

If you look through the list archives, you'll see we've discussed this a few
times, especially Loïc in the context of erasure coding. See
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
for the fix.
But I think that doc is wrong and you can change the CRUSH rule in use
without creating a new pool — right, Loïc?
-Greg
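
A side note on the pg query output further down: 2147483647 is 2^31 - 1,
the placeholder CRUSH reports for a slot it gave up on, which is exactly
the symptom described above. The workaround in the linked doc amounts to
raising CRUSH's retry budget in the erasure-coded rule; a rough sketch of
that in-place edit (the file names are arbitrary, and 100 is simply the
value the doc suggests):

  ceph osd getcrushmap -o crushmap.bin        # dump the binary CRUSH map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text

  # In crushmap.txt, inside the erasure-coded rule, add (or raise)
  #     step set_choose_tries 100
  # just before the "step take default" line, so CRUSH retries more times
  # before giving up on the last slot.

  crushtool -c crushmap.txt -o crushmap.new   # recompile
  ceph osd setcrushmap -i crushmap.new        # inject the edited map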

>> Why 4? We're
>> not sure what was unique about those four, but we were able to reduce the
>> problem pgs to just 2 (as stated in our Problem section) by doing the
>> following:
>>
>> - executed 'ceph pg repair xx.xxx', but nothing happened. After an hour...
>> - executed 'ceph pg dump_stuck' and noted that 2 of the 4 pgs had a primary
>>   OSD of 29.
>> - executed 'ceph osd set noout' and 'sudo restart ceph-osd id=29'.
>> - observed that the osd restart caused a minor shuffle of data, and actually
>>   left us with the same 4 pgs remapped PLUS 25 pgs stuck 'peering' (and, btw,
>>   not active).
>> - After a couple of hours of waiting to see if the peering issues would
>>   resolve (they didn't), we moved an 'extra' OSD out of the root holding the
>>   EC pool, which kicked off a significant shuffle of data and ended up with
>>   everything good again and only 2 pgs active+remapped. Which two?
>>   Ironically, even though we were attempting to fix the two pgs that had OSD
>>   29 as their primary by way of our osd restart attempt, only one of them
>>   repaired itself... leaving one pg still having osd.29 as its primary.
>>
>> Where Things Stand Now: we have 2 pgs that are missing an appropriate
>> OSD, and are currently remapped. Here is the (shortened) output of the pg
>> queries:
>>
>> pg_stat  objects  state            v          reported     up                                up_primary  acting                   acting_primary
>> 10.28a   1488     active+remapped  8145'3499  49904:63527  [64,73,0,32,3,59,2147483647,61]   64          [64,73,0,32,3,59,31,61]  64
>> 10.439   1455     active+remapped  8145'3423  49904:62378  [29,75,63,64,78,7,2147483647,60]  29          [29,75,63,64,78,7,8,60]  29
>>
>>
>> Our question is relatively simple (we think): how does one get a pg that is
>> built using an EC_profile to fill in a missing OSD in its 'up' definition?
>> Neither 'ceph pg repair' nor 'ceph osd repair' resolved the situation for us,
>> and just randomly forcing re-shuffles of data seems haphazard at best.
>>
>> So... does anyone have a more targeted suggestion? If so - thanks!
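
Following Greg's pointer above, one low-risk check is to test the edited
CRUSH map offline before injecting it, to confirm that a larger retry
budget really lets the rule fill all 8 slots. A sketch, assuming the
erasure rule has id 1 and crushmap.new is the recompiled map from the
earlier step:

  # Replay many sample placements against the edited map; any line printed
  # is an input for which the rule still could not produce 8 distinct OSDs.
  crushtool -i crushmap.new --test --rule 1 --num-rep 8 --show-bad-mappings

No output means every sampled placement mapped to a full set of 8 OSDs.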
>>
>>
>> Paul Evans
>>
>> Principal Architect - Daystrom Technology Group
>>
>>
>> ----------------------------
>>
>>
>> Two more notes:
>>
>> 1) we have many of these 'fault' messages in the logs and don't know if they
>> are related in some way (172.16.x.x is the cluster back-end network):
>>
>> 2015-04-05 20:09:33.362107 7f6ac06ce700  0 -- 172.16.1.5:6839/8638 >>
>> 172.16.1.7:6810/2370 pipe(0x6749a580 sd=116 :6839 s=2 pgs=77 cs=1 l=0
>> c=0x2f4851e0).fault with nothing to send, going to standby
>>
>> 2)  here is the ceph osd tree and ceph -s output:
>>
>> ceph@lab-n1:~$ ceph -s
>>     cluster 68bc69c1-1382-4c30-9bf8-480e32cc5b92
>>      health HEALTH_WARN 2 pgs stuck unclean; nodeep-scrub flag(s) set
>>      monmap e1: 3 mons at
>> {lab-n1=10.0.50.211:6789/0,lab-n2=10.0.50.212:6789/0,nc48-n3=10.0.50.213:6789/0},
>> election epoch 236, quorum 0,1,2 lab-n1,lab-n2,lab-n3
>>      osdmap e49905: 94 osds: 94 up, 94 in
>>             flags nodeep-scrub
>>       pgmap v1523516: 6144 pgs, 2 pools, 32949 GB data, 4130 kobjects
>>             85133 GB used, 258 TB / 341 TB avail
>>                 6142 active+clean
>>                    2 active+remapped
>>
>>
>> ceph@nc48-n1:~$ ceph osd tree
>> # id    weight    type name    up/down    reweight
>> -1    320.3    root default
>> -2    40.04        host lab-n1
>> 0    3.64            osd.0    up    1
>> 6    3.64            osd.6    up    1
>> 12    3.64            osd.12    up    1
>> 18    3.64            osd.18    up    1
>> 24    3.64            osd.24    up    1
>> 30    3.64            osd.30    up    1
>> 36    3.64            osd.36    up    1
>> 42    3.64            osd.42    up    1
>> 48    3.64            osd.48    up    1
>> 54    3.64            osd.54    up    1
>> 60    3.64            osd.60    up    1
>> -3    40.04        host lab-n2
>> 1    3.64            osd.1    up    1
>> 7    3.64            osd.7    up    1
>> 13    3.64            osd.13    up    1
>> 19    3.64            osd.19    up    1
>> 25    3.64            osd.25    up    1
>> 31    3.64            osd.31    up    1
>> 37    3.64            osd.37    up    1
>> 43    3.64            osd.43    up    1
>> 49    3.64            osd.49    up    1
>> 55    3.64            osd.55    up    1
>> 61    3.64            osd.61    up    1
>> -4    40.04        host lab-n3
>> 2    3.64            osd.2    up    1
>> 8    3.64            osd.8    up    1
>> 14    3.64            osd.14    up    1
>> 20    3.64            osd.20    up    1
>> 26    3.64            osd.26    up    1
>> 32    3.64            osd.32    up    1
>> 38    3.64            osd.38    up    1
>> 44    3.64            osd.44    up    1
>> 50    3.64            osd.50    up    1
>> 56    3.64            osd.56    up    1
>> 62    3.64            osd.62    up    1
>> -5    40.04        host lab-n4
>> 3    3.64            osd.3    up    1
>> 9    3.64            osd.9    up    1
>> 15    3.64            osd.15    up    1
>> 21    3.64            osd.21    up    1
>> 27    3.64            osd.27    up    1
>> 33    3.64            osd.33    up    1
>> 39    3.64            osd.39    up    1
>> 45    3.64            osd.45    up    1
>> 51    3.64            osd.51    up    1
>> 57    3.64            osd.57    up    1
>> 63    3.64            osd.63    up    1
>> -6    40.04        host lab-n5
>> 4    3.64            osd.4    up    1
>> 10    3.64            osd.10    up    1
>> 16    3.64            osd.16    up    1
>> 22    3.64            osd.22    up    1
>> 28    3.64            osd.28    up    1
>> 34    3.64            osd.34    up    1
>> 40    3.64            osd.40    up    1
>> 46    3.64            osd.46    up    1
>> 52    3.64            osd.52    up    1
>> 58    3.64            osd.58    up    1
>> 64    3.64            osd.64    up    1
>> -7    40.04        host lab-n6
>> 5    3.64            osd.5    up    1
>> 11    3.64            osd.11    up    1
>> 17    3.64            osd.17    up    1
>> 23    3.64            osd.23    up    1
>> 29    3.64            osd.29    up    1
>> 35    3.64            osd.35    up    1
>> 41    3.64            osd.41    up    1
>> 47    3.64            osd.47    up    1
>> 53    3.64            osd.53    up    1
>> 59    3.64            osd.59    up    1
>> 65    3.64            osd.65    up    1
>> -15    40.04        host lab-n7
>> 72    3.64            osd.72    up    1
>> 74    3.64            osd.74    up    1
>> 76    3.64            osd.76    up    1
>> 78    3.64            osd.78    up    1
>> 80    3.64            osd.80    up    1
>> 82    3.64            osd.82    up    1
>> 84    3.64            osd.84    up    1
>> 86    3.64            osd.86    up    1
>> 88    3.64            osd.88    up    1
>> 90    3.64            osd.90    up    1
>> 92    3.64            osd.92    up    1
>> -16    40.04        host lab-n8
>> 73    3.64            osd.73    up    1
>> 75    3.64            osd.75    up    1
>> 77    3.64            osd.77    up    1
>> 79    3.64            osd.79    up    1
>> 81    3.64            osd.81    up    1
>> 83    3.64            osd.83    up    1
>> 85    3.64            osd.85    up    1
>> 87    3.64            osd.87    up    1
>> 89    3.64            osd.89    up    1
>> 91    3.64            osd.91    up    1
>> 93    3.64            osd.93    up    1
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Ceph-community mailing list
>> Ceph-community@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>>
>
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




