Re: PGs stuck active+remapped and osds lose data?!


> pg 9.7 is stuck unclean for 512936.160212, current state active+remapped, last acting [7,3,0]
> pg 7.84 is stuck unclean for 512623.894574, current state active+remapped, last acting [4,8,1]
> pg 8.1b is stuck unclean for 513164.616377, current state active+remapped, last acting [4,7,2]
> pg 7.7a is stuck unclean for 513162.316328, current state active+remapped, last acting [7,4,2]

Please execute:

for pg in 9.7 7.84 8.1b 7.7a;do ceph pg $pg query; done
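The full query output is quite verbose; if you just want the state and the up/acting sets for each PG, a short sketch like this can pull them out (the JSON below is a hand-made sample shaped like the relevant top-level keys of `ceph pg query` output, not real cluster data):

```python
import json

# Hand-made sample: "state", "up" and "acting" are top-level keys in
# `ceph pg query` JSON output; everything else is omitted here.
sample_output = """
{
  "state": "active+remapped",
  "up": [7, 3],
  "acting": [7, 3, 0]
}
"""

def summarize(query_json):
    """Return (state, up set, acting set) from a pg query result."""
    info = json.loads(query_json)
    return info["state"], info["up"], info["acting"]

state, up, acting = summarize(sample_output)
# A remapped PG typically has an acting set that differs from its up set.
print(state, up, acting)
```

For a stuck active+remapped PG, comparing `up` against `acting` usually shows where CRUSH failed to map a full replica set.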

Regards,

On Tue, Jan 10, 2017 at 7:31 AM, Christian Wuerdig
<christian.wuerdig@xxxxxxxxx> wrote:
>
>
> On Tue, Jan 10, 2017 at 10:22 AM, Marcus Müller <mueller.marcus@xxxxxxxxx>
> wrote:
>>
>> Trying google with "ceph pg stuck in active and remapped" points to a
>> couple of posts on this ML, typically indicating that it's a problem with the
>> CRUSH map and ceph being unable to satisfy the mapping rules. Your ceph -s
>> output indicates that you're using replication of size 3 in your pools. You
>> also said you had a custom CRUSH map - can you post it?
>>
>>
>> I’ve sent the file to you, since I’m not sure if it contains sensitive
>> data. Yes, I have replication of 3, and I did not customize the map myself.
>
>
> I received your map, but I'm not familiar enough with the details to give any
> particular advice on this - I just suggested posting your map in case
> someone more familiar with the CRUSH details might be able to spot
> something. Brad just provided a pointer, so that would be useful to try.
>
>>
>>
>>
>> I might be missing something here, but I don't quite see how you come to
>> this statement. ceph osd df and ceph -s both show 16093 GB used and 39779 GB
>> out of 55872 GB available. The sum of the first 3 OSDs' used space is, as you
>> stated, 6181 GB, which is approx 38.4% - quite close to your target of 33%.
>>
>>
>> Maybe I have to explain it another way:
>>
>> Directly after finishing the backfill I received this output:
>>
>>      health HEALTH_WARN
>>             4 pgs stuck unclean
>>             recovery 1698/58476648 objects degraded (0.003%)
>>             recovery 418137/58476648 objects misplaced (0.715%)
>>             noscrub,nodeep-scrub flag(s) set
>>      monmap e9: 5 mons at
>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>             election epoch 464, quorum 0,1,2,3,4
>> ceph1,ceph2,ceph3,ceph4,ceph5
>>      osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
>>             flags noscrub,nodeep-scrub
>>       pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
>>             16093 GB used, 39779 GB / 55872 GB avail
>>             1698/58476648 objects degraded (0.003%)
>>             418137/58476648 objects misplaced (0.715%)
>>                  316 active+clean
>>                    4 active+remapped
>>   client io 757 kB/s rd, 1 op/s
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>  0 1.28899  1.00000  3724G  1924G  1799G 51.67 1.79
>>  1 1.57899  1.00000  3724G  2143G  1580G 57.57 2.00
>>  2 1.68900  1.00000  3724G  2114G  1609G 56.78 1.97
>>  3 6.78499  1.00000  7450G  1234G  6215G 16.57 0.58
>>  4 8.39999  1.00000  7450G  1221G  6228G 16.40 0.57
>>  5 9.51500  1.00000  7450G  1232G  6217G 16.54 0.57
>>  6 7.66499  1.00000  7450G  1258G  6191G 16.89 0.59
>>  7 9.75499  1.00000  7450G  2482G  4967G 33.33 1.16
>>  8 9.32999  1.00000  7450G  2480G  4969G 33.30 1.16
>>               TOTAL 55872G 16093G 39779G 28.80
>> MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54
>>
>> Here we can see that the cluster is using 4809 GB of data and has 16093 GB
>> raw used - or, put the other way, only 39779 GB available.
>>
>> Two days later I saw:
>>
>>      health HEALTH_WARN
>>             4 pgs stuck unclean
>>             recovery 3486/58726035 objects degraded (0.006%)
>>             recovery 420024/58726035 objects misplaced (0.715%)
>>             noscrub,nodeep-scrub flag(s) set
>>      monmap e9: 5 mons at
>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>             election epoch 478, quorum 0,1,2,3,4
>> ceph1,ceph2,ceph3,ceph4,ceph5
>>      osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>             flags noscrub,nodeep-scrub
>>       pgmap v9969059: 320 pgs, 3 pools, 4830 GB data, 19116 kobjects
>>             15150 GB used, 40722 GB / 55872 GB avail
>>             3486/58726035 objects degraded (0.006%)
>>             420024/58726035 objects misplaced (0.715%)
>>                  316 active+clean
>>                    4 active+remapped
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>  0 1.28899  1.00000  3724G  1696G  2027G 45.56 1.68
>>  1 1.57899  1.00000  3724G  1705G  2018G 45.80 1.69
>>  2 1.68900  1.00000  3724G  1794G  1929G 48.19 1.78
>>  3 6.78499  1.00000  7450G  1239G  6210G 16.64 0.61
>>  4 8.39999  1.00000  7450G  1226G  6223G 16.46 0.61
>>  5 9.51500  1.00000  7450G  1237G  6212G 16.61 0.61
>>  6 7.66499  1.00000  7450G  1263G  6186G 16.96 0.63
>>  7 9.75499  1.00000  7450G  2493G  4956G 33.47 1.23
>>  8 9.32999  1.00000  7450G  2491G  4958G 33.44 1.23
>>               TOTAL 55872G 15150G 40722G 27.12
>> MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54
>>
>>
>> As you can now see, we are using 4830 GB of data BUT raw used is only 15150
>> GB - or, put the other way, we now have 40722 GB free. You can see the
>> change in the %USE of the osds. To me this looks like some data was
>> lost, since ceph did not do any backfill or other operation. That’s the
>> problem...
>>
>
> Ok, that output is indeed a bit different. However, as you should note, the
> actual data stored in the cluster goes from 4809 to 4830 GB. 4830 * 3 is
> only 14490 GB, so currently the cluster is using a bit more raw space than
> strictly necessary. My guess would be that the data gets migrated to the
> new OSDs first before being deleted from the old OSDs, and as such it
> transiently uses up more space. I'm pretty sure you didn't lose any data.
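As a quick sanity check of that argument, here is the arithmetic with the numbers from the two status outputs quoted above (a back-of-the-envelope sketch, assuming a plain 3x replication factor for all pools):

```python
# Numbers (in GB) taken from the two `ceph -s` outputs quoted above.
data_before, raw_before = 4809, 16093   # right after backfill
data_after,  raw_after  = 4830, 15150   # two days later
replication = 3

# Minimum raw space needed to hold the logical data at 3x replication.
min_before = data_before * replication  # 14427 GB
min_after  = data_after  * replication  # 14490 GB

# Raw usage above that minimum: leftover copies from backfill that get
# cleaned up over time, which is why "used" shrinks without data loss.
extra_before = raw_before - min_before  # 1666 GB
extra_after  = raw_after  - min_after   #  660 GB
print(extra_before, extra_after)
```

Both snapshots hold at least three full copies of the data; only the transient backfill overhead shrank.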
>
>>
>>
>> Am 09.01.2017 um 21:55 schrieb Christian Wuerdig
>> <christian.wuerdig@xxxxxxxxx>:
>>
>>
>>
>> On Tue, Jan 10, 2017 at 8:23 AM, Marcus Müller <mueller.marcus@xxxxxxxxx>
>> wrote:
>>>
>>> Hi all,
>>>
>>> Recently I added a new node with new osds to my cluster, which of course
>>> resulted in backfilling. At the end, there are 4 pgs left in the state
>>> active+remapped and I don’t know what to do.
>>>
>>> Here is what my cluster currently looks like:
>>>
>>> ceph -s
>>>      health HEALTH_WARN
>>>             4 pgs stuck unclean
>>>             recovery 3586/58734009 objects degraded (0.006%)
>>>             recovery 420074/58734009 objects misplaced (0.715%)
>>>             noscrub,nodeep-scrub flag(s) set
>>>      monmap e9: 5 mons at
>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>             election epoch 478, quorum 0,1,2,3,4
>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>      osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>             flags noscrub,nodeep-scrub
>>>       pgmap v9970276: 320 pgs, 3 pools, 4831 GB data, 19119 kobjects
>>>             15152 GB used, 40719 GB / 55872 GB avail
>>>             3586/58734009 objects degraded (0.006%)
>>>             420074/58734009 objects misplaced (0.715%)
>>>                  316 active+clean
>>>                    4 active+remapped
>>>   client io 643 kB/s rd, 7 op/s
>>>
>>> # ceph osd df
>>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>>  0 1.28899  1.00000  3724G  1697G  2027G 45.57 1.68
>>>  1 1.57899  1.00000  3724G  1706G  2018G 45.81 1.69
>>>  2 1.68900  1.00000  3724G  1794G  1929G 48.19 1.78
>>>  3 6.78499  1.00000  7450G  1240G  6209G 16.65 0.61
>>>  4 8.39999  1.00000  7450G  1226G  6223G 16.47 0.61
>>>  5 9.51500  1.00000  7450G  1237G  6212G 16.62 0.61
>>>  6 7.66499  1.00000  7450G  1264G  6186G 16.97 0.63
>>>  7 9.75499  1.00000  7450G  2494G  4955G 33.48 1.23
>>>  8 9.32999  1.00000  7450G  2491G  4958G 33.45 1.23
>>>               TOTAL 55872G 15152G 40719G 27.12
>>> MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54
>>>
>>> # ceph health detail
>>> HEALTH_WARN 4 pgs stuck unclean; recovery 3586/58734015 objects degraded
>>> (0.006%); recovery 420074/58734015 objects misplaced (0.715%);
>>> noscrub,nodeep-scrub flag(s) set
>>> pg 9.7 is stuck unclean for 512936.160212, current state active+remapped,
>>> last acting [7,3,0]
>>> pg 7.84 is stuck unclean for 512623.894574, current state
>>> active+remapped, last acting [4,8,1]
>>> pg 8.1b is stuck unclean for 513164.616377, current state
>>> active+remapped, last acting [4,7,2]
>>> pg 7.7a is stuck unclean for 513162.316328, current state
>>> active+remapped, last acting [7,4,2]
>>> recovery 3586/58734015 objects degraded (0.006%)
>>> recovery 420074/58734015 objects misplaced (0.715%)
>>> noscrub,nodeep-scrub flag(s) set
>>>
>>> # ceph osd tree
>>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> -1 56.00693 root default
>>> -2  1.28899     host ceph1
>>>  0  1.28899         osd.0       up  1.00000          1.00000
>>> -3  1.57899     host ceph2
>>>  1  1.57899         osd.1       up  1.00000          1.00000
>>> -4  1.68900     host ceph3
>>>  2  1.68900         osd.2       up  1.00000          1.00000
>>> -5 32.36497     host ceph4
>>>  3  6.78499         osd.3       up  1.00000          1.00000
>>>  4  8.39999         osd.4       up  1.00000          1.00000
>>>  5  9.51500         osd.5       up  1.00000          1.00000
>>>  6  7.66499         osd.6       up  1.00000          1.00000
>>> -6 19.08498     host ceph5
>>>  7  9.75499         osd.7       up  1.00000          1.00000
>>>  8  9.32999         osd.8       up  1.00000          1.00000
>>>
>>> I’m using a customized crushmap because, as you can see, this cluster is
>>> not very optimal. Ceph1, ceph2 and ceph3 are vms on one physical host;
>>> ceph4 and ceph5 are both separate physical hosts. So the idea is to place
>>> 33% of the data on ceph1, ceph2 and ceph3 together and 66% on each of
>>> ceph4 and ceph5.
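For what it's worth, if placement were driven purely by the CRUSH weights in the tree above, the split would look quite different from that 33/66 target, so the custom rules rather than the weights must be what enforces it. A quick check of the weight fractions (host weights copied from the ceph osd tree output):

```python
# Host CRUSH weights copied from the `ceph osd tree` output above.
host_weights = {
    "ceph1": 1.28899,
    "ceph2": 1.57899,
    "ceph3": 1.68900,
    "ceph4": 32.36497,
    "ceph5": 19.08498,
}
total = sum(host_weights.values())

# Fraction each host would get if weights alone decided placement.
vm_hosts = host_weights["ceph1"] + host_weights["ceph2"] + host_weights["ceph3"]
print(f"ceph1-3 combined: {vm_hosts / total:.1%}")      # ~8.1%
print(f"ceph4: {host_weights['ceph4'] / total:.1%}")    # ~57.8%
print(f"ceph5: {host_weights['ceph5'] / total:.1%}")    # ~34.1%
```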
>>>
>>> Everything went fine with the backfilling, but now those 4 pgs have been
>>> stuck active+remapped for 2 days while the degraded objects increase.
>>>
>>> I restarted all osds one after another, but this did not really help.
>>> At first it showed no degraded objects, then the count increased again.
>>>
>>> What can I do to get those pgs back to the active+clean state? My
>>> idea was to increase the weight of an osd a little in order to make ceph
>>> recalculate the map - is this a good idea?
>>
>>
>> Trying google with "ceph pg stuck in active and remapped" points to a
>> couple of posts on this ML, typically indicating that it's a problem with the
>> CRUSH map and ceph being unable to satisfy the mapping rules. Your ceph -s
>> output indicates that you're using replication of size 3 in your pools. You
>> also said you had a custom CRUSH map - can you post it?
>>
>>>
>>>
>>> ---
>>>
>>> On the other side I saw something very strange too: After the backfill
>>> was done (2 days ago), my ceph osd df looked like this:
>>>
>>> # ceph osd df
>>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>>  0 1.28899  1.00000  3724G  1924G  1799G 51.67 1.79
>>>  1 1.57899  1.00000  3724G  2143G  1580G 57.57 2.00
>>>  2 1.68900  1.00000  3724G  2114G  1609G 56.78 1.97
>>>  3 6.78499  1.00000  7450G  1234G  6215G 16.57 0.58
>>>  4 8.39999  1.00000  7450G  1221G  6228G 16.40 0.57
>>>  5 9.51500  1.00000  7450G  1232G  6217G 16.54 0.57
>>>  6 7.66499  1.00000  7450G  1258G  6191G 16.89 0.59
>>>  7 9.75499  1.00000  7450G  2482G  4967G 33.33 1.16
>>>  8 9.32999  1.00000  7450G  2480G  4969G 33.30 1.16
>>>               TOTAL 55872G 16093G 39779G 28.80
>>> MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54
>>>
>>> While ceph -s was:
>>>
>>>      health HEALTH_WARN
>>>             4 pgs stuck unclean
>>>             recovery 1698/58476648 objects degraded (0.003%)
>>>             recovery 418137/58476648 objects misplaced (0.715%)
>>>             noscrub,nodeep-scrub flag(s) set
>>>      monmap e9: 5 mons at
>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>             election epoch 464, quorum 0,1,2,3,4
>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>      osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>             flags noscrub,nodeep-scrub
>>>       pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
>>>             16093 GB used, 39779 GB / 55872 GB avail
>>>             1698/58476648 objects degraded (0.003%)
>>>             418137/58476648 objects misplaced (0.715%)
>>>                  316 active+clean
>>>                    4 active+remapped
>>>   client io 757 kB/s rd, 1 op/s
>>>
>>>
>>> As you can see above, my ceph osd df looks completely different. This
>>> suggests that the first three osds lost data (about 1 TB) without any
>>> backfill going on. If I add up the usage of osd0, osd1 and osd2, it was
>>> 6181 GB. But there should only be around 33% there, so this would be wrong.
>>
>>
>> I might be missing something here, but I don't quite see how you come to
>> this statement. ceph osd df and ceph -s both show 16093 GB used and 39779 GB
>> out of 55872 GB available. The sum of the first 3 OSDs' used space is, as you
>> stated, 6181 GB, which is approx 38.4% - quite close to your target of 33%.
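That 38.4% figure is easy to verify from the post-backfill osd df output above:

```python
# Used space (GB) of the first three OSDs, from the post-backfill
# `ceph osd df` output quoted above.
first_three_used = 1924 + 2143 + 2114   # osd.0 + osd.1 + osd.2
total_used = 16093                      # across all nine OSDs

share = first_three_used / total_used
print(f"{first_three_used} GB of {total_used} GB = {share:.1%}")
```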
>>
>>>
>>>
>>> My question on this is: is this a bug and did I really lose important data,
>>> or is this a ceph cleanup action after the backfill?
>>>
>>> Thanks and regards,
>>> Marcus
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>



