Re: PGs stuck active+remapped and osds lose data?!

Hi

I'm the author of the mentioned thread on ceph-devel.

The second to last reply in that thread (http://marc.info/?l=ceph-devel&m=148396739308208&w=2) mentions what I suspected was the cause:

Improper balancing of the entire cluster (2 new nodes had double the capacity of the original cluster) and an attempt to reweight by usage caused the stuck+unclean PGs, because CRUSH could no longer find an OSD to place the PG on once the OSD weight became too low.

On IRC, that suspicion was confirmed by Samuel Just (who replied to the thread) as indeed being the cause.


Marcus Müller:

A very quick look at the output from "ceph osd df" in your first mail shows some _very_ strange CRUSH weights:
osd.0 has a CRUSH weight of 1.28899 while osd.5 has a CRUSH weight of 9.51500.

You do realise that the CRUSH weight should be a representation of the OSD's capacity in TB, right?

That means that osd.0 should be able to hold 1.2TB of data while osd.5 should be able to hold 9.5TB of data - or at least that's how CRUSH sees it - and that is its basis for attempting to balance things out (and it fails to do so..).
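
To put numbers on it - a rough sketch only, based on the SIZE column from your "ceph osd df" (3724G is roughly 3.64 TiB, 7450G roughly 7.28 TiB); substitute the real capacities of your disks:

# ceph osd crush reweight osd.0 3.64
# ceph osd crush reweight osd.5 7.28

and so on for the remaining OSDs.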

You should change the CRUSH weights to reflect the real capacity of your OSDs across all nodes.
Mind that this WILL - and I cannot stress this enough! - make a lot of data move around!!
It could also kill any client IO unless you lower the number of backfills allowed to run from the same OSD at once (google it..).
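
One simple way to throttle that data movement before touching the weights - again only a sketch, and the right values depend on how much client IO you need to protect - is to inject lower backfill/recovery limits into the running OSDs:

# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

You can raise them again once the rebalance has settled.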

Lowering the CRUSH weight from a high number to a lower number can and WILL also raise your used capacity - be careful that you do not end up with a full cluster, or risk a single OSD failure pushing the cluster to full or over capacity.

A good explanation of what CRUSH weight and OSD weight represent for CRUSH can be seen here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040961.html


Once your CRUSH weights are correct you can try lowering the OSD weight a bit here and there; too low a setting can again cause stuck+unclean PGs, but at least you will hopefully end up with a better balance across all your OSDs.
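
If you go that route, the OSD (reweight) weight is a value between 0 and 1 that you adjust per OSD - something like this, with the value purely illustrative:

# ceph osd reweight 0 0.95

Small steps are safer; every change will move some data around.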

You MIGHT have to raise the tunable choose_total_tries a bit up from the default of 50 once you start fiddling with the OSD weights, but for starters I REALLY REALLY _REALLY_ recommend you fix those strange CRUSH weights before trying anything else - and make sure you are using a somewhat sane set of tunables appropriate for the version of Ceph you are running.
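
You can see the tunables currently in effect (including the current choose_total_tries value) with:

# ceph osd crush show-tunables

Just be aware that switching to a different tunables profile will also trigger a lot of data movement, so check before you change anything.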

Regards,
Jens Dueholm Christensen 
Rambøll Survey IT

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Brad Hubbard
Sent: Monday, January 09, 2017 10:51 PM
To: Marcus Müller
Cc: Ceph Users
Subject: Re:  PGs stuck active+remapped and osds lose data?!

There is currently a thread about this very issue on the ceph-devel
mailing list (check the archives for "PG stuck unclean after
rebalance-by-weight" in the last few days).

Have a read of http://www.anchor.com.au/blog/2013/02/pulling-apart-cephs-crush-algorithm/
and try bumping choose_total_tries up to 100 to begin with.
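
A minimal sketch of how that can be done (file names here are just placeholders): dump and decompile the CRUSH map, edit the choose_total_tries tunable, then recompile and inject it back:

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
  (edit crushmap.txt and set "tunable choose_total_tries 100")
# crushtool -c crushmap.txt -o crushmap.new
# ceph osd setcrushmap -i crushmap.new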

On Tue, Jan 10, 2017 at 7:22 AM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
> Trying google with "ceph pg stuck in active and remapped" points to a couple
> of posts on this ML, typically indicating that it's a problem with the CRUSH
> map and Ceph being unable to satisfy the mapping rules. Your ceph -s output
> indicates that you're using replication of size 3 in your pools. You also said
> you had a custom CRUSH map - can you post it?
>
>
> I’ve sent the file to you, since I’m not sure if it contains sensitive data.
> Yes, I have replication of 3, and I did not customize the map myself.
>
>
> I might be missing something here but I don't quite see how you come to this
> statement. ceph osd df and ceph -s both show 16093 GB used and 39779 GB out
> of 55872 GB available. The sum of the first 3 OSDs' used space is, as you
> stated, 6181 GB, which is approx 38.4%, so quite close to your target of 33%.
>
>
> Maybe I have to explain it another way:
>
> Directly after finishing the backfill I received this output:
>
>      health HEALTH_WARN
>             4 pgs stuck unclean
>             recovery 1698/58476648 objects degraded (0.003%)
>             recovery 418137/58476648 objects misplaced (0.715%)
>             noscrub,nodeep-scrub flag(s) set
>      monmap e9: 5 mons at
> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>             election epoch 464, quorum 0,1,2,3,4
> ceph1,ceph2,ceph3,ceph4,ceph5
>      osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
>             flags noscrub,nodeep-scrub
>       pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
>             16093 GB used, 39779 GB / 55872 GB avail
>             1698/58476648 objects degraded (0.003%)
>             418137/58476648 objects misplaced (0.715%)
>                  316 active+clean
>                    4 active+remapped
>   client io 757 kB/s rd, 1 op/s
>
> # ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>  0 1.28899  1.00000  3724G  1924G  1799G 51.67 1.79
>  1 1.57899  1.00000  3724G  2143G  1580G 57.57 2.00
>  2 1.68900  1.00000  3724G  2114G  1609G 56.78 1.97
>  3 6.78499  1.00000  7450G  1234G  6215G 16.57 0.58
>  4 8.39999  1.00000  7450G  1221G  6228G 16.40 0.57
>  5 9.51500  1.00000  7450G  1232G  6217G 16.54 0.57
>  6 7.66499  1.00000  7450G  1258G  6191G 16.89 0.59
>  7 9.75499  1.00000  7450G  2482G  4967G 33.33 1.16
>  8 9.32999  1.00000  7450G  2480G  4969G 33.30 1.16
>               TOTAL 55872G 16093G 39779G 28.80
> MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54
>
> Here we can see that the cluster is using 4809 GB of data and has 16093 GB
> raw used - or, put the other way, only 39779 GB available.
>
> Two days later I saw:
>
>      health HEALTH_WARN
>             4 pgs stuck unclean
>             recovery 3486/58726035 objects degraded (0.006%)
>             recovery 420024/58726035 objects misplaced (0.715%)
>             noscrub,nodeep-scrub flag(s) set
>      monmap e9: 5 mons at
> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>             election epoch 478, quorum 0,1,2,3,4
> ceph1,ceph2,ceph3,ceph4,ceph5
>      osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>             flags noscrub,nodeep-scrub
>       pgmap v9969059: 320 pgs, 3 pools, 4830 GB data, 19116 kobjects
>             15150 GB used, 40722 GB / 55872 GB avail
>             3486/58726035 objects degraded (0.006%)
>             420024/58726035 objects misplaced (0.715%)
>                  316 active+clean
>                    4 active+remapped
>
> # ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>  0 1.28899  1.00000  3724G  1696G  2027G 45.56 1.68
>  1 1.57899  1.00000  3724G  1705G  2018G 45.80 1.69
>  2 1.68900  1.00000  3724G  1794G  1929G 48.19 1.78
>  3 6.78499  1.00000  7450G  1239G  6210G 16.64 0.61
>  4 8.39999  1.00000  7450G  1226G  6223G 16.46 0.61
>  5 9.51500  1.00000  7450G  1237G  6212G 16.61 0.61
>  6 7.66499  1.00000  7450G  1263G  6186G 16.96 0.63
>  7 9.75499  1.00000  7450G  2493G  4956G 33.47 1.23
>  8 9.32999  1.00000  7450G  2491G  4958G 33.44 1.23
>               TOTAL 55872G 15150G 40722G 27.12
> MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54
>
>
> As you can see, we are now using 4830 GB of data BUT raw used is only 15150 GB
> - or, put the other way, we now have 40722 GB free. You can see the change
> in the %USE of the OSDs. To me this looks like some data was lost,
> since Ceph did not do any backfill or other operation. That’s the problem...
>
>
> Am 09.01.2017 um 21:55 schrieb Christian Wuerdig
> <christian.wuerdig@xxxxxxxxx>:
>
>
>
> On Tue, Jan 10, 2017 at 8:23 AM, Marcus Müller <mueller.marcus@xxxxxxxxx>
> wrote:
>>
>> Hi all,
>>
>> Recently I added a new node with new OSDs to my cluster, which, of course,
>> resulted in backfilling. At the end, there are 4 PGs left in the state
>> active+remapped and I don’t know what to do.
>>
>> Here is what my cluster currently looks like:
>>
>> ceph -s
>>      health HEALTH_WARN
>>             4 pgs stuck unclean
>>             recovery 3586/58734009 objects degraded (0.006%)
>>             recovery 420074/58734009 objects misplaced (0.715%)
>>             noscrub,nodeep-scrub flag(s) set
>>      monmap e9: 5 mons at
>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>             election epoch 478, quorum 0,1,2,3,4
>> ceph1,ceph2,ceph3,ceph4,ceph5
>>      osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>             flags noscrub,nodeep-scrub
>>       pgmap v9970276: 320 pgs, 3 pools, 4831 GB data, 19119 kobjects
>>             15152 GB used, 40719 GB / 55872 GB avail
>>             3586/58734009 objects degraded (0.006%)
>>             420074/58734009 objects misplaced (0.715%)
>>                  316 active+clean
>>                    4 active+remapped
>>   client io 643 kB/s rd, 7 op/s
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>  0 1.28899  1.00000  3724G  1697G  2027G 45.57 1.68
>>  1 1.57899  1.00000  3724G  1706G  2018G 45.81 1.69
>>  2 1.68900  1.00000  3724G  1794G  1929G 48.19 1.78
>>  3 6.78499  1.00000  7450G  1240G  6209G 16.65 0.61
>>  4 8.39999  1.00000  7450G  1226G  6223G 16.47 0.61
>>  5 9.51500  1.00000  7450G  1237G  6212G 16.62 0.61
>>  6 7.66499  1.00000  7450G  1264G  6186G 16.97 0.63
>>  7 9.75499  1.00000  7450G  2494G  4955G 33.48 1.23
>>  8 9.32999  1.00000  7450G  2491G  4958G 33.45 1.23
>>               TOTAL 55872G 15152G 40719G 27.12
>> MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54
>>
>> # ceph health detail
>> HEALTH_WARN 4 pgs stuck unclean; recovery 3586/58734015 objects degraded
>> (0.006%); recovery 420074/58734015 objects misplaced (0.715%);
>> noscrub,nodeep-scrub flag(s) set
>> pg 9.7 is stuck unclean for 512936.160212, current state active+remapped,
>> last acting [7,3,0]
>> pg 7.84 is stuck unclean for 512623.894574, current state active+remapped,
>> last acting [4,8,1]
>> pg 8.1b is stuck unclean for 513164.616377, current state active+remapped,
>> last acting [4,7,2]
>> pg 7.7a is stuck unclean for 513162.316328, current state active+remapped,
>> last acting [7,4,2]
>> recovery 3586/58734015 objects degraded (0.006%)
>> recovery 420074/58734015 objects misplaced (0.715%)
>> noscrub,nodeep-scrub flag(s) set
>>
>> # ceph osd tree
>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 56.00693 root default
>> -2  1.28899     host ceph1
>>  0  1.28899         osd.0       up  1.00000          1.00000
>> -3  1.57899     host ceph2
>>  1  1.57899         osd.1       up  1.00000          1.00000
>> -4  1.68900     host ceph3
>>  2  1.68900         osd.2       up  1.00000          1.00000
>> -5 32.36497     host ceph4
>>  3  6.78499         osd.3       up  1.00000          1.00000
>>  4  8.39999         osd.4       up  1.00000          1.00000
>>  5  9.51500         osd.5       up  1.00000          1.00000
>>  6  7.66499         osd.6       up  1.00000          1.00000
>> -6 19.08498     host ceph5
>>  7  9.75499         osd.7       up  1.00000          1.00000
>>  8  9.32999         osd.8       up  1.00000          1.00000
>>
>> I’m using a customized crushmap because, as you can see, this cluster is not
>> very optimal. Ceph1, ceph2 and ceph3 are VMs on one physical host - ceph4
>> and ceph5 are both separate physical hosts. So the idea is to spread 33% of
>> the data across ceph1, ceph2 and ceph3, and the other 66% split between
>> ceph4 and ceph5.
>>
>> Everything went fine with the backfilling, but now I have seen those 4 PGs
>> stuck in active+remapped for 2 days while the number of degraded objects increases.
>>
>> I restarted all OSDs one after another, but this did not really help. At
>> first it showed no degraded objects, but then the count increased again.
>>
>> What can I do in order to get those PGs back to the active+clean state? My
>> idea was to increase the weight of an OSD a little bit in order to make Ceph
>> recalculate the map - is this a good idea?
>
>
> Trying google with "ceph pg stuck in active and remapped" points to a couple
> of posts on this ML, typically indicating that it's a problem with the CRUSH
> map and Ceph being unable to satisfy the mapping rules. Your ceph -s output
> indicates that you're using replication of size 3 in your pools. You also said
> you had a custom CRUSH map - can you post it?
>
>>
>>
>> ---
>>
>> On the other hand, I saw something very strange too: after the backfill was
>> done (2 days ago), my ceph osd df looked like this:
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
>>  0 1.28899  1.00000  3724G  1924G  1799G 51.67 1.79
>>  1 1.57899  1.00000  3724G  2143G  1580G 57.57 2.00
>>  2 1.68900  1.00000  3724G  2114G  1609G 56.78 1.97
>>  3 6.78499  1.00000  7450G  1234G  6215G 16.57 0.58
>>  4 8.39999  1.00000  7450G  1221G  6228G 16.40 0.57
>>  5 9.51500  1.00000  7450G  1232G  6217G 16.54 0.57
>>  6 7.66499  1.00000  7450G  1258G  6191G 16.89 0.59
>>  7 9.75499  1.00000  7450G  2482G  4967G 33.33 1.16
>>  8 9.32999  1.00000  7450G  2480G  4969G 33.30 1.16
>>               TOTAL 55872G 16093G 39779G 28.80
>> MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54
>>
>> While ceph -s was:
>>
>>      health HEALTH_WARN
>>             4 pgs stuck unclean
>>             recovery 1698/58476648 objects degraded (0.003%)
>>             recovery 418137/58476648 objects misplaced (0.715%)
>>             noscrub,nodeep-scrub flag(s) set
>>      monmap e9: 5 mons at
>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>             election epoch 464, quorum 0,1,2,3,4
>> ceph1,ceph2,ceph3,ceph4,ceph5
>>      osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
>>             flags noscrub,nodeep-scrub
>>       pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
>>             16093 GB used, 39779 GB / 55872 GB avail
>>             1698/58476648 objects degraded (0.003%)
>>             418137/58476648 objects misplaced (0.715%)
>>                  316 active+clean
>>                    4 active+remapped
>>   client io 757 kB/s rd, 1 op/s
>>
>>
>> As you can see above, my ceph osd df looked completely different -> this
>> shows that the first three OSDs lost data (about 1 TB) without any backfill
>> going on. If I add up the usage of osd.0, osd.1 and osd.2 it was 6181 GB,
>> but there should only be around 33%, so this would be wrong.
>
>
> I might be missing something here but I don't quite see how you come to this
> statement. ceph osd df and ceph -s both show 16093 GB used and 39779 GB out
> of 55872 GB available. The sum of the first 3 OSDs' used space is, as you
> stated, 6181 GB, which is approx 38.4%, so quite close to your target of 33%.
>
>>
>>
>> My question on this is: is this a bug and did I really lose important data,
>> or is this some Ceph cleanup action after the backfill?
>>
>> Thanks and regards,
>> Marcus
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



