Re: PGs stuck active+remapped and osds lose data?!


 



Trying Google with "ceph pg stuck in active and remapped" points to a couple of posts on this ML, typically indicating that it's a problem with the CRUSH map and Ceph being unable to satisfy the mapping rules. Your ceph -s output indicates that you're using a replication size of 3 in your pools. You also said you had a custom CRUSH map - can you post it?

I’ve sent the file to you directly, since I’m not sure whether it contains sensitive data. Yes, I have a replication size of 3, and the map was not customized by me.
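
For anyone else following the thread, the map can be exported and decompiled to plain text with the standard commands below (the file names are just examples):

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt

crushmap.txt then contains the buckets, weights and rules in readable form.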


I might be missing something here, but I don't quite see how you come to this statement. ceph osd df and ceph -s both show 16093 GB used and 39779 GB out of 55872 GB available. The sum of the first 3 OSDs' used space is, as you stated, 6181 GB, which is approx 38.4%, so quite close to your target of 33%.

Let me try to explain it another way:

Directly after the backfill finished, I got this output:

     health HEALTH_WARN
            4 pgs stuck unclean
            recovery 1698/58476648 objects degraded (0.003%)
            recovery 418137/58476648 objects misplaced (0.715%)
            noscrub,nodeep-scrub flag(s) set
     monmap e9: 5 mons at {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
            election epoch 464, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
     osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
            16093 GB used, 39779 GB / 55872 GB avail
            1698/58476648 objects degraded (0.003%)
            418137/58476648 objects misplaced (0.715%)
                 316 active+clean
                   4 active+remapped
  client io 757 kB/s rd, 1 op/s

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  
 0 1.28899  1.00000  3724G  1924G  1799G 51.67 1.79 
 1 1.57899  1.00000  3724G  2143G  1580G 57.57 2.00 
 2 1.68900  1.00000  3724G  2114G  1609G 56.78 1.97 
 3 6.78499  1.00000  7450G  1234G  6215G 16.57 0.58 
 4 8.39999  1.00000  7450G  1221G  6228G 16.40 0.57 
 5 9.51500  1.00000  7450G  1232G  6217G 16.54 0.57 
 6 7.66499  1.00000  7450G  1258G  6191G 16.89 0.59 
 7 9.75499  1.00000  7450G  2482G  4967G 33.33 1.16 
 8 9.32999  1.00000  7450G  2480G  4969G 33.30 1.16 
              TOTAL 55872G 16093G 39779G 28.80      
MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54

Here we can see that the cluster is using 4809 GB of data and has 16093 GB raw used, or, put the other way, only 39779 GB available.

Two days later I saw:

     health HEALTH_WARN
            4 pgs stuck unclean
            recovery 3486/58726035 objects degraded (0.006%)
            recovery 420024/58726035 objects misplaced (0.715%)
            noscrub,nodeep-scrub flag(s) set
     monmap e9: 5 mons at {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
            election epoch 478, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
     osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v9969059: 320 pgs, 3 pools, 4830 GB data, 19116 kobjects
            15150 GB used, 40722 GB / 55872 GB avail
            3486/58726035 objects degraded (0.006%)
            420024/58726035 objects misplaced (0.715%)
                 316 active+clean
                   4 active+remapped

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  
 0 1.28899  1.00000  3724G  1696G  2027G 45.56 1.68 
 1 1.57899  1.00000  3724G  1705G  2018G 45.80 1.69 
 2 1.68900  1.00000  3724G  1794G  1929G 48.19 1.78 
 3 6.78499  1.00000  7450G  1239G  6210G 16.64 0.61 
 4 8.39999  1.00000  7450G  1226G  6223G 16.46 0.61 
 5 9.51500  1.00000  7450G  1237G  6212G 16.61 0.61 
 6 7.66499  1.00000  7450G  1263G  6186G 16.96 0.63 
 7 9.75499  1.00000  7450G  2493G  4956G 33.47 1.23 
 8 9.32999  1.00000  7450G  2491G  4958G 33.44 1.23 
              TOTAL 55872G 15150G 40722G 27.12      
MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54


As you can now see, we are using 4830 GB of data, BUT raw used is only 15150 GB, or, put the other way, we now have 40722 GB free. You can see the change in the %USE of the OSDs. To me this looks like some data was lost, since Ceph did not do any backfill or other operation in the meantime. That’s the problem...
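
For reference, whether objects are really gone or only the raw-space accounting changed can be cross-checked with the per-pool statistics and the state of the stuck PGs, roughly like this (pg 9.7 is just one of the four stuck PGs as an example):

# ceph df detail
# rados df
# ceph pg 9.7 query

ceph df detail and rados df show the per-pool object counts, and the pg query output should show the up vs. acting set and any missing/unfound objects for that PG.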


On 09.01.2017, at 21:55, Christian Wuerdig <christian.wuerdig@xxxxxxxxx> wrote:



On Tue, Jan 10, 2017 at 8:23 AM, Marcus Müller <mueller.marcus@xxxxxxxxx> wrote:
Hi all,

Recently I added a new node with new OSDs to my cluster, which, of course, resulted in backfilling. At the end, there are 4 PGs left in the state active+remapped and I don’t know what to do.

Here is what my cluster currently looks like:

ceph -s
     health HEALTH_WARN
            4 pgs stuck unclean
            recovery 3586/58734009 objects degraded (0.006%)
            recovery 420074/58734009 objects misplaced (0.715%)
            noscrub,nodeep-scrub flag(s) set
            election epoch 478, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
     osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v9970276: 320 pgs, 3 pools, 4831 GB data, 19119 kobjects
            15152 GB used, 40719 GB / 55872 GB avail
            3586/58734009 objects degraded (0.006%)
            420074/58734009 objects misplaced (0.715%)
                 316 active+clean
                   4 active+remapped
  client io 643 kB/s rd, 7 op/s

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  
 0 1.28899  1.00000  3724G  1697G  2027G 45.57 1.68 
 1 1.57899  1.00000  3724G  1706G  2018G 45.81 1.69 
 2 1.68900  1.00000  3724G  1794G  1929G 48.19 1.78 
 3 6.78499  1.00000  7450G  1240G  6209G 16.65 0.61 
 4 8.39999  1.00000  7450G  1226G  6223G 16.47 0.61 
 5 9.51500  1.00000  7450G  1237G  6212G 16.62 0.61 
 6 7.66499  1.00000  7450G  1264G  6186G 16.97 0.63 
 7 9.75499  1.00000  7450G  2494G  4955G 33.48 1.23 
 8 9.32999  1.00000  7450G  2491G  4958G 33.45 1.23 
              TOTAL 55872G 15152G 40719G 27.12      
MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54

# ceph health detail
HEALTH_WARN 4 pgs stuck unclean; recovery 3586/58734015 objects degraded (0.006%); recovery 420074/58734015 objects misplaced (0.715%); noscrub,nodeep-scrub flag(s) set
pg 9.7 is stuck unclean for 512936.160212, current state active+remapped, last acting [7,3,0]
pg 7.84 is stuck unclean for 512623.894574, current state active+remapped, last acting [4,8,1]
pg 8.1b is stuck unclean for 513164.616377, current state active+remapped, last acting [4,7,2]
pg 7.7a is stuck unclean for 513162.316328, current state active+remapped, last acting [7,4,2]
recovery 3586/58734015 objects degraded (0.006%)
recovery 420074/58734015 objects misplaced (0.715%)
noscrub,nodeep-scrub flag(s) set

# ceph osd tree
ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 56.00693 root default                                     
-2  1.28899     host ceph1                                   
 0  1.28899         osd.0       up  1.00000          1.00000 
-3  1.57899     host ceph2                                   
 1  1.57899         osd.1       up  1.00000          1.00000 
-4  1.68900     host ceph3                                   
 2  1.68900         osd.2       up  1.00000          1.00000 
-5 32.36497     host ceph4                                   
 3  6.78499         osd.3       up  1.00000          1.00000 
 4  8.39999         osd.4       up  1.00000          1.00000 
 5  9.51500         osd.5       up  1.00000          1.00000 
 6  7.66499         osd.6       up  1.00000          1.00000 
-6 19.08498     host ceph5                                   
 7  9.75499         osd.7       up  1.00000          1.00000 
 8  9.32999         osd.8       up  1.00000          1.00000 

I’m using a customized CRUSH map because, as you can see, this cluster is not very optimal. ceph1, ceph2 and ceph3 are VMs on one physical host - ceph4 and ceph5 are both separate physical hosts. So the idea is to spread 33% of the data across ceph1, ceph2 and ceph3, and the other 66% across ceph4 and ceph5.
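
A quick way to sanity-check whether CRUSH can actually satisfy the rules for 3 replicas with this layout is to run the compiled map through crushtool, roughly like this (the rule number 0 is only an assumption and may need adjusting):

# ceph osd getcrushmap -o crushmap.bin
# crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings

If CRUSH cannot find 3 OSDs for some inputs, --show-bad-mappings will list them.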

Everything went fine with the backfilling, but those 4 PGs have now been stuck active+remapped for 2 days, while the number of degraded objects keeps increasing.

I restarted all OSDs one after another, but that did not really help. At first it showed no degraded objects, but then the count increased again.

What can I do to get those PGs back to the active+clean state? My idea was to increase the weight of an OSD a little bit to make Ceph recalculate the map. Is this a good idea?
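
For reference, that would be done with something like the following command (osd.0 and the weight value here are only placeholders, not a recommendation):

# ceph osd crush reweight osd.0 1.30

This changes the CRUSH weight of the OSD, which makes CRUSH recalculate the placement and triggers remapping/backfill for the affected PGs.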

Trying Google with "ceph pg stuck in active and remapped" points to a couple of posts on this ML, typically indicating that it's a problem with the CRUSH map and Ceph being unable to satisfy the mapping rules. Your ceph -s output indicates that you're using a replication size of 3 in your pools. You also said you had a custom CRUSH map - can you post it?
 

---

Apart from that, I also saw something very strange: after the backfill was done (2 days ago), my ceph osd df looked like this:

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  
 0 1.28899  1.00000  3724G  1924G  1799G 51.67 1.79 
 1 1.57899  1.00000  3724G  2143G  1580G 57.57 2.00 
 2 1.68900  1.00000  3724G  2114G  1609G 56.78 1.97 
 3 6.78499  1.00000  7450G  1234G  6215G 16.57 0.58 
 4 8.39999  1.00000  7450G  1221G  6228G 16.40 0.57 
 5 9.51500  1.00000  7450G  1232G  6217G 16.54 0.57 
 6 7.66499  1.00000  7450G  1258G  6191G 16.89 0.59 
 7 9.75499  1.00000  7450G  2482G  4967G 33.33 1.16 
 8 9.32999  1.00000  7450G  2480G  4969G 33.30 1.16 
              TOTAL 55872G 16093G 39779G 28.80      
MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54

While ceph -s was:

     health HEALTH_WARN
            4 pgs stuck unclean
            recovery 1698/58476648 objects degraded (0.003%)
            recovery 418137/58476648 objects misplaced (0.715%)
            noscrub,nodeep-scrub flag(s) set
            election epoch 464, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
     osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
            16093 GB used, 39779 GB / 55872 GB avail
            1698/58476648 objects degraded (0.003%)
            418137/58476648 objects misplaced (0.715%)
                 316 active+clean
                   4 active+remapped
  client io 757 kB/s rd, 1 op/s


As you can see, my ceph osd df above looks completely different -> this shows that the first three OSDs lost data (about 1 TB) without any backfill going on. If I add up the usage of osd.0, osd.1 and osd.2, it was 6181 GB. But there should only be around 33% there, so this would be wrong.

I might be missing something here, but I don't quite see how you come to this statement. ceph osd df and ceph -s both show 16093 GB used and 39779 GB out of 55872 GB available. The sum of the first 3 OSDs' used space is, as you stated, 6181 GB, which is approx 38.4%, so quite close to your target of 33%.
 

My question on this is: is this a bug and have I really lost important data, or is this some Ceph cleanup action after the backfill?

Thanks and regards,
Marcus



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
