Hi all,
Recently I added a new node with new osds to my cluster, which, of course, resulted in backfilling. At the end, 4 pgs are left in the state active+remapped and I don't know what to do. Here is how my cluster currently looks:

ceph -s
    health HEALTH_WARN
           4 pgs stuck unclean
           recovery 3586/58734009 objects degraded (0.006%)
           recovery 420074/58734009 objects misplaced (0.715%)
           noscrub,nodeep-scrub flag(s) set
     monmap e9: 5 mons at {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
            election epoch 478, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
     osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v9970276: 320 pgs, 3 pools, 4831 GB data, 19119 kobjects
            15152 GB used, 40719 GB / 55872 GB avail
            3586/58734009 objects degraded (0.006%)
            420074/58734009 objects misplaced (0.715%)
                 316 active+clean
                   4 active+remapped
  client io 643 kB/s rd, 7 op/s

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
 0 1.28899  1.00000  3724G  1697G  2027G 45.57 1.68
 1 1.57899  1.00000  3724G  1706G  2018G 45.81 1.69
 2 1.68900  1.00000  3724G  1794G  1929G 48.19 1.78
 3 6.78499  1.00000  7450G  1240G  6209G 16.65 0.61
 4 8.39999  1.00000  7450G  1226G  6223G 16.47 0.61
 5 9.51500  1.00000  7450G  1237G  6212G 16.62 0.61
 6 7.66499  1.00000  7450G  1264G  6186G 16.97 0.63
 7 9.75499  1.00000  7450G  2494G  4955G 33.48 1.23
 8 9.32999  1.00000  7450G  2491G  4958G 33.45 1.23
             TOTAL  55872G 15152G 40719G 27.12
MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54

# ceph health detail
HEALTH_WARN 4 pgs stuck unclean; recovery 3586/58734015 objects degraded (0.006%); recovery 420074/58734015 objects misplaced (0.715%); noscrub,nodeep-scrub flag(s) set
pg 9.7 is stuck unclean for 512936.160212, current state active+remapped, last acting [7,3,0]
pg 7.84 is stuck unclean for 512623.894574, current state active+remapped, last acting [4,8,1]
pg 8.1b is stuck unclean for 513164.616377, current state active+remapped, last acting [4,7,2]
pg 7.7a is stuck unclean for 513162.316328, current state active+remapped, last acting [7,4,2]
recovery 3586/58734015 objects degraded (0.006%)
recovery 420074/58734015 objects misplaced (0.715%)
noscrub,nodeep-scrub flag(s) set

# ceph osd tree
ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 56.00693 root default
-2  1.28899     host ceph1
 0  1.28899         osd.0       up  1.00000          1.00000
-3  1.57899     host ceph2
 1  1.57899         osd.1       up  1.00000          1.00000
-4  1.68900     host ceph3
 2  1.68900         osd.2       up  1.00000          1.00000
-5 32.36497     host ceph4
 3  6.78499         osd.3       up  1.00000          1.00000
 4  8.39999         osd.4       up  1.00000          1.00000
 5  9.51500         osd.5       up  1.00000          1.00000
 6  7.66499         osd.6       up  1.00000          1.00000
-6 19.08498     host ceph5
 7  9.75499         osd.7       up  1.00000          1.00000
 8  9.32999         osd.8       up  1.00000          1.00000

I'm using a customized crushmap because, as you can see, this cluster is not laid out very optimally: ceph1, ceph2 and ceph3 are VMs on one physical host, while ceph4 and ceph5 are each separate physical hosts. The idea is to spread 33% of the data across ceph1, ceph2 and ceph3, and another 33% each onto ceph4 and ceph5.

Everything went fine with the backfilling, but those 4 pgs have now been stuck in active+remapped for 2 days, while the number of degraded objects keeps increasing. I restarted all OSDs one after another, but that did not really help: the degraded count dropped to zero at first and then started increasing again.

What can I do to get those pgs back into the active+clean state? My idea was to increase the weight of an OSD a little bit so that Ceph recalculates the mapping. Is that a good idea?
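For reference, this is roughly what I had in mind; the pg id is just one of the four stuck ones and the weight value 1.30 is only an example, not something I have run yet:

# ceph pg dump_stuck unclean            <- list the 4 stuck pgs
# ceph pg map 9.7                       <- show the "up" set vs. the "acting" set
# ceph pg 9.7 query                     <- full peering/recovery state of one pg
# ceph osd crush reweight osd.0 1.30    <- small crush weight change to force a new mapping

As far as I understand, "remapped" means the acting set differs from the up set CRUSH calculated, so the pg map / pg query output should at least show where CRUSH wants to place those pgs.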
---

Apart from that, I saw something very strange too. Right after the backfill was done (2 days ago), my ceph osd df looked like this:

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR
 0 1.28899  1.00000  3724G  1924G  1799G 51.67 1.79
 1 1.57899  1.00000  3724G  2143G  1580G 57.57 2.00
 2 1.68900  1.00000  3724G  2114G  1609G 56.78 1.97
 3 6.78499  1.00000  7450G  1234G  6215G 16.57 0.58
 4 8.39999  1.00000  7450G  1221G  6228G 16.40 0.57
 5 9.51500  1.00000  7450G  1232G  6217G 16.54 0.57
 6 7.66499  1.00000  7450G  1258G  6191G 16.89 0.59
 7 9.75499  1.00000  7450G  2482G  4967G 33.33 1.16
 8 9.32999  1.00000  7450G  2480G  4969G 33.30 1.16
             TOTAL  55872G 16093G 39779G 28.80
MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54

while ceph -s showed:

    health HEALTH_WARN
           4 pgs stuck unclean
           recovery 1698/58476648 objects degraded (0.003%)
           recovery 418137/58476648 objects misplaced (0.715%)
           noscrub,nodeep-scrub flag(s) set
     monmap e9: 5 mons at {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
            election epoch 464, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
     osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
            16093 GB used, 39779 GB / 55872 GB avail
            1698/58476648 objects degraded (0.003%)
            418137/58476648 objects misplaced (0.715%)
                 316 active+clean
                   4 active+remapped
  client io 757 kB/s rd, 1 op/s

As you can see, that looks completely different from the current ceph osd df further above: the first three OSDs lost about 1 TB of data in total without any backfill going on. Adding up osd.0, osd.1 and osd.2 back then gives 6181 GB, but they should only hold around 33% of the data, so that amount would have been wrong anyway. My question on this is: is this a bug and did I really lose important data, or is this some Ceph cleanup action after the backfill?

Thanks and regards,
Marcus
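In case it matters for that second question, this is how I was planning to check whether any data actually disappeared or whether old copies were just cleaned up after the backfill; these are generic commands, not output from my cluster:

# ceph df detail                  <- per-pool object counts and used space
# rados df                        <- per-pool usage as seen by rados
# ceph pg dump | grep remapped    <- up/acting osds of the remapped pgs

If the per-pool object counts have stayed the same, I would assume the drop on osd.0, osd.1 and osd.2 was just the old, misplaced copies being removed once the backfill finished, rather than real data loss.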