1 PG stuck unclean (active+remapped) after OSD replacement

Hi experts,

I have a strange situation right now. We are re-organizing our 4-node Hammer cluster, moving the OSDs from LVM volumes to whole HDDs. When we did this on the first node last week, everything went smoothly: I removed the old OSDs from the CRUSH map, and rebalancing and recovery finished successfully. This weekend we did the same on the second node: we created the new HDD-based OSDs, added them to the cluster, waited for rebalancing to finish and then stopped the old OSDs. Only this time the recovery didn't completely finish, and 4 PGs stayed stuck unclean. I found out that 3 of these 4 PGs had their primary OSD on that node, so I restarted the respective OSD services and those 3 PGs recovered successfully. But the last PG is giving me headaches.

ceph@ndesan01:~ # ceph pg map 1.3d3
osdmap e24320 pg 1.3d3 (1.3d3) -> up [16,21] acting [16,21,0]
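
If more detail helps, these are the commands I would run to gather the full state of that PG; I can post their output on request:

ceph pg 1.3d3 query      # full peering/recovery state of the PG
ceph health detail       # lists the PG among the stuck/unclean ones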

ceph@ndesan01:~/ceph-deploy> ceph osd tree
ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 9.38985 root default
-2 1.19995     host ndesan01
 0 0.23999         osd.0          up  1.00000          1.00000
 1 0.23999         osd.1          up  1.00000          1.00000
 2 0.23999         osd.2          up  1.00000          1.00000
13 0.23999         osd.13         up  1.00000          1.00000
19 0.23999         osd.19         up  1.00000          1.00000
-3 1.81998     host ndesan02
 3       0         osd.3        down        0          1.00000
 4       0         osd.4        down        0          1.00000
 5       0         osd.5        down        0          1.00000
 9       0         osd.9        down  1.00000          1.00000
10       0         osd.10       down  1.00000          1.00000
 6 0.90999         osd.6          up  1.00000          1.00000
 7 0.90999         osd.7          up  1.00000          1.00000
-4 1.81998     host nde32
20 0.90999         osd.20         up  1.00000          1.00000
21 0.90999         osd.21         up  1.00000          1.00000
-5 4.54994     host ndesan03
14 0.90999         osd.14         up  1.00000          1.00000
15 0.90999         osd.15         up  1.00000          1.00000
16 0.90999         osd.16         up  1.00000          1.00000
17 0.90999         osd.17         up  1.00000          1.00000
18 0.90999         osd.18         up  1.00000          1.00000


All OSDs marked "down" are going to be removed. I looked for that PG on all 3 nodes involved, and each of them has a copy of it. All OSD services are up and running, but for some reason this PG doesn't seem to notice. Is there a reasonable explanation, and/or any advice on how to get that PG recovered?
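
One thing I have considered but not tried yet is forcing the PG to re-peer by briefly marking its primary down (the daemon should report back in right away and trigger a new peering round); I'm not sure whether that's advisable here:

ceph osd down 16     # mark the primary of pg 1.3d3 down to trigger re-peering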

One thing I noticed:

The data on the primary OSD (osd.16) had different timestamps than on the other two OSDs:

---cut here---
ndesan03:~ # ls -rtl /var/lib/ceph/osd/ceph-16/current/1.3d3_head/
total 389436
-rw-r--r-- 1 root root       0 Jul 12  2016 __head_000003D3__1
...
-rw-r--r-- 1 root root       0 Jan  9 10:43 rbd\udata.bca465368d6b49.0000000000000a06__head_20EFF3D3__1
-rw-r--r-- 1 root root       0 Jan  9 10:43 rbd\udata.bca465368d6b49.0000000000000a8b__head_A014F3D3__1
-rw-r--r-- 1 root root       0 Jan  9 10:44 rbd\udata.bca465368d6b49.0000000000000e2c__head_00F2D3D3__1
-rw-r--r-- 1 root root       0 Jan  9 10:44 rbd\udata.bca465368d6b49.0000000000000e6a__head_C91813D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 13:53 rbd\udata.cc94344e6afb66.00000000000008cb__head_6AA4B3D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 14:47 rbd\udata.e15aee238e1f29.00000000000005f0__head_C95063D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 15:10 rbd\udata.e15aee238e1f29.0000000000000d15__head_FF1083D3__1
-rw-r--r-- 1 root root 8388608 Jan 20 15:19 rbd\udata.e15aee238e1f29.000000000000100c__head_6B17F3D3__1
-rw-r--r-- 1 root root 8388608 Jan 23 14:17 rbd\udata.e73cf7b03e0c6.0000000000000479__head_C16003D3__1
-rw-r--r-- 1 root root 8388608 Jan 25 11:52 rbd\udata.d4edc95e884adc.00000000000000f4__head_00EE43D3__1
-rw-r--r-- 1 root root 4194304 Jan 27 08:07 rbd\udata.34595be2237e6.0000000000000ad5__head_D3CC93D3__1
-rw-r--r-- 1 root root 4194304 Jan 27 08:08 rbd\udata.34595be2237e6.0000000000000aff__head_3BF633D3__1
-rw-r--r-- 1 root root 4194304 Jan 27 16:20 rbd\udata.8b61c69f34baf.000000000000876a__head_A60A63D3__1
-rw-r--r-- 1 root root 4194304 Jan 29 17:45 rbd\udata.28fcaf199543c3.0000000000000ae7__head_C1BA53D3__1
-rw-r--r-- 1 root root 4194304 Jan 30 06:33 rbd\udata.28fcaf199543c3.0000000000001832__head_6EC113D3__1
-rw-r--r-- 1 root root 4194304 Jan 31 10:33 rb.0.ddcdf5.238e1f29.0000000000e4__head_3F1543D3__1
-rw-r--r-- 1 root root 4194304 Feb 13 06:14 rbd\udata.856071751c29d.000000000000617b__head_E1E4A3D3__1
---cut here---

The other two OSDs have identical timestamps; here is the (shortened) output from osd.21:

---cut here---
nde32:/var/lib/ceph/osd/ceph-21/current # ls -lrt /var/lib/ceph/osd/ceph-21/current/1.3d3_head/
total 389432
-rw-r--r-- 1 root root       0 Feb  6 15:29 __head_000003D3__1
...
-rw-r--r-- 1 root root       0 Feb  6 16:46 rbd\udata.a00851d652069.00000000000007a4__head_C55DB3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.947feb21a163a2.0000000000004349__head_A37FB3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.8b61c69f34baf.00000000000068cb__head_B4A2C3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.874a620334da.00000000000004ed__head_3835C3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.8b61c69f34baf.0000000000004424__head_5BA7C3D3__1
-rw-r--r-- 1 root root 8388608 Feb  6 16:47 rbd\udata.31a3e57d64476.0000000000000418__head_B158C3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.1128db1b5d2111.00000000000002eb__head_81AAC3D3__1
-rw-r--r-- 1 root root       0 Feb  6 16:47 rbd\udata.bca465368d6b49.0000000000000e2c__head_00F2D3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.2d6fe91cf37a46.000000000000019e__head_2346D3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.856071751c29d.0000000000006134__head_C876E3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.949da61c92b32c.0000000000000a18__head_397BE3D3__1
-rw-r--r-- 1 root root 8388608 Feb  6 16:47 rbd\udata.567d57d819eed.000000000000034f__head_FC83F3D3__1
-rw-r--r-- 1 root root       0 Feb  6 16:47 rbd\udata.bca465368d6b49.0000000000000a8b__head_A014F3D3__1
-rw-r--r-- 1 root root 4194304 Feb  6 16:47 rbd\udata.856071751c29d.0000000000003a2c__head_0684F3D3__1
-rw-r--r-- 1 root root 8388608 Feb  6 16:47 rbd\udata.e15aee238e1f29.000000000000100c__head_6B17F3D3__1
-rw-r--r-- 1 root root       0 Feb  6 16:47 rbd\udata.bca465368d6b49.0000000000000a06__head_20EFF3D3__1
-rw-r--r-- 1 root root 4194304 Feb 13 06:14 rbd\udata.856071751c29d.000000000000617b__head_E1E4A3D3__1
---cut here---

So I figured the data on the primary OSD could be the problem, copied the contents over from one of the other OSDs and restarted all 3 OSDs, but the status didn't change. How can I repair this PG?
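
If copying at the filesystem level is the wrong approach, I was wondering whether a deep scrub followed by a repair, or an export/import of the PG with ceph-objectstore-tool (with the affected OSDs stopped and the export file copied between the nodes), would be the cleaner way. This is only a rough, untested sketch, and I'm not sure about the exact syntax on Hammer:

ceph pg deep-scrub 1.3d3     # check whether the replicas really differ
ceph pg repair 1.3d3         # let the primary repair inconsistent copies

# or, with the OSD daemons stopped:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
    --journal-path /var/lib/ceph/osd/ceph-21/journal \
    --pgid 1.3d3 --op export --file /tmp/pg1.3d3.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
    --journal-path /var/lib/ceph/osd/ceph-16/journal \
    --pgid 1.3d3 --op remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 \
    --journal-path /var/lib/ceph/osd/ceph-16/journal \
    --pgid 1.3d3 --op import --file /tmp/pg1.3d3.export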

Another question about OSD replacement: why didn't the cluster elect a new primary for all affected PGs when the old OSDs went down? If this had been a real disk failure, I would have doubts about a full recovery. Or should I have deleted that PG instead of re-activating the old OSDs? I'm not sure what the best practice would be in this case.
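
For reference, this is the removal sequence I have been following for the old OSDs; if there is a better order (e.g. to avoid rebalancing twice), I'd be glad to hear it:

ceph osd out 3                # stop placing data on the OSD, triggers rebalancing
# wait for recovery to finish, then stop the daemon on the node
ceph osd crush remove osd.3   # remove it from the CRUSH map
ceph auth del osd.3           # remove its authentication key
ceph osd rm 3                 # remove the OSD id from the cluster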

Any help is appreciated!

Regards,
Eugen

--
Eugen Block                             voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg                         e-mail  : eblock@xxxxxx

        Chairwoman of the Supervisory Board: Angelika Mozdzen
          Registered office and register court: Hamburg, HRB 90934
                  Executive Board: Jens-U. Mozdzen
                   VAT ID no. DE 814 013 983
