Re: 1 PG stuck unclean (active+remapped) after OSD replacement

Thanks for your quick responses,

while I was writing my reply, a rebalancing was already in progress because I had started another CRUSH reweight to get rid of the old, re-activated OSDs again. Now that it has finished, the cluster is back in a healthy state.
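
(In case it is useful to anyone: the reweighting boils down to something like

ceph osd crush reweight osd.<id> 0

for each of the old OSDs, and then waiting for the remapped PGs to become active+clean again.)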

Thanks,
Eugen

Quoting Gregory Farnum <gfarnum@xxxxxxxxxx>:

On Mon, Feb 13, 2017 at 7:05 AM Wido den Hollander <wido@xxxxxxxx> wrote:


> On 13 February 2017 at 16:03, Eugen Block <eblock@xxxxxx> wrote:
>
>
> Hi experts,
>
> I have a strange situation right now. We are re-organizing our 4-node
> Hammer cluster from LVM-based OSDs to HDD-based OSDs. When we did this
> on the first node last week, everything went smoothly: I removed the old
> OSDs from the CRUSH map, and the rebalancing and recovery finished
> successfully.
> This weekend we did the same with the second node: we created the
> HDD-based OSDs and added them to the cluster, waited for the rebalancing
> to finish and then stopped the old OSDs. Only this time the recovery
> didn't finish completely; 4 PGs remained stuck unclean. I found out that 3
> of these 4 PGs had their primary OSD on that node, so I restarted the
> respective OSD services and those 3 PGs recovered successfully. But there
> is one last PG that is giving me headaches.
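>
> (In case it helps: commands like
>
> ceph pg dump_stuck unclean
> ceph pg map <pgid>
>
> show which PGs are stuck and which OSDs are in their up and acting sets;
> the first OSD in the acting set is the primary. An example of the pg map
> output follows below.)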
>
> ceph@ndesan01:~ # ceph pg map 1.3d3
> osdmap e24320 pg 1.3d3 (1.3d3) -> up [16,21] acting [16,21,0]
>

What version of Ceph is this? Could it be that the cluster is still running
with old CRUSH tunables? When was it originally installed, and with which
Ceph version?
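
A quick way to check that is

ceph osd crush show-tunables

which prints the tunables the cluster is currently running with.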


I'm not sure it even takes old tunables. With half of the cluster's weight
in one bucket (that last host), CRUSH is going to have trouble with some
placements. Assuming things will balance out when the transition is done,
I'd just keep going, especially since the three acting replicas are
sticking around.
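
If you want to see why that one PG stays remapped in the meantime, something like

ceph pg 1.3d3 query

will show its up/acting sets and the recovery_state section, which usually
explains what it is waiting for.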
-Greg



Wido

> ceph@ndesan01:~/ceph-deploy> ceph osd tree
> ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 9.38985 root default
> -2 1.19995     host ndesan01
>   0 0.23999         osd.0          up  1.00000          1.00000
>   1 0.23999         osd.1          up  1.00000          1.00000
>   2 0.23999         osd.2          up  1.00000          1.00000
> 13 0.23999         osd.13         up  1.00000          1.00000
> 19 0.23999         osd.19         up  1.00000          1.00000
> -3 1.81998     host ndesan02
>   3       0         osd.3        down        0          1.00000
>   4       0         osd.4        down        0          1.00000
>   5       0         osd.5        down        0          1.00000
>   9       0         osd.9        down  1.00000          1.00000
> 10       0         osd.10       down  1.00000          1.00000
>   6 0.90999         osd.6          up  1.00000          1.00000
>   7 0.90999         osd.7          up  1.00000          1.00000
> -4 1.81998     host nde32
> 20 0.90999         osd.20         up  1.00000          1.00000
> 21 0.90999         osd.21         up  1.00000          1.00000
> -5 4.54994     host ndesan03
> 14 0.90999         osd.14         up  1.00000          1.00000
> 15 0.90999         osd.15         up  1.00000          1.00000
> 16 0.90999         osd.16         up  1.00000          1.00000
> 17 0.90999         osd.17         up  1.00000          1.00000
> 18 0.90999         osd.18         up  1.00000          1.00000
>
>
> All OSDs marked as "down" are going to be removed. I looked for that
> PG on all 3 nodes, and all of them have it. All services are up and
> running, but for some reason this PG does not seem to be aware of that.
> Is there a reasonable explanation, and/or some advice on how to get
> that PG recovered?
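>
> (For completeness: a simple way to check this is to look for the PG's
> directory on each host, e.g.
>
> ls -d /var/lib/ceph/osd/ceph-*/current/1.3d3_head
>
> and that directory is present on all three nodes here.)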
>
> One thing I noticed:
>
> The data on the primary OSD (osd.16) had different timestamps than on
> the other two OSDs:
>
> ---cut here---
> ndesan03:~ # ls -rtl /var/lib/ceph/osd/ceph-16/current/1.3d3_head/
> total 389436
> -rw-r--r-- 1 root root       0 Jul 12  2016 __head_000003D3__1
> ...
> -rw-r--r-- 1 root root       0 Jan  9 10:43
> rbd\udata.bca465368d6b49.0000000000000a06__head_20EFF3D3__1
> -rw-r--r-- 1 root root       0 Jan  9 10:43
> rbd\udata.bca465368d6b49.0000000000000a8b__head_A014F3D3__1
> -rw-r--r-- 1 root root       0 Jan  9 10:44
> rbd\udata.bca465368d6b49.0000000000000e2c__head_00F2D3D3__1
> -rw-r--r-- 1 root root       0 Jan  9 10:44
> rbd\udata.bca465368d6b49.0000000000000e6a__head_C91813D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 13:53
> rbd\udata.cc94344e6afb66.00000000000008cb__head_6AA4B3D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 14:47
> rbd\udata.e15aee238e1f29.00000000000005f0__head_C95063D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 15:10
> rbd\udata.e15aee238e1f29.0000000000000d15__head_FF1083D3__1
> -rw-r--r-- 1 root root 8388608 Jan 20 15:19
> rbd\udata.e15aee238e1f29.000000000000100c__head_6B17F3D3__1
> -rw-r--r-- 1 root root 8388608 Jan 23 14:17
> rbd\udata.e73cf7b03e0c6.0000000000000479__head_C16003D3__1
> -rw-r--r-- 1 root root 8388608 Jan 25 11:52
> rbd\udata.d4edc95e884adc.00000000000000f4__head_00EE43D3__1
> -rw-r--r-- 1 root root 4194304 Jan 27 08:07
> rbd\udata.34595be2237e6.0000000000000ad5__head_D3CC93D3__1
> -rw-r--r-- 1 root root 4194304 Jan 27 08:08
> rbd\udata.34595be2237e6.0000000000000aff__head_3BF633D3__1
> -rw-r--r-- 1 root root 4194304 Jan 27 16:20
> rbd\udata.8b61c69f34baf.000000000000876a__head_A60A63D3__1
> -rw-r--r-- 1 root root 4194304 Jan 29 17:45
> rbd\udata.28fcaf199543c3.0000000000000ae7__head_C1BA53D3__1
> -rw-r--r-- 1 root root 4194304 Jan 30 06:33
> rbd\udata.28fcaf199543c3.0000000000001832__head_6EC113D3__1
> -rw-r--r-- 1 root root 4194304 Jan 31 10:33
> rb.0.ddcdf5.238e1f29.0000000000e4__head_3F1543D3__1
> -rw-r--r-- 1 root root 4194304 Feb 13 06:14
> rbd\udata.856071751c29d.000000000000617b__head_E1E4A3D3__1
> ---cut here---
>
> The other two OSDs have identical timestamps; here is the
> (shortened) output of osd.21:
>
> ---cut here---
> nde32:/var/lib/ceph/osd/ceph-21/current # ls -lrt
> /var/lib/ceph/osd/ceph-21/current/1.3d3_head/
> total 389432
> -rw-r--r-- 1 root root       0 Feb  6 15:29 __head_000003D3__1
> ...
> -rw-r--r-- 1 root root       0 Feb  6 16:46
> rbd\udata.a00851d652069.00000000000007a4__head_C55DB3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.947feb21a163a2.0000000000004349__head_A37FB3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.8b61c69f34baf.00000000000068cb__head_B4A2C3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.874a620334da.00000000000004ed__head_3835C3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.8b61c69f34baf.0000000000004424__head_5BA7C3D3__1
> -rw-r--r-- 1 root root 8388608 Feb  6 16:47
> rbd\udata.31a3e57d64476.0000000000000418__head_B158C3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.1128db1b5d2111.00000000000002eb__head_81AAC3D3__1
> -rw-r--r-- 1 root root       0 Feb  6 16:47
> rbd\udata.bca465368d6b49.0000000000000e2c__head_00F2D3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.2d6fe91cf37a46.000000000000019e__head_2346D3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.856071751c29d.0000000000006134__head_C876E3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.949da61c92b32c.0000000000000a18__head_397BE3D3__1
> -rw-r--r-- 1 root root 8388608 Feb  6 16:47
> rbd\udata.567d57d819eed.000000000000034f__head_FC83F3D3__1
> -rw-r--r-- 1 root root       0 Feb  6 16:47
> rbd\udata.bca465368d6b49.0000000000000a8b__head_A014F3D3__1
> -rw-r--r-- 1 root root 4194304 Feb  6 16:47
> rbd\udata.856071751c29d.0000000000003a2c__head_0684F3D3__1
> -rw-r--r-- 1 root root 8388608 Feb  6 16:47
> rbd\udata.e15aee238e1f29.000000000000100c__head_6B17F3D3__1
> -rw-r--r-- 1 root root       0 Feb  6 16:47
> rbd\udata.bca465368d6b49.0000000000000a06__head_20EFF3D3__1
> -rw-r--r-- 1 root root 4194304 Feb 13 06:14
> rbd\udata.856071751c29d.000000000000617b__head_E1E4A3D3__1
> ---cut here---
>
> So I figured that the data on the primary OSD could be the problem,
> copied the content over from one of the other OSDs and restarted all
> 3 OSDs, but the status didn't change. How can I repair this PG?
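>
> (Roughly speaking, with osd.16 stopped, the copy amounts to something like
>
> rsync -a nde32:/var/lib/ceph/osd/ceph-21/current/1.3d3_head/ \
>       /var/lib/ceph/osd/ceph-16/current/1.3d3_head/
>
> followed by starting the three OSDs again.)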
>
> Another question about OSD replacement: why didn't the cluster switch
> the primary OSD for all affected PGs when the old OSDs went down? If
> this had been a real disk failure, I would have doubts about a full
> recovery. Or should I have deleted that PG instead of re-activating the
> old OSDs? I'm not sure what the best practice would be in this case.
>
> Any help is appreciated!
>
> Regards,
> Eugen
>
>




--
Eugen Block                             voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg                         e-mail  : eblock@xxxxxx

        Vorsitzende des Aufsichtsrates: Angelika Mozdzen
          Sitz und Registergericht: Hamburg, HRB 90934
                  Vorstand: Jens-U. Mozdzen
                   USt-IdNr. DE 814 013 983

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


