PGs stuck down

Hi All,

 

We had a fiber cut tonight between two data centers, and one of our Ceph
clusters didn't handle it well :( We ended up with 98% of PGs down.
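
(For context, this is roughly how the damage showed up; these are just the
standard status commands, nothing cluster-specific:)

# overall cluster health and a summary of PG states
ceph -s
ceph health detail
# list the PGs that are stuck inactive
ceph pg dump_stuck inactive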

 

This setup has two data centers defined, with four copies spread across both
(size = 4) and a min_size of 1. We have one mon/mgr in each DC, with a third
in a separate data center connected to each of the other two by VPN.
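
(The replication settings and mon quorum can be double-checked like this;
"ourpool" is just a placeholder for the real pool name:)

# confirm the pool's replication settings ("ourpool" is a placeholder)
ceph osd pool get ourpool size
ceph osd pool get ourpool min_size
# confirm which mons are currently in quorum
ceph mon stat
ceph quorum_status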

 

When I ran a pg query on the PGs that were stuck, it said they were blocked
from coming up because they couldn't contact two of the OSDs (the ones located
in the data center we could no longer reach), but the other two OSDs were fine.
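
(This is the sort of query I was running; the PG ID below is just an example,
and the exact field names may differ between releases:)

# query one of the stuck PGs (the ID is only an example)
ceph pg 1.2f query
# in the output, the "recovery_state" section explains why peering is blocked,
# e.g. which down OSDs the PG still wants to probe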

 

I'm at a loss, because this is exactly the failure we thought we had set the
cluster up to survive. With size = 4 and min_size = 1, my understanding was
that it would keep serving I/O without a problem? :(
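
(To sanity-check that each stuck PG really should have had two surviving
copies, the PG mapping can be compared against the OSD tree; again the PG ID
is only an example:)

# show the up/acting OSD sets for a PG
ceph pg map 1.2f
# cross-reference those OSDs against the host/datacenter layout
ceph osd tree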

 

The CRUSH map is below. If anyone has any ideas, I would sincerely appreciate
it :)

 

Thanks!

Dale

 

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class ssd
device 29 osd.29 class ssd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd
device 34 osd.34 class ssd
device 35 osd.35 class ssd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd
device 40 osd.40 class ssd
device 41 osd.41 class ssd
device 42 osd.42 class ssd
device 43 osd.43 class ssd
device 44 osd.44 class ssd
device 45 osd.45 class ssd
device 46 osd.46 class ssd
device 47 osd.47 class ssd
device 49 osd.49 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host Pnode01 {
        id -8           # do not change unnecessarily
        id -9 class ssd         # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
host node01 {
        id -2           # do not change unnecessarily
        id -15 class ssd                # do not change unnecessarily
        # weight 14.537
        alg straw2
        hash 0  # rjenkins1
        item osd.4 weight 1.817
        item osd.1 weight 1.817
        item osd.3 weight 1.817
        item osd.2 weight 1.817
        item osd.6 weight 1.817
        item osd.9 weight 1.817
        item osd.5 weight 1.817
        item osd.0 weight 1.818
}
host node02 {
        id -3           # do not change unnecessarily
        id -16 class ssd                # do not change unnecessarily
        # weight 14.536
        alg straw2
        hash 0  # rjenkins1
        item osd.10 weight 1.817
        item osd.11 weight 1.817
        item osd.12 weight 1.817
        item osd.13 weight 1.817
        item osd.14 weight 1.817
        item osd.15 weight 1.817
        item osd.16 weight 1.817
        item osd.19 weight 1.817
}
host node03 {
        id -4           # do not change unnecessarily
        id -17 class ssd                # do not change unnecessarily
        # weight 14.536
        alg straw2
        hash 0  # rjenkins1
        item osd.20 weight 1.817
        item osd.21 weight 1.817
        item osd.22 weight 1.817
        item osd.23 weight 1.817
        item osd.25 weight 1.817
        item osd.26 weight 1.817
        item osd.29 weight 1.817
        item osd.24 weight 1.817
}
datacenter EDM1 {
        id -11          # do not change unnecessarily
        id -14 class ssd                # do not change unnecessarily
        # weight 43.609
        alg straw
        hash 0  # rjenkins1
        item node01 weight 14.537
        item node02 weight 14.536
        item node03 weight 14.536
}
host node04 {
        id -5           # do not change unnecessarily
        id -18 class ssd                # do not change unnecessarily
        # weight 14.536
        alg straw2
        hash 0  # rjenkins1
        item osd.30 weight 1.817
        item osd.31 weight 1.817
        item osd.32 weight 1.817
        item osd.33 weight 1.817
        item osd.34 weight 1.817
        item osd.35 weight 1.817
        item osd.36 weight 1.817
        item osd.39 weight 1.817
}
host node05 {
        id -6           # do not change unnecessarily
        id -19 class ssd                # do not change unnecessarily
        # weight 14.536
        alg straw2
        hash 0  # rjenkins1
        item osd.40 weight 1.817
        item osd.41 weight 1.817
        item osd.42 weight 1.817
        item osd.43 weight 1.817
        item osd.44 weight 1.817
        item osd.45 weight 1.817
        item osd.46 weight 1.817
        item osd.49 weight 1.817
}
host node06 {
        id -7           # do not change unnecessarily
        id -20 class ssd                # do not change unnecessarily
        # weight 16.353
        alg straw2
        hash 0  # rjenkins1
        item osd.47 weight 1.817
        item osd.37 weight 1.817
        item osd.27 weight 1.817
        item osd.38 weight 1.817
        item osd.7 weight 1.817
        item osd.28 weight 1.817
        item osd.8 weight 1.817
        item osd.17 weight 1.817
        item osd.18 weight 1.817
}
datacenter EDM3 {
        id -12          # do not change unnecessarily
        id -13 class ssd                # do not change unnecessarily
        # weight 45.425
        alg straw
        hash 0  # rjenkins1
        item node04 weight 14.536
        item node05 weight 14.536
        item node06 weight 16.353
}
datacenter EDM2 {
        id -10          # do not change unnecessarily
        id -22 class ssd                # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
root default {
        id -1           # do not change unnecessarily
        id -21 class ssd                # do not change unnecessarily
        # weight 89.034
        alg straw2
        hash 0  # rjenkins1
        item Pnode01 weight 0.000
        item EDM1 weight 43.609
        item EDM3 weight 45.425
        item EDM2 weight 0.000
}

# rules
rule replicated_ruleset {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}
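
(If anyone wants to reproduce the rule's behaviour offline, this is roughly
how the mappings can be simulated with crushtool; the file names below are
just placeholders:)

# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# simulate rule 0 with 4 replicas and show which OSDs each input maps to
crushtool -i crushmap.bin --test --rule 0 --num-rep 4 --show-mappings
crushtool -i crushmap.bin --test --rule 0 --num-rep 4 --show-bad-mappings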
