Re: PGs stuck down

Hi Dale

Can you please post the ceph status output? I’m no expert, but I would make sure that the datacenter you intend to keep operating (while the connection gets re-established) has two active monitors. Thanks.
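
For reference, something like the following should show the overall state and which monitors are in quorum (standard ceph CLI; exact output varies by release):

    # overall cluster state, including mon quorum and PG summary
    ceph -s

    # monitor quorum details only
    ceph quorum_status --format json-pretty

    # the monmap, to confirm where each monitor lives
    ceph mon dump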

Yanko.


> On Nov 29, 2022, at 7:20 AM, Wolfpaw - Dale Corse <dale@xxxxxxxxxxx> wrote:
> 
> Hi All,
> 
> 
> 
> We had a fiber cut tonight between 2 data centers, and a Ceph cluster didn't
> handle it very well :( We ended up with 98% of PGs down.
> 
> 
> 
> This setup has 2 data centers defined, with 4 copies across both and a
> minimum size of 1. We have 1 mon/mgr in each DC, plus one in a 3rd data
> center connected to each of the other 2 by VPN.
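
Side note: the datacenter/host/OSD layout described here can also be checked at a glance with the standard tree view, which is handy to compare against the CRUSH map text further down:

    # CRUSH hierarchy with per-bucket weights and per-OSD up/down state
    ceph osd tree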
> 
> 
> 
> When I did a pg query on the PGs that were stuck, it said they were blocked
> from coming up because they couldn't contact 2 of the OSDs (located in the
> other data center, which was unreachable)... but the other 2 were fine.
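
For anyone wanting to repeat that check, the usual pattern is something like the following (the PG id 2.1f is just a placeholder):

    # list PGs stuck inactive
    ceph pg dump_stuck inactive

    # query one of them; the OSDs blocking peering normally show up
    # in the recovery_state section of the output
    ceph pg 2.1f query | jq '.recovery_state'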
> 
> 
> 
> I'm at a loss, because this is exactly the situation we thought we had set
> things up to prevent... with size = 4 and min_size = 1, I understood that it
> would continue without a problem? :(
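
The pool-level settings are easy to double-check, e.g. (the pool name 'rbd' is just an example):

    # replication settings for one pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # or all pools at once
    ceph osd pool ls detail

As far as I understand, min_size = 1 only lets a PG go active if it can still complete peering with the copies that remain reachable, so the CRUSH rule below matters as much as the pool settings.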
> 
> 
> 
> The CRUSH map is below... if anyone has any ideas, I would sincerely
> appreciate it :)
> 
> 
> 
> Thanks!
> 
> Dale
> 
> 
> 
> # begin crush map
> 
> tunable choose_local_tries 0
> 
> tunable choose_local_fallback_tries 0
> 
> tunable choose_total_tries 50
> 
> tunable chooseleaf_descend_once 1
> 
> tunable chooseleaf_vary_r 1
> 
> tunable straw_calc_version 1
> 
> 
> 
> # devices
> 
> device 0 osd.0 class ssd
> 
> device 1 osd.1 class ssd
> 
> device 2 osd.2 class ssd
> 
> device 3 osd.3 class ssd
> 
> device 4 osd.4 class ssd
> 
> device 5 osd.5 class ssd
> 
> device 6 osd.6 class ssd
> 
> device 7 osd.7 class ssd
> 
> device 8 osd.8 class ssd
> 
> device 9 osd.9 class ssd
> 
> device 10 osd.10 class ssd
> 
> device 11 osd.11 class ssd
> 
> device 12 osd.12 class ssd
> 
> device 13 osd.13 class ssd
> 
> device 14 osd.14 class ssd
> 
> device 15 osd.15 class ssd
> 
> device 16 osd.16 class ssd
> 
> device 17 osd.17 class ssd
> 
> device 18 osd.18 class ssd
> 
> device 19 osd.19 class ssd
> 
> device 20 osd.20 class ssd
> 
> device 21 osd.21 class ssd
> 
> device 22 osd.22 class ssd
> 
> device 23 osd.23 class ssd
> 
> device 24 osd.24 class ssd
> 
> device 25 osd.25 class ssd
> 
> device 26 osd.26 class ssd
> 
> device 27 osd.27 class ssd
> 
> device 28 osd.28 class ssd
> 
> device 29 osd.29 class ssd
> 
> device 30 osd.30 class ssd
> 
> device 31 osd.31 class ssd
> 
> device 32 osd.32 class ssd
> 
> device 33 osd.33 class ssd
> 
> device 34 osd.34 class ssd
> 
> device 35 osd.35 class ssd
> 
> device 36 osd.36 class ssd
> 
> device 37 osd.37 class ssd
> 
> device 38 osd.38 class ssd
> 
> device 39 osd.39 class ssd
> 
> device 40 osd.40 class ssd
> 
> device 41 osd.41 class ssd
> 
> device 42 osd.42 class ssd
> 
> device 43 osd.43 class ssd
> 
> device 44 osd.44 class ssd
> 
> device 45 osd.45 class ssd
> 
> device 46 osd.46 class ssd
> 
> device 47 osd.47 class ssd
> 
> device 49 osd.49 class ssd
> 
> 
> 
> # types
> 
> type 0 osd
> 
> type 1 host
> 
> type 2 chassis
> 
> type 3 rack
> 
> type 4 row
> 
> type 5 pdu
> 
> type 6 pod
> 
> type 7 room
> 
> type 8 datacenter
> 
> type 9 region
> 
> type 10 root
> 
> 
> 
> # buckets
> 
> host Pnode01 {
> 
>        id -8           # do not change unnecessarily
> 
>        id -9 class ssd         # do not change unnecessarily
> 
>        # weight 0.000
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
> }
> 
> host node01 {
> 
>        id -2           # do not change unnecessarily
> 
>        id -15 class ssd                # do not change unnecessarily
> 
>        # weight 14.537
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
>        item osd.4 weight 1.817
> 
>        item osd.1 weight 1.817
> 
>        item osd.3 weight 1.817
> 
>        item osd.2 weight 1.817
> 
>        item osd.6 weight 1.817
> 
>        item osd.9 weight 1.817
> 
>        item osd.5 weight 1.817
> 
>        item osd.0 weight 1.818
> 
> }
> 
> host node02 {
> 
>        id -3           # do not change unnecessarily
> 
>        id -16 class ssd                # do not change unnecessarily
> 
>        # weight 14.536
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
>        item osd.10 weight 1.817
> 
>        item osd.11 weight 1.817
> 
>        item osd.12 weight 1.817
> 
>        item osd.13 weight 1.817
> 
>        item osd.14 weight 1.817
> 
>        item osd.15 weight 1.817
> 
>        item osd.16 weight 1.817
> 
>        item osd.19 weight 1.817
> 
> }
> 
> host node03 {
> 
>        id -4           # do not change unnecessarily
> 
>        id -17 class ssd                # do not change unnecessarily
> 
>        # weight 14.536
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
>        item osd.20 weight 1.817
> 
>        item osd.21 weight 1.817
> 
>        item osd.22 weight 1.817
> 
>        item osd.23 weight 1.817
> 
>        item osd.25 weight 1.817
> 
>        item osd.26 weight 1.817
> 
>        item osd.29 weight 1.817
> 
>        item osd.24 weight 1.817
> 
> }
> 
> datacenter EDM1 {
> 
>        id -11          # do not change unnecessarily
> 
>        id -14 class ssd                # do not change unnecessarily
> 
>        # weight 43.609
> 
>        alg straw
> 
>        hash 0  # rjenkins1
> 
>        item node01 weight 14.537
> 
>        item node02 weight 14.536
> 
>        item node03 weight 14.536
> 
> }
> 
> host node04 {
> 
>        id -5           # do not change unnecessarily
> 
>        id -18 class ssd                # do not change unnecessarily
> 
>        # weight 14.536
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
>        item osd.30 weight 1.817
> 
>        item osd.31 weight 1.817
> 
>        item osd.32 weight 1.817
> 
>        item osd.33 weight 1.817
> 
>        item osd.34 weight 1.817
> 
>        item osd.35 weight 1.817
> 
>        item osd.36 weight 1.817
> 
>        item osd.39 weight 1.817
> 
> }
> 
> host node05 {
> 
>        id -6           # do not change unnecessarily
> 
>        id -19 class ssd                # do not change unnecessarily
> 
>        # weight 14.536
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
>        item osd.40 weight 1.817
> 
>        item osd.41 weight 1.817
> 
>        item osd.42 weight 1.817
> 
>        item osd.43 weight 1.817
> 
>        item osd.44 weight 1.817
> 
>        item osd.45 weight 1.817
> 
>        item osd.46 weight 1.817
> 
>        item osd.49 weight 1.817
> 
> }
> 
> host node06 {
> 
>        id -7           # do not change unnecessarily
> 
>        id -20 class ssd                # do not change unnecessarily
> 
>        # weight 16.353
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
>        item osd.47 weight 1.817
> 
>        item osd.37 weight 1.817
> 
>        item osd.27 weight 1.817
> 
>        item osd.38 weight 1.817
> 
>        item osd.7 weight 1.817
> 
>        item osd.28 weight 1.817
> 
>        item osd.8 weight 1.817
> 
>        item osd.17 weight 1.817
> 
>        item osd.18 weight 1.817
> 
> }
> 
> datacenter EDM3 {
> 
>        id -12          # do not change unnecessarily
> 
>        id -13 class ssd                # do not change unnecessarily
> 
>        # weight 45.425
> 
>        alg straw
> 
>        hash 0  # rjenkins1
> 
>        item node04 weight 14.536
> 
>        item node05 weight 14.536
> 
>        item node06 weight 16.353
> 
> }
> 
> datacenter EDM2 {
> 
>        id -10          # do not change unnecessarily
> 
>        id -22 class ssd                # do not change unnecessarily
> 
>        # weight 0.000
> 
>        alg straw
> 
>        hash 0  # rjenkins1
> 
> }
> 
> root default {
> 
>        id -1           # do not change unnecessarily
> 
>        id -21 class ssd                # do not change unnecessarily
> 
>        # weight 89.034
> 
>        alg straw2
> 
>        hash 0  # rjenkins1
> 
>        item Pnode01 weight 0.000
> 
>        item EDM1 weight 43.609
> 
>        item EDM3 weight 45.425
> 
>        item EDM2 weight 0.000
> 
> }
> 
> 
> 
> # rules
> 
> rule replicated_ruleset {
> 
>        id 0
> 
>        type replicated
> 
>        min_size 1
> 
>        max_size 10
> 
>        step take default
> 
>        step choose firstn 2 type datacenter
> 
>        step chooseleaf firstn 2 type host
> 
>        step emit
> 
> }
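
If it helps, this rule can be exercised offline with crushtool to see which OSDs it actually maps PGs to (the paths are just examples):

    # grab and decompile the current CRUSH map
    ceph osd getcrushmap -o /tmp/crushmap.bin
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

    # simulate placements for rule 0 with 4 replicas and print the mappings
    crushtool -i /tmp/crushmap.bin --test --rule 0 --num-rep 4 --show-mappings

crushtool --test also takes per-device weight overrides, which should make it possible to simulate one data center being unreachable and see which mappings survive.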
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



