Re: Flapping OSDs

The issue is now fixed. It turns out I had unnecessary iptables rules; I flushed and deleted them all, restarted the OSDs, and now they are running normally.
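
For the record, "flushed and deleted them all" boils down to something like the following (a rough sketch; the longer-term alternative, if a firewall is still wanted, is to explicitly allow the Ceph ports, which by default are 6789/tcp for the monitors and 6800-7300/tcp for the OSDs):

---
# Inspect the current rules first
iptables -L -n -v

# Flush all rules and delete any user-defined chains
iptables -F
iptables -X

# Alternative, if the firewall must stay: allow Ceph traffic explicitly
# (assumes the default ports: 6789/tcp for mons, 6800-7300/tcp for OSDs)
iptables -A INPUT -p tcp --dport 6789 -j ACCEPT
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT
---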



Regards,

Vladimir FS Blando
Cloud Operations Manager

On Fri, Apr 7, 2017 at 1:17 PM, Vlad Blando <vblando@xxxxxxxxxxxxx> wrote:
Hi Brian,

Will check on that also.



On Mon, Apr 3, 2017 at 4:53 PM, Brian : <brians@xxxxxxxx> wrote:
Hi Vlad

Is there anything in syslog on any of the hosts when this happens?

I had a similar issue with a single node recently; it was caused by a firmware bug on a single SSD. The bug would cause the controller to reset, and the OSDs on that node would flap as a result.

I flashed the SSD with new firmware and the issue hasn't come up since.
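
If it's the same kind of problem, the controller resets should be visible on the affected host around the time the OSDs drop. Roughly the places I'd look (paths are the usual CentOS/Ceph defaults, osd.34 just as an example):

---
# Kernel messages: controller resets, disk timeouts, link errors
dmesg -T | grep -Ei 'reset|timeout|error'
# Syslog on a RHEL/CentOS host
grep -Ei 'reset|timeout' /var/log/messages
# The OSD's own log around the time it was marked down
less /var/log/ceph/ceph-osd.34.log
---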

Brian


On Mon, Apr 3, 2017 at 8:03 AM, Vlad Blando <vblando@xxxxxxxxxxxxx> wrote:
Most of the time it's random and only one OSD at a time, but I also see 2-3 down at the same time.

The network seems fine and the bond seems fine; I just don't know where to look anymore. My other option is to rebuild the server, but that's a last resort I'd rather avoid if possible.



On Mon, Apr 3, 2017 at 2:24 PM, Maxime Guyot <Maxime.Guyot@xxxxxxxxx> wrote:

Hi Vlad,

 

I am curious: are those OSDs all flapping at once? If a single host is affected, I would look at network connectivity (bottlenecks and misconfigured bonds can generate strange situations), the storage controller, and firmware.
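
A few quick things to check on the affected host (interface names are only examples; the 172.50.50.x address comes from your cluster_network):

---
cat /proc/net/bonding/bond0          # bond mode, slave state, link failure counts
ip -s link show bond0                # error/drop counters on the bond
ethtool <slave-nic> | grep -i speed  # negotiated speed on each slave
# Reachability to a peer OSD host over the cluster network; the large
# don't-fragment ping only matters if jumbo frames are in use
ping -c 3 172.50.50.101
ping -c 3 -M do -s 8972 172.50.50.101
---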

 

Cheers,

Maxime

 

From: ceph-users <ceph-users-bounces@xxxxxxxxxx.com> on behalf of Vlad Blando <vblando@xxxxxxxxxxxxx>
Date: Sunday 2 April 2017 16:28
To: ceph-users <ceph-users@xxxxxxxx>
Subject: Flapping OSDs

 

Hi,

 

One of my Ceph nodes has flapping OSDs. The network between the nodes is 10GBase-T and I don't see anything wrong with it, but these OSDs keep going up and down.

 

[root@avatar0-ceph4 ~]# ceph osd tree
# id    weight  type name       up/down reweight
-1      174.7   root default
-2      29.12           host avatar0-ceph2
16      3.64                    osd.16  up      1
17      3.64                    osd.17  up      1
18      3.64                    osd.18  up      1
19      3.64                    osd.19  up      1
20      3.64                    osd.20  up      1
21      3.64                    osd.21  up      1
22      3.64                    osd.22  up      1
23      3.64                    osd.23  up      1
-3      29.12           host avatar0-ceph0
0       3.64                    osd.0   up      1
1       3.64                    osd.1   up      1
2       3.64                    osd.2   up      1
3       3.64                    osd.3   up      1
4       3.64                    osd.4   up      1
5       3.64                    osd.5   up      1
6       3.64                    osd.6   up      1
7       3.64                    osd.7   up      1
-4      29.12           host avatar0-ceph1
8       3.64                    osd.8   up      1
9       3.64                    osd.9   up      1
10      3.64                    osd.10  up      1
11      3.64                    osd.11  up      1
12      3.64                    osd.12  up      1
13      3.64                    osd.13  up      1
14      3.64                    osd.14  up      1
15      3.64                    osd.15  up      1
-5      29.12           host avatar0-ceph3
24      3.64                    osd.24  up      1
25      3.64                    osd.25  up      1
26      3.64                    osd.26  up      1
27      3.64                    osd.27  up      1
28      3.64                    osd.28  up      1
29      3.64                    osd.29  up      1
30      3.64                    osd.30  up      1
31      3.64                    osd.31  up      1
-6      29.12           host avatar0-ceph4
32      3.64                    osd.32  up      1
33      3.64                    osd.33  up      1
34      3.64                    osd.34  up      1
35      3.64                    osd.35  up      1
36      3.64                    osd.36  up      1
37      3.64                    osd.37  up      1
38      3.64                    osd.38  up      1
39      3.64                    osd.39  up      1
-7      29.12           host avatar0-ceph5
40      3.64                    osd.40  up      1
41      3.64                    osd.41  up      1
42      3.64                    osd.42  up      1
43      3.64                    osd.43  up      1
44      3.64                    osd.44  up      1
45      3.64                    osd.45  up      1
46      3.64                    osd.46  up      1
47      3.64                    osd.47  up      1
[root@avatar0-ceph4 ~]#

 

 

Here is my ceph.conf

---

[root@avatar0-ceph4 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 2f0d1928-2ee5-4731-a259-64c0dc16110a
mon_initial_members = avatar0-ceph0, avatar0-ceph1, avatar0-ceph2
mon_host = 172.40.40.100,172.40.40.101,172.40.40.102
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
cluster_network = 172.50.50.0/24
public_network = 172.40.40.0/24
max_open_files = 131072
mon_clock_drift_allowed = .15
mon_clock_drift_warn_backoff = 30
mon_osd_down_out_interval = 300
mon_osd_report_timeout = 300
mon_osd_min_down_reporters = 3

[osd]
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 8
osd_max_backfills = 1
osd_recovery_op_priority = 1
osd_recovery_max_active = 1

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true

---
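
For reference, the settings most relevant to flapping in that config are mon_osd_min_down_reporters, mon_osd_report_timeout and mon_osd_down_out_interval. The values the running daemons actually use can be checked via the admin socket on the host that runs the daemon (osd.34 and mon avatar0-ceph0 just as examples):

---
# On avatar0-ceph4:
ceph daemon osd.34 config show | grep -E 'heartbeat|down'
# On avatar0-ceph0:
ceph daemon mon.avatar0-ceph0 config get mon_osd_min_down_reporters
---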

 

Here's a log snippet from osd.34:

---

2017-04-02 22:26:10.371282 7f1064eab700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.101:6808/190698 pipe(0x156a1b80 sd=124 :46536 s=2 pgs=966 cs=1 l=0 c=0x13ae19c0).fault with nothing to send, going to standby
2017-04-02 22:26:10.371360 7f106ed5c700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.104:6822/181109 pipe(0x1018c2c0 sd=75 :34196 s=2 pgs=1022 cs=1 l=0 c=0x1098fa20).fault with nothing to send, going to standby
2017-04-02 22:26:10.371393 7f1067ad2700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.103:6806/118813 pipe(0x166b5c80 sd=34 :34156 s=2 pgs=1041 cs=1 l=0 c=0x10c4bfa0).fault with nothing to send, going to standby
2017-04-02 22:26:10.371739 7f107137b700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.103:6815/121286 pipe(0xd7eb8c0 sd=192 :43966 s=2 pgs=1042 cs=1 l=0 c=0x12eb7e40).fault with nothing to send, going to standby
2017-04-02 22:26:10.375016 7f1068ff7700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.104:6812/183825 pipe(0x10f70c00 sd=61 :34442 s=2 pgs=1025 cs=1 l=0 c=0xb10be40).fault with nothing to send, going to standby
2017-04-02 22:26:10.375221 7f107157d700  0 -- 172.50.50.105:6816/117130897 >> 172.50.50.102:6806/66401 pipe(0x10ba78c0 sd=191 :46312 s=2 pgs=988 cs=1 l=0 c=0x6f8c420).fault with nothing to send, going to standby
2017-04-02 22:26:11.041747 7f10885ab700  0 log_channel(default) log [WRN] : map e85725 wrongly marked me down
2017-04-02 22:26:16.427858 7f1062892700  0 -- 172.50.50.105:6807/118130897 >> 172.50.50.105:6811/116133701 pipe(0xd4cb180 sd=257 :6807 s=0 pgs=0 cs=0 l=0 c=0x13ae07e0).accept connect_seq 0 vs existing 0 state connecting
2017-04-02 22:26:16.427897 7f1062993700  0 -- 172.50.50.105:6807/118130897 >> 172.50.50.105:6811/116133701 pipe(0xfb50680 sd=76 :56374 s=4 pgs=0 cs=0 l=0 c=0x174255a0).connect got RESETSESSION but no longer connecting

---
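
If I read it correctly, the "wrongly marked me down" line usually means the monitors marked osd.34 down, typically because peer OSDs reported failed heartbeats, while the daemon itself was still alive. The usual places to see who reports it and when (standard commands, default log paths):

---
# From any admin node: current health and live cluster log
ceph health detail
ceph -w
# On a monitor host: cluster-log entries involving osd.34
grep 'osd.34' /var/log/ceph/ceph.log
# On avatar0-ceph4: heartbeat failures recorded by the OSD itself
grep 'heartbeat_check' /var/log/ceph/ceph-osd.34.log
---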






_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
