Turns out this was due to a switch misconfiguration on the cluster network. I use jumbo frames, and the switch ports for the new servers' connections were not configured with the correct MTU. Some traffic could still flow, but eventually the servers tried to send frames larger than the switch allowed, and that prevented OSDs from receiving osd_ping messages.

Sorry for the noise!
Josh

On Thu, Mar 14, 2019 at 3:45 PM Josh Haft <paccrap@xxxxxxxxx> wrote:
>
> Hello fellow Cephers,
>
> My 12.2.2 cluster is pretty full so I've been adding new nodes/OSDs.
> Last week I added two new nodes with 12 OSDs each and they are still
> backfilling. I have max_backfills tuned quite low across the board to
> minimize client impact. Yesterday I brought two more nodes online each
> with 12 OSDs and added them to the crushmap under a staging root,
> planning to add those to root=default when the two from last week
> complete backfilling. When the OSD processes came up they all did
> what I describe below and since it only takes two OSDs on different
> hosts... the mons started marking existing OSDs down. So I backed that
> out and am now just working with a single OSD on one of the new nodes
> until I can figure this out.
>
> When the OSD process starts up it's listening on ports 6800 and 6801
> on both the cluster and public interfaces. It successfully gets the
> current osdmap from a monitor and chooses 10 OSDs to peer with, all of
> which fail.
>
> It doesn't appear to be a basic networking issue; I turned up debug
> osd and ms to 20 and based on the following it looks like a successful
> ping/reply with the OSD peer (osd.0), but after a while the log says
> it's never heard from this OSD.
>
> 2019-03-14 14:17:42.350902 7fe698776700 10 osd.403 103451
> _add_heartbeat_peer: new peer osd.0 10.8.78.23:6814/8498484
> 10.8.76.23:6805/8498484
> 2019-03-14 14:17:44.165460 7fe68df61700 1 -- 10.8.76.48:0/67279 -->
> 10.8.76.23:6805/8498484 -- osd_ping(ping e103451 stamp 2019-03-14
> 14:17:44.165415) v4 -- 0x55844222aa00 con 0
> 2019-03-14 14:17:44.165467 7fe68df61700 20 -- 10.8.76.48:0/67279 >>
> 10.8.76.23:6805/8498484 conn(0x558442368000 :-1 s=STATE_OPEN pgs=2349
> cs=1 l=1).prepare_send_message m osd_ping(ping e103451 stamp
> 2019-03-14 14:17:44.165415) v4
> 2019-03-14 14:17:44.165471 7fe68df61700 20 -- 10.8.76.48:0/67279 >>
> 10.8.76.23:6805/8498484 conn(0x558442368000 :-1 s=STATE_OPEN pgs=2349
> cs=1 l=1).prepare_send_message encoding features 2305244844532236283
> 0x55844222aa00 osd_ping(ping e103451 stamp 2019-03-14 14:17:44.165415)
> v4
> 2019-03-14 14:17:44.165691 7fe6a574e700 5 -- 10.8.76.48:0/67279 >>
> 10.8.76.23:6805/8498484 conn(0x558442368000 :-1
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2349 cs=1 l=1). rx
> osd.0 seq 1 0x55844206ba00 osd_ping(ping_reply e103451 stamp
> 2019-03-14 14:17:44.165415) v4
> 2019-03-14 14:17:44.165697 7fe6a574e700 1 -- 10.8.76.48:0/67279 <==
> osd.0 10.8.76.23:6805/8498484 1 ==== osd_ping(ping_reply e103451 stamp
> 2019-03-14 14:17:44.165415) v4 ==== 2004+0+0 (4204681659 0 0)
> 0x55844206ba00 con 0x558442368000
>
> ... seq 2-6...
>
> 2019-03-14 14:17:57.468338 7fe68df61700 1 -- 10.8.76.48:0/67279 -->
> 10.8.76.23:6805/8498484 -- osd_ping(ping e103451 stamp 2019-03-14
> 14:17:57.468301) v4 -- 0x5584422e2c00 con 0
> 2019-03-14 14:17:57.468343 7fe68df61700 20 -- 10.8.76.48:0/67279 >>
> 10.8.76.23:6805/8498484 conn(0x558442368000 :-1 s=STATE_OPEN pgs=2349
> cs=1 l=1).prepare_send_message m osd_ping(ping e103451 stamp
> 2019-03-14 14:17:57.468301) v4
> 2019-03-14 14:17:57.468348 7fe68df61700 20 -- 10.8.76.48:0/67279 >>
> 10.8.76.23:6805/8498484 conn(0x558442368000 :-1 s=STATE_OPEN pgs=2349
> cs=1 l=1).prepare_send_message encoding features 2305244844532236283
> 0x5584422e2c00 osd_ping(ping e103451 stamp 2019-03-14 14:17:57.468301)
> v4
> 2019-03-14 14:17:57.468554 7fe6a574e700 5 -- 10.8.76.48:0/67279 >>
> 10.8.76.23:6805/8498484 conn(0x558442368000 :-1
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2349 cs=1 l=1). rx
> osd.0 seq 6 0x55844222a600 osd_ping(ping_reply e103451 stamp
> 2019-03-14 14:17:57.468301) v4
> 2019-03-14 14:17:57.468561 7fe6a574e700 1 -- 10.8.76.48:0/67279 <==
> osd.0 10.8.76.23:6805/8498484 6 ==== osd_ping(ping_reply e103451 stamp
> 2019-03-14 14:17:57.468301) v4 ==== 2004+0+0 (306125004 0 0)
> 0x55844222a600 con 0x558442368000
> 2019-03-14 14:18:04.266809 7fe6a1f89700 -1 osd.403 103451
> heartbeat_check: no reply from 10.8.76.23:6805 osd.0 ever on either
> front or back, first ping sent 2019-03-14 14:17:44.165415 (cutoff
> 2019-03-14 14:17:44.266808)
> 2019-03-14 14:18:05.267163 7fe6a1f89700 -1 osd.403 103451
> heartbeat_check: no reply from 10.8.76.23:6805 osd.0 ever on either
> front or back, first ping sent 2019-03-14 14:17:44.165415 (cutoff
> 2019-03-14 14:17:45.267163)
> 2019-03-14 14:18:06.267296 7fe6a1f89700 -1 osd.403 103451
> heartbeat_check: no reply from 10.8.76.23:6805 osd.0 ever on either
> front or back, first ping sent 2019-03-14 14:17:44.165415 (cutoff
> 2019-03-14 14:17:46.267295)
>
> This whole time other OSDs are marking 403 down so after the grace
> period expires the monitor takes it down.
> cluster [INF] osd.403 failed (root=staging,host=chhq-supcphsn34) (2
> reporters from different host after 25.757343 >= grace 21.560865)
>
> After about 45 seconds the OSD then stops listening on the
> aforementioned ports and binds to different ones. Is this normal?
> There's no obvious indication in the OSD log of why or when it does
> this.
> # date;ss -tlnp|grep 67279
> Thu Mar 14 14:18:27 CDT 2019
> LISTEN 0 128 10.8.78.40:6800 *:*
> users:(("ceph-osd",pid=67279,fd=14))
> LISTEN 0 128 10.8.76.48:6800 *:*
> users:(("ceph-osd",pid=67279,fd=13))
> LISTEN 0 128 10.8.76.48:6801 *:*
> users:(("ceph-osd",pid=67279,fd=16))
> LISTEN 0 128 10.8.78.40:6801 *:*
> users:(("ceph-osd",pid=67279,fd=15))
> # date;ss -tlnp|grep 67279
> Thu Mar 14 14:18:28 CDT 2019
> LISTEN 0 128 10.8.76.48:6800 *:*
> users:(("ceph-osd",pid=67279,fd=13))
> LISTEN 0 128 10.8.76.48:6802 *:*
> users:(("ceph-osd",pid=67279,fd=16))
> LISTEN 0 128 10.8.78.40:6802 *:*
> users:(("ceph-osd",pid=67279,fd=14))
> LISTEN 0 128 10.8.78.40:6803 *:*
> users:(("ceph-osd",pid=67279,fd=15))
>
> But the current osdmap still shows those old ports (probably because
> all communication from osd to mon has ceased until the next reportable
> event?)
> 103451
> osd.403 up in weight 1 up_from 103451 up_thru 0 down_at 103437
> last_clean_interval [103436,103442) 10.8.76.48:6800/67279
> 10.8.78.40:6800/67279 10.8.78.40:6801/67279 10.8.76.48:6801/67279
> exists,up f664f770-2b92-43c4-818c-db8b482515c4
> 103456
> osd.403 down in weight 1 up_from 103451 up_thru 0 down_at 103452
> last_clean_interval [103436,103442) 10.8.76.48:6800/67279
> 10.8.78.40:6800/67279 10.8.78.40:6801/67279 10.8.76.48:6801/67279
> exists f664f770-2b92-43c4-818c-db8b482515c4
>
> At this point it doesn't appear I can add any OSDs to my cluster. No
> existing OSDs have been flapping so I don't think network congestion
> or heartbeat traffic in general is an issue. I'm hoping I've
> overlooked something simple.
>
> Any help is appreciated!
> Thanks,
> Josh
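
For anyone who finds this thread with the same symptom, here is a rough sketch of how the MTU mismatch described at the top could be checked from an OSD host. It is not something taken from the thread itself: the function names are made up for illustration, it assumes Linux iputils `ping` (where `-M do` forbids fragmentation) and IPv4, and the 9000-byte MTU is only a common jumbo-frame value, not one stated above. The idea is to send pings that fill a whole frame and cannot be fragmented. If small pings work but these fail, something in the path has a smaller MTU.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; assumes iputils ping and IPv4.

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet
# of the given MTU: MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    local mtu=$1
    echo $(( mtu - 20 - 8 ))
}

# Send three full-size pings with Don't Fragment set.
# Returns non-zero if they do not get through.
check_path_mtu() {
    local peer=$1
    local mtu=${2:-9000}   # assumed jumbo MTU; use whatever your network is meant to carry
    ping -M do -c 3 -W 2 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}

# Example invocation (10.8.78.23 is one of osd.0's addresses in the log above):
#   check_path_mtu 10.8.78.23 9000 || echo "full-size frames are not getting through"
```

Running the check in both directions, and against both the public and cluster addresses of a peer, should narrow down which link is affected. `ip link show` on each host shows the MTU the interface itself is configured with, but the switch side has to be checked on the switch.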