On Tue, 20 Dec 2016, Willem Jan Withagen wrote:
> On 20-12-2016 16:23, Sage Weil wrote:
> > On Tue, 20 Dec 2016, Willem Jan Withagen wrote:
> >> On 20-12-2016 11:21, Willem Jan Withagen wrote:
> >>> Hi,
> >>>
> >>> I've been banging my head against the wall for some time now.
> >>> But rebinding OSD.0 (in cephtool-test-mon.sh) does not quite work.
> >>>
> >>> When rebinding it connects to the ports of OSD.1 because those ports are
> >>> the first not in the avoid_list. That should be refused since these
> >>> sockets belong to a different process.
> >>> UNLESS SO_REUSEPORT is set:
> >>>   SO_REUSEPORT allows completely duplicate bindings by multiple processes
> >>>   if they all set SO_REUSEPORT before binding the port. This option
> >>>   permits multiple instances of a program to each receive UDP/IP
> >>>   multicast or broadcast datagrams destined for the bound port.
> >>>
> >>> That seems to be what happens.
> >>> Output from sockstat in this state:
> >>> wjw  ceph-osd-0  43305  14  tcp4  *:6800          *:*
> >>> wjw  ceph-osd-0  43305  15  tcp4  127.0.0.1:6804  *:*
> >>> wjw  ceph-osd-0  43305  16  tcp4  127.0.0.1:6805  *:*
> >>> wjw  ceph-osd-0  43305  45  tcp4  127.0.0.1:6806  *:*
> >>> wjw  ceph-osd-1  43318  14  tcp4  *:6804          *:*
> >>> wjw  ceph-osd-1  43318  15  tcp4  *:6805          *:*
> >>> wjw  ceph-osd-1  43318  16  tcp4  *:6806          *:*
> >>> wjw  ceph-osd-1  43318  17  tcp4  *:6807          *:*
> >>>
> >>> Which clearly demonstrates the mess.
> >>> However, that option is not set anywhere in the Ceph code, nor is it a
> >>> setting that "just" gets set.
> >>>
> >>> Any suggestions on where to look for this option getting set in an
> >>> incidental/buggy way would be much appreciated.
> >>> Or a suggestion on how to easily debug this.
> >>
> >> Right,
> >>
> >> Compatibility in this area is rather thin. :)
> >>
> >> For the question skip to the end.
> >>
> >> So I'm going to need some functional description, to see if I can get it
> >> right.
> >>
> >> The OSD starts and builds a few messengers with SO_REUSEADDR on the socket.
> >> On Linux, used ports are reported as in use, as they are on FreeBSD
> >> during startup. Ports are nicely iterated through
> >> and sequential ports are selected.
> >> So that is how it should be.
> >>
> >> Now when the OSD has gone down and comes up, it reports:
> >>     log_channel(cluster) log [WRN] : map e18 wrongly marked me down
> >> on ./src/osd/OSD.cc:7120
> >>
> >> Then it starts rebinding its messenger connections:
> >>     int r = cluster_messenger->rebind(avoid_ports)
> >> on ./src/osd/OSD.cc:7192.
> >> It calls shutdown_connections() to shut down all of its connections.
> >>
> >> Somewhere down the line SO_REUSEADDR is set again on the socket and the
> >> socket is bound.
> >> - Linux grabs the next available ports at the end, because its own
> >>   channels are to be avoided and the rest are taken.
> >>
> >> - On FreeBSD the first available port is taken. If that is 6800,
> >>   then that is taken, even if the socket is owned by a different
> >>   process, which (per the man page) would require SO_REUSEPORT.
> >>
> >> If I disable SO_REUSEADDR in NetHandler::create_socket()
> >> ====
> >>   /* Make sure connection-intensive things like the benchmark
> >>    * will be able to close/open sockets a zillion of times */
> >>   if (reuse_addr) {
> >>     if (::setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on,sizeof(on))==-1){
> >>       lderr(cct) << __func__ << " setsockopt SO_REUSEADDR failed: "
> >>                  << strerror(errno) << dendl;
> >>       close(s);
> >>       return -errno;
> >>     }
> >>   }
> >> ====
> >> Then things start to work "as expected" and a port is refused when it
> >> already has a listener bound to it.
> >>
> >> Doing this has the disadvantage that it is not possible to immediately
> >> kill and restart the OSD, because the ports are not yet released in the
> >> netstat table.... But that is a manageable issue, and that time can be
> >> shortened by setting a sysctl.
> >>
> >> So the question is:
> >> - how much rebinding is required.....
> >
> > I think it's just for tests. My recollection is that we did this just
> > because we can run out of ports since we can't reuse one until the tcp
> > finwait2 (or whatever) timeout expires.
> >
> >> - And why do we set SO_REUSEADDR if we are going to add the ports to
> >>   avoid_ports, so that a completely new port is required anyway?
> >
> > I suspect it's safe to drop the option if the Linux vs FreeBSD semantics
> > are in fact different.
>
> When I exclude the SO_REUSEADDR my Jenkins goes back to normal.
> Will submit a PR.

Please #ifdef it so it's only excluded for FreeBSD.

Thanks!
sage
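For reference, the #ifdef requested above might look roughly like the sketch below. This is only an illustration under assumptions, not the actual NetHandler::create_socket() and not the PR that was submitted: create_tcp_socket() and has_reuse_addr() are hypothetical stand-alone helpers, and the Ceph logging (lderr/dendl) is left out so the snippet compiles on its own. It also saves errno before close(), since close() may overwrite it.

```cpp
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper for illustration only (not Ceph's NetHandler).
// Returns a TCP socket fd on success, or -errno on failure.
int create_tcp_socket(bool reuse_addr)
{
  int s = ::socket(AF_INET, SOCK_STREAM, 0);
  if (s == -1)
    return -errno;

#if !defined(__FreeBSD__)
  // Everywhere except FreeBSD: keep SO_REUSEADDR so that a daemon that is
  // killed and restarted can bind its ports again right away.
  if (reuse_addr) {
    int on = 1;
    if (::setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1) {
      int err = errno;  // save errno: close() may overwrite it
      ::close(s);
      return -err;
    }
  }
#else
  // FreeBSD: skip the option, per the rebind behaviour reported in this thread.
  (void)reuse_addr;
#endif
  return s;
}

// Hypothetical check: reports whether SO_REUSEADDR is enabled on fd.
bool has_reuse_addr(int fd)
{
  int val = 0;
  socklen_t len = sizeof(val);
  if (::getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, &len) == -1)
    return false;
  return val != 0;
}
```

With something along these lines, Linux behaviour stays as it is today, while on FreeBSD the bind() during rebind would be expected to fail for a port that already has a listener, so the port iteration moves on to the next candidate.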