On 20-12-2016 16:23, Sage Weil wrote:
> On Tue, 20 Dec 2016, Willem Jan Withagen wrote:
>> On 20-12-2016 11:21, Willem Jan Withagen wrote:
>>> Hi,
>>>
>>> I've been banging my head against the wall for some time now.
>>> But rebinding OSD.0 (in cephtool-test-mon.sh) does not quite work.
>>>
>>> When rebinding it binds to the ports of OSD.1 because those ports are
>>> the first not in the avoid_list. That should be refused since these
>>> sockets belong to a different process.
>>> UNLESS SO_REUSEPORT is set:
>>>   SO_REUSEPORT allows completely duplicate bindings by multiple processes
>>>   if they all set SO_REUSEPORT before binding the port. This option
>>>   permits multiple instances of a program to each receive UDP/IP
>>>   multicast or broadcast datagrams destined for the bound port.
>>>
>>> Which seems to be what happens.
>>> Output from sockstat in this state:
>>> wjw  ceph-osd-0  43305  14  tcp4  *:6800          *:*
>>> wjw  ceph-osd-0  43305  15  tcp4  127.0.0.1:6804  *:*
>>> wjw  ceph-osd-0  43305  16  tcp4  127.0.0.1:6805  *:*
>>> wjw  ceph-osd-0  43305  45  tcp4  127.0.0.1:6806  *:*
>>> wjw  ceph-osd-1  43318  14  tcp4  *:6804          *:*
>>> wjw  ceph-osd-1  43318  15  tcp4  *:6805          *:*
>>> wjw  ceph-osd-1  43318  16  tcp4  *:6806          *:*
>>> wjw  ceph-osd-1  43318  17  tcp4  *:6807          *:*
>>>
>>> Which clearly demonstrates the mess.
>>> However, that option is nowhere set in the Ceph code, nor is it a
>>> setting that "just" gets set.
>>>
>>> Any suggestions where to look for this option getting set in an
>>> incidental/buggy way would be much appreciated.
>>> Or a suggestion on how to easily debug this.
>>
>> Right,
>>
>> Compatibility in this area is rather thin. :)
>>
>> For the question skip to the end.
>>
>> So I'm going to need some functional description, to see if I can get it
>> right.
>>
>> The OSD starts and builds a few messengers with SO_REUSEADDR on the socket.
>> On Linux, used ports are reported as in use.
>> The same holds on FreeBSD during startup. Ports are nicely iterated through
>> and sequential ports are selected.
>> So that is how it should be.
>>
>> Now when the OSD has gone down and comes up, it reports:
>>   log_channel(cluster) log [WRN] : map e18 wrongly marked me down
>> on ./src/osd/OSD.cc:7120
>>
>> Then it starts rebinding on its messenger connections:
>>   int r = cluster_messenger->rebind(avoid_ports)
>> on ./src/osd/OSD.cc:7192.
>> It calls shutdown_connections() to shut down all of its connections.
>>
>> Somewhere down the line SO_REUSEADDR is set again on the socket and the
>> socket is bound.
>> - Linux grabs the next available ports at the end, because its own
>>   channels are to be avoided and the rest is taken.
>>
>> - On FreeBSD the first port available is taken. If that is 6800,
>>   then that is taken. Even if the socket is owned by a different
>>   process. Which (per man-page) would require SO_REUSEPORT.
>>
>> If I disable SO_REUSEADDR in NetHandler::create_socket()
>> ====
>>   /* Make sure connection-intensive things like the benchmark
>>    * will be able to close/open sockets a zillion of times */
>>   if (reuse_addr) {
>>     if (::setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1) {
>>       lderr(cct) << __func__ << " setsockopt SO_REUSEADDR failed: "
>>                  << strerror(errno) << dendl;
>>       close(s);
>>       return -errno;
>>     }
>>   }
>> ====
>> Then things start to work "as expected" and a port is refused when it
>> already has a listener bound to it.
>>
>> Doing this has the disadvantage that it is not possible to immediately
>> kill and restart the OSD because the ports are not yet released in the
>> netstat table.... But that is a manageable issue, and that time can be
>> shortened by setting a sysctl.
>>
>> So the question is:
>> - how much rebinding is required.....
>
> I think it's just for tests. My recollection is that we did this just
> because we can run out of ports since we can't reuse one until the tcp
> finwait2 (or whatever) timeout expires.
>
>> - And why do we set SO_REUSEADDR if we are going to add the ports to
>>   avoid_ports.
>>   And thus a complete new port is required.
>
> I suspect it's safe to drop the option if the Linux vs FreeBSD semantics
> are in fact different.

That would be great, since it'll allow me to read up on this over Xmas.
And I'll commit a PR just excluding the code for now.
That way the FreeBSD jenkins will correctly start building master again.
(With the patches I have outstanding, which are applied separately.)

--WjW
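
P.S. For anyone wanting to poke at the bind behaviour outside of Ceph, here is a minimal sketch of a standalone probe. It is not Ceph code: bind_listener() and bound_port() are hypothetical helper names made up for this mail, and the only thing borrowed from the snippet quoted above is the setsockopt(SO_REUSEADDR) call. One deliberate difference is that errno is saved before close(), since close() may overwrite it.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Ceph): create a TCP socket, optionally set
// SO_REUSEADDR, bind it to addr:port and start listening.
// Returns the listening fd on success, or -errno on failure.
int bind_listener(const char* addr, uint16_t port, bool reuse_addr) {
  int s = ::socket(AF_INET, SOCK_STREAM, 0);
  if (s < 0)
    return -errno;

  if (reuse_addr) {
    int on = 1;
    if (::setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1) {
      int err = errno;  // save errno before close() can overwrite it
      ::close(s);
      return -err;
    }
  }

  sockaddr_in sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sin_family = AF_INET;
  sa.sin_port = htons(port);
  if (::inet_pton(AF_INET, addr, &sa.sin_addr) != 1) {
    ::close(s);
    return -EINVAL;  // addr was not a valid dotted-quad IPv4 address
  }

  if (::bind(s, reinterpret_cast<sockaddr*>(&sa), sizeof(sa)) == -1 ||
      ::listen(s, 16) == -1) {
    int err = errno;
    ::close(s);
    return -err;
  }
  return s;
}

// Port number the kernel assigned to a bound socket (handy after binding
// port 0). Returns 0 if getsockname() fails.
uint16_t bound_port(int fd) {
  assert(fd >= 0);
  sockaddr_in sa;
  socklen_t len = sizeof(sa);
  if (::getsockname(fd, reinterpret_cast<sockaddr*>(&sa), &len) == -1)
    return 0;
  return ntohs(sa.sin_port);
}
```

To mimic the sockstat output above, call bind_listener("0.0.0.0", 0, true), read the port back with bound_port(), and then call bind_listener("127.0.0.1", port, true) while the first fd is still open. Going by what I see in this thread, I would expect the second call to hand back a valid fd on FreeBSD and -EADDRINUSE on Linux, but treat that as something to verify on your own box rather than a given.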