Re: OSD rebind connects to ports of other OSDs

On 20-12-2016 16:23, Sage Weil wrote:
> On Tue, 20 Dec 2016, Willem Jan Withagen wrote:
>> On 20-12-2016 11:21, Willem Jan Withagen wrote:
>>> Hi,
>>>
>>> I've been banging my head against the wall for some time now.
>>> But rebinding OSD.0 (in cephtool-test-mon.sh) does not quite work.
>>>
>>> When rebinding it connects to the ports of OSD.1 because those ports are
>>> the first not in the avoid_list. That should be refused since these
>>> sockets belong to a different process.
>>> UNLESS SO_REUSEPORT is set:
>>>  SO_REUSEPORT allows completely duplicate bindings by multiple processes
>>>  if they all set SO_REUSEPORT before binding the port.  This option
>>>  permits multiple instances of a program to each receive UDP/IP
>>>  multicast or broadcast datagrams destined for the bound port.
>>>
>>> Which seems to be what happens.
>>> Output from sockstat in this state:
>>> wjw      ceph-osd-0   43305 14 tcp4   *:6800                *:*
>>> wjw      ceph-osd-0   43305 15 tcp4   127.0.0.1:6804        *:*
>>> wjw      ceph-osd-0   43305 16 tcp4   127.0.0.1:6805        *:*
>>> wjw      ceph-osd-0   43305 45 tcp4   127.0.0.1:6806        *:*
>>> wjw      ceph-osd-1   43318 14 tcp4   *:6804                *:*
>>> wjw      ceph-osd-1   43318 15 tcp4   *:6805                *:*
>>> wjw      ceph-osd-1   43318 16 tcp4   *:6806                *:*
>>> wjw      ceph-osd-1   43318 17 tcp4   *:6807                *:*
>>>
>>> Which clearly demonstrates the mess.
>>> However, that option is not set anywhere in the Ceph code, nor is it a
>>> setting that "just" gets set.
>>>
>>> Any suggestions where to look for this option to get set in an
>>> incidental/bug way would be much appreciated.
>>> Or a suggestion on how to easily debug this.
>>
>> Right,
>>
>> Compatibility in this area is rather thin. :)
>>
>> For the question skip to the end.
>>
>> So I'm going to need some functional description, to see if I can get it
>> right.
>>
>> The OSD starts and builds a few messengers with SO_REUSEADDR on the socket.
>>         On Linux, ports that are taken are reported as in use, and the
>>         same holds on FreeBSD during startup. Ports are iterated through
>>         nicely and sequential ports are selected.
>> So that is how it should be.
>>
>> Now when the osd has gone down and comes up, it reports:
>>   log_channel(cluster) log [WRN] : map e18 wrongly marked me down
>> on ./src/osd/OSD.cc:7120
>>
>> Then it starts rebinding on its messenger connections:
>>         int r = cluster_messenger->rebind(avoid_ports)
>> on ./src/osd/OSD.cc:7192.
>> It calls shutdown_connections() to shutdown all of its connections.
>>
>> Somewhere down the line SO_REUSEADDR is set again on the socket and the
>> socket is bound.
>>  - Linux grabs the next available ports at the end, because its own
>>    channels are to be avoided and the rest is taken.
>>
>>  - On FreeBSD the first port available is taken. If that is 6800,
>>    then that is taken, even if the socket is owned by a different
>>    process, which (per the man page) would require SO_REUSEPORT.
>>
>> If I disable SO_REUSEADDR in NetHandler::create_socket()
>> ====
>>   /* Make sure connection-intensive things like the benchmark
>>    * will be able to close/open sockets a zillion of times */
>>   if (reuse_addr) {
>>     if (::setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on,sizeof(on))==-1){
>>       lderr(cct) << __func__ << " setsockopt SO_REUSEADDR failed: "
>>                  << strerror(errno) << dendl;
>>       close(s);
>>       return -errno;
>>     }
>>   }
>> ====
>> Then things start to work "as expected" and a port is refused when it
>> already has a listener bound to it.
>>
>> Doing this has the disadvantage that it is not possible to immediately
>> kill and restart the OSD because the ports are not yet released in the
>> netstat table.... But that is a manageable issue, and that time can be
>> shortened by setting a sysctl.
>>
>> So the question is:
>>  - how much rebinding is required.....
> 
> I think it's just for tests.  My recollection is that we did this just 
> because we can run out of ports since we can't reuse one until the tcp 
> finwait2 (or whatever) timeout expires.
> 
>>  - And why do we set SO_REUSEADDR if we are going to add the ports to
>>    avoid_ports, so that a completely new port is required anyway?
> 
> I suspect it's safe to drop the option if the Linux vs FreeBSD semantics 
> are in fact different.

That would be great, since it'll allow me to read up on this over
Christmas. I'll submit a PR that just excludes the code for now.
That way the FreeBSD Jenkins will correctly start building master again.
(With my outstanding patches, which are applied separately.)
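
For anyone who wants to see the difference outside of Ceph, something like
the standalone sketch below should show it. It is not Ceph code, and
bind_tcp() and probe_conflict() are names made up for illustration. It binds
a listener on the wildcard address and then tries to bind a second socket on
127.0.0.1 to the same port, mirroring the *:6804 vs 127.0.0.1:6804 pair in
the sockstat output. My assumption is that with SO_REUSEADDR set the second
bind is refused on Linux but accepted on FreeBSD, and that without it both
refuse.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper (not Ceph code): create a TCP socket, optionally set
// SO_REUSEADDR, and bind it to addr:port (both in network byte order).
// Returns the file descriptor on success, or -errno on failure.
static int bind_tcp(in_addr_t addr, in_port_t port, bool reuse_addr) {
  int s = ::socket(AF_INET, SOCK_STREAM, 0);
  if (s < 0)
    return -errno;
  int on = 1;
  if (reuse_addr &&
      ::setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1) {
    int err = errno;  // save errno before close() can overwrite it
    ::close(s);
    return -err;
  }
  sockaddr_in sa{};
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = addr;
  sa.sin_port = port;
  if (::bind(s, reinterpret_cast<sockaddr*>(&sa), sizeof(sa)) == -1) {
    int err = errno;
    ::close(s);
    return -err;
  }
  return s;
}

// Hypothetical probe: listen on *:P (P picked by the kernel), then try to
// bind 127.0.0.1:P on a second socket with the same SO_REUSEADDR setting.
// Returns 0 if the second bind was accepted, the errno of the second bind
// if it was refused, or -1 if the first socket could not be set up.
int probe_conflict(bool reuse_addr) {
  int a = bind_tcp(htonl(INADDR_ANY), 0, reuse_addr);
  if (a < 0)
    return -1;
  sockaddr_in sa{};
  socklen_t len = sizeof(sa);
  if (::listen(a, 1) == -1 ||
      ::getsockname(a, reinterpret_cast<sockaddr*>(&sa), &len) == -1) {
    ::close(a);
    return -1;
  }
  int b = bind_tcp(htonl(INADDR_LOOPBACK), sa.sin_port, reuse_addr);
  int result = (b >= 0) ? 0 : -b;
  if (b >= 0)
    ::close(b);
  ::close(a);
  return result;
}
```

Calling probe_conflict(true) and probe_conflict(false) on both platforms and
comparing the return values against EADDRINUSE should be enough to confirm
or refute the assumption. Note that this runs both sockets in one process,
whereas the OSD case involves two processes, so it is only a first check.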

--WjW

