On 2019-01-09 5:30 p.m., Matt Benjamin wrote:
On Tue, Jan 8, 2019 at 8:32 PM Radoslaw Zarzynski <rzarzyns@xxxxxxxxxx> wrote:
<snipped makes-sense-to-me-stuff>
Personally, I perceive the OSD *concept* as a networked ObjectStore instance
exposed over the RADOS protocol.
I remain concerned that this framing is too strong. Recall that
well before the seastar-osd concept, several teams (Mellanox, folks on
my team, Fujitsu/Piotr, and I think Sam) had asked to flex in the
other direction--mapping a reduced number of network connections to
OSDs.
That's still the case. I have a test cluster consisting of 6 hosts, 66 OSDs
in total, and 1536 PGs. How many connections do the ceph-osd processes on a
single host maintain? Almost 1700:
# netstat -ntp | grep -c ceph-osd
1650
How many connections does a randomly chosen ceph-osd process maintain?
# netstat -ntp | grep -c 354420/ceph-osd
158
The problem remains, even if it's reduced by the transition to the async
messenger (fewer threads and less CPU time wasted on context switches) and by
the transition to all-NVMe clusters, which by definition pack fewer OSDs per
host (usually 2-3 per NVMe).
When InfiniBand RC is the transport with Mellanox ConnectX-3 or
-X4, each reliable connection consumes one queue pair (QP), and there
are only 64K QPs available -in total- on the HCA. Solutions lie in the
direction of UD (unreliable datagram) or hybridizing with a shared
receive queue (SRQ). I'm not arguing that a message-passing/datagram
orientation should somehow take precedence, but I think we need to make
space for those setups in what we design now. Treating any incidence of
cross-core communication as an intolerable event feels problematic for that?
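
To make that constraint concrete, here is a minimal libibverbs sketch
(illustrative only, not Ceph messenger code; the SRQ sizing below is
arbitrary) that queries the HCA's QP budget and creates the kind of shared
receive queue the hybrid approach would lean on:

/* Illustrative only -- not Ceph messenger code. Query the HCA's QP budget
 * and create a shared receive queue (SRQ) that many RC QPs could post
 * receives from; the sizes are arbitrary. Build with -libverbs. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    struct ibv_device_attr attr;
    ibv_query_device(ctx, &attr);
    /* max_qp is the hard per-HCA budget referred to above (on the order
     * of 64K on ConnectX-3/-4), shared by every process on the host. */
    printf("max_qp=%d max_srq=%d max_srq_wr=%d\n",
           attr.max_qp, attr.max_srq, attr.max_srq_wr);

    /* One SRQ can supply receive buffers to many QPs, so per-connection
     * receive queues and their memory stop scaling with connection count. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 4096, .max_sge = 1 }
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
    if (!srq)
        perror("ibv_create_srq");
    else
        ibv_destroy_srq(srq);

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

A single SRQ can feed receive buffers to many RC QPs, which helps with
per-connection memory, while going all the way to UD is what actually removes
the per-connection QP cost.
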
+1 for going with datagram/connectionless. Heartbeats alone could be moved
to connectionless communication and that would already help a lot.
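
For what it's worth, a toy sketch of what that boils down to (not the actual
OSD heartbeat path; the peer address, port, and payload below are invented for
illustration):

/* Toy sketch only -- not the actual OSD heartbeat code. One unconnected
 * datagram socket can address every peer, so liveness pings no longer add
 * to the per-OSD connection count. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

static void send_heartbeat(int sock, const char *peer_ip, uint16_t port)
{
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(port) };
    inet_pton(AF_INET, peer_ip, &peer.sin_addr);

    const char ping[] = "osd-heartbeat";   /* placeholder payload */
    sendto(sock, ping, sizeof(ping), 0,
           (struct sockaddr *)&peer, sizeof(peer));
}

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);   /* one socket, zero connections */
    send_heartbeat(sock, "192.0.2.10", 6810);    /* hypothetical peer OSD */
    close(sock);
    return 0;
}

Reply tracking and timeouts would still be needed on top, but none of that
requires keeping an established connection per peer.
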
--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovhcloud.com