On 2019-01-09 5:30 p.m., Matt Benjamin wrote:
On Tue, Jan 8, 2019 at 8:32 PM Radoslaw Zarzynski <rzarzyns@xxxxxxxxxx> wrote:
<snipped makes-sense-to-me-stuff>
Personally, I perceive the OSD *concept* as a networked ObjectStore instance
exposed over the RADOS protocol.
I remain concerned that this framing is too strong. Recall that
well before the seastar-osd concept, several teams (Mellanox, folks on
my team, Fujitsu/Piotr, and I think Sam) had asked to flex in the
other direction--mapping a reduced number of network connections to
OSDs.
That's still the case. I have a test cluster consisting of 6 hosts, 66 OSDs
in total, and 1536 PGs. How many connections do the ceph-osd processes on a
single host maintain? Almost 1700:
# netstat -ntp | grep -c ceph-osd
1650
How many connections does a randomly chosen ceph-osd process maintain?
# netstat -ntp | grep -c 354420/ceph-osd
158
The problem remains, even if it's reduced by the transition to the async
messenger (fewer threads and less CPU time wasted on context switches) and by
the transition to all-NVMe clusters, which by definition pack fewer OSDs per
host (usually 2-3 per NVMe).
When InfiniBand RC is the transport with Mellanox ConnectX-3 or
-X4, each reliable connection consumes one queue pair (QP), and there
are only 64K QPs available -in total- on the HCA. Solutions lie in the
direction of UD (unreliable datagram) or hybridizing with a shared
receive queue (SRQ). I'm not arguing that a message-passing/datagram
orientation should somehow take precedence, but I think we need to make
space for those setups in what we design now. Treating any incidence of
cross-core communication as an intolerable event feels problematic for that?
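
To make that constraint concrete, here is a minimal libibverbs sketch
(illustrative only, not Ceph messenger code; the SRQ sizing below is
arbitrary) that queries the HCA's QP budget and creates the kind of shared
receive queue the hybrid approach would lean on:

/* Illustrative only -- not Ceph messenger code. Query the HCA's QP budget
 * and create a shared receive queue (SRQ) that many RC QPs could post
 * receives from; the sizes are arbitrary. Build with -libverbs. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    struct ibv_device_attr attr;
    ibv_query_device(ctx, &attr);
    /* max_qp is the hard per-HCA budget referred to above (on the order
     * of 64K on ConnectX-3/-4), shared by every process on the host. */
    printf("max_qp=%d max_srq=%d max_srq_wr=%d\n",
           attr.max_qp, attr.max_srq, attr.max_srq_wr);

    /* One SRQ can supply receive buffers to many QPs, so per-connection
     * receive queues and their memory stop scaling with connection count. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_srq_init_attr srq_attr = {
        .attr = { .max_wr = 4096, .max_sge = 1 }
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
    if (!srq)
        perror("ibv_create_srq");
    else
        ibv_destroy_srq(srq);

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

A single SRQ can feed receive buffers to many RC QPs, which helps with
per-connection memory, while going all the way to UD is what actually removes
the per-connection QP cost.
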
+1 for going with datagram/connectionless. Heartbeats alone could be moved
to connectionless communication and that would already help a lot.
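
For what it's worth, a toy sketch of what that boils down to (not the actual
OSD heartbeat path; the peer address, port, and payload below are invented for
illustration):

/* Toy sketch only -- not the actual OSD heartbeat code. One unconnected
 * datagram socket can address every peer, so liveness pings no longer add
 * to the per-OSD connection count. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

static void send_heartbeat(int sock, const char *peer_ip, uint16_t port)
{
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(port) };
    inet_pton(AF_INET, peer_ip, &peer.sin_addr);

    const char ping[] = "osd-heartbeat";   /* placeholder payload */
    sendto(sock, ping, sizeof(ping), 0,
           (struct sockaddr *)&peer, sizeof(peer));
}

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);   /* one socket, zero connections */
    send_heartbeat(sock, "192.0.2.10", 6810);    /* hypothetical peer OSD */
    close(sock);
    return 0;
}

Reply tracking and timeouts would still be needed on top, but none of that
requires keeping an established connection per peer.
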
--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovhcloud.com