Re: cross-core communications in seastar-osd

On Mon, Aug 27, 2018 at 2:31 PM Yingxin Cheng <yingxincheng@xxxxxxxxx> wrote:
>
> On Mon, Aug 27, 2018 at 1:05 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > [snip]
> > What this really sounds like is an independent TCP connection to each
> > core.  Once you get to that point, you may as well just make an OSD per
> > core and avoid any of the weird client-visible sharding...
> >
> > It seems to me like the right model won't be clear until it is actually
> > working and we have a better handle on how much CPU we will be spending.
> > If we can drive an NVMe with one core, that would be great, but that seems
> > like a big leap from where we are now.  And it will be pretty dependent
> > on the workload.
> >
> > So either we (1) assume that we can always divvy up the storage in units
> > of 1 core and there is no crossbar in the OSD, or (2) we assume there will
> > be cases where multiple cores are applied to a single device and we have
> > one.  1 doesn't seem realistic/likely to me...
>
> Not sure if I get the point (or am just repeating it). While the OSD is
> still a "per-device" concept, it seems to me that we always need to
> accommodate multi-core OSDs, and thus it is not possible to make an OSD
> per core.

i think it's more about how we want to shard the PGs in a way that is
more multi-core friendly, than about whether we want to have a
multi-core OSD.

i tend to agree with Sage that whether we want to eliminate the
crossbar really depends on whether requests bouncing between CPU cores
become a bottleneck or not. if it takes us lots of CPU cycles to
enqueue/dequeue the continuation of a task, then the x-core
communication will be a bottleneck. and that would be a good sign, IMO,
as it signifies that we are likely highly optimized already. imagine
that you take 1 hour to get your day-time job done, and another 1 hour
to commute between home and workplace. =)
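
to make that cost concrete, below is a minimal sketch (not the actual
seastar-osd code; pg_to_shard() and handle_on_owner_shard() are made-up
names for illustration) of what the x-core hop looks like with
seastar's smp::submit_to(), assuming a reasonably recent seastar:

// minimal sketch: forward a request from the reactor that accepted the
// connection to the reactor that owns the target PG.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <iostream>

// hypothetical: map a pg id to the shard (reactor/core) that owns it
static unsigned pg_to_shard(unsigned pg_id) {
  return pg_id % seastar::smp::count;
}

// hypothetical handler that must run on the owning shard
static seastar::future<> handle_on_owner_shard(unsigned pg_id) {
  std::cout << "pg " << pg_id << " handled on shard "
            << seastar::this_shard_id() << "\n";
  return seastar::make_ready_future<>();
}

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    unsigned pg_id = 42;
    // the x-core hop: enqueue the continuation on the owning reactor.
    // this enqueue/dequeue is exactly the cost discussed above.
    return seastar::smp::submit_to(pg_to_shard(pg_id), [pg_id] {
      return handle_on_owner_shard(pg_id);
    });
  });
}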

take an NVMe OSD host as an example, we can have:

 - 4 NVMe SSDs
 - Intel Xeon 6152: 2 threads * 22 cores @ 2.1GHz = 44 hardware threads
 - 4 OSDs per SSD = 16 OSDs in total

this is the recommended CPU spec [1] for RHCS. there would be around
2000 PGs hosted by this OSD host. if seastar-osd is able to out-perform
the existing OSD with 2x the performance, we will be able to drive 8
NVMe SSDs with the same CPU. that would be a mapping from 44 (hardware
threads) to 32 (OSDs), i.e. 1.375 : 1. and as always, the performance
depends on the workload and access pattern. for instance, smaller i/o
is more CPU intensive.
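
just to spell out the arithmetic (a throwaway snippet, nothing
seastar-specific):

// back-of-the-envelope check of the numbers above
#include <iostream>

int main() {
  const int hw_threads = 2 * 22;            // 2 threads * 22 cores = 44
  const int osds_per_ssd = 4;
  const int osds_today = osds_per_ssd * 4;  // 4 SSDs -> 16 OSDs
  const int osds_2x    = osds_per_ssd * 8;  // 8 SSDs -> 32 OSDs
  std::cout << "today: " << hw_threads << " threads : " << osds_today
            << " OSDs\n"
            << "2x:    " << hw_threads << " threads : " << osds_2x
            << " OSDs = " << double(hw_threads) / osds_2x << " : 1\n";
}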

we will see if we can have such a big leap, or if the x-core
communication will be a bottleneck, once seastar-osd is in shape.

--
[1] https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/red_hat_ceph_storage_hardware_guide/index
>
> I think the real difference is to make a PG-shard per core.
> With the seastar framework it's easier to implement user-space
> coroutines within each core (i.e. async and lockless), and core-local
> memory management (i.e. shared-nothing).
> And there are at least two design choices:
> a) Continue to use the per-client-conn msgr design and then submit to
> the PG-sharded core in seastar-osd;
> b) Assign or shard TCP connections to PG-sharded cores; this requires
> implementing a new pg-sharded-conn msgr design (a big leap here).
>
> > sage
> >
> >
> > >
> > > > I think 2 or 3 shards per osd is enough, so we may have 2~3 times
> > > > as many connections as before. even more, we can use the nic rx rss
> > > > feature to let the kernel avoid cross-core switches.
> > >
> > > Does "2 or 3 shards" mean 2~3 worker-threads per seastar-osd? I
> > > remember the default setting of "osd_op_num_shards_ssd" is 8 and
> > > "osd_op_num_threads_per_shard_ssd" is 2, meaning 16 worker-threads in
> > > the all-flash solution.

i think the idea of osd_op_num_shards_ssd and
osd_op_num_threads_per_shard_ssd is that, for each OSD, we can split
the hosted PGs into 8 groups, and each of the groups can be served by 2
threads. to avoid lock contention between PGs, we split the PGs into
groups. and because SSD is more friendly to a higher queue depth than
HDD is, we assign more osd shards for SSD. our focus is to take full
advantage of the throughput of the underlying storage device, and to
minimize the interactions between PGs with less overhead, i think.
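
as a rough illustration of that grouping (not the actual OSD code; the
modulo mapping is just a stand-in for however PGs are hashed to
shards):

// illustration of the osd_op_num_shards_ssd /
// osd_op_num_threads_per_shard_ssd idea: PGs are partitioned into
// shards, and each shard has its own small worker pool, so PGs in
// different shards never contend on the same queue.
#include <cstdint>
#include <initializer_list>
#include <iostream>

constexpr unsigned num_shards = 8;        // osd_op_num_shards_ssd
constexpr unsigned threads_per_shard = 2; // osd_op_num_threads_per_shard_ssd

// stand-in for however a PG id is hashed to a shard
unsigned shard_of(uint64_t pg_id) {
  return pg_id % num_shards;
}

int main() {
  for (uint64_t pg : {1, 8, 9, 42}) {
    std::cout << "pg " << pg << " -> shard " << shard_of(pg)
              << " (served by " << threads_per_shard << " threads)\n";
  }
}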

but Haomai's concern is that we might end up with too many connections
because of the sharding policy he suggested, where different ports are
assigned to different cores: "2 or 3 shards" means sharding the PGs by
port. in the current design, though, all CPU shards are still listening
on the same port, so the number of reactor threads is always the same
as the number of CPU cores we assign to seastar.
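
for reference, the "all shards listening on the same port" part boils
down to SO_REUSEPORT; a minimal POSIX sketch (independent of seastar,
port number arbitrary) of what each per-core reactor effectively does:

// each reactor/core opens its own listening socket bound to the *same*
// port, and the kernel spreads incoming connections across them.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <stdexcept>

int make_shard_listener(uint16_t port) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) throw std::runtime_error("socket");
  int one = 1;
  // the key bit: multiple sockets may bind to the same addr:port
  if (::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
    throw std::runtime_error("setsockopt(SO_REUSEPORT)");
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0)
    throw std::runtime_error("bind");
  if (::listen(fd, 128) < 0)
    throw std::runtime_error("listen");
  return fd;  // each core then accept()s on its own fd
}

int main() {
  int fds[4];                               // e.g. 4 "reactors"
  for (int& fd : fds) fd = make_shard_listener(6800);
  // ... each reactor would accept() and serve on its own fd ...
  for (int fd : fds) ::close(fd);
}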


> > >
> > > > kefu chai <tchaikov@xxxxxxxxx> 于2018年8月24日周五 下午6:07写道:
> > > > >
> > > > > this is a summary of discussion on osd-seastar we had in a meeting in this week.
> > > > >
> > > > > seastar uses a share-nothing design to take advantage of multi-core
> > > > > hardware, but there are some inherent problems in the OSD. in
> > > > > seastar-osd, we will have a sharded osd service listening on a given
> > > > > port on all configured cores in parallel using SO_REUSEPORT, so the
> > > > > connections are evenly distributed [0] across all seastar reactors.
> > > > >
> > > > > also, in seastar-osd, sharding PGs across cores looks like an
> > > > > intuitive design. for instance, we can
> > > > > - ensure the ordering of osd ops needed to maintain a pglog
> > > > > - have better control of the io queue-depth of the storage device
> > > > > - maintain a consistent state without extra "locking" of the
> > > > > underlying ObjectStore and PG instances.
> > > > >
> > > > > but we cannot force a client to send requests to a single PG, or to
> > > > > the PGs which happen to be hosted by the core which accepts the
> > > > > connection from this client. so i think we can only have a
> > > > > run-to-completion session for a request chain which targets a
> > > > > certain PG, and forward the request to whichever PG the client wants
> > > > > to talk to. this cross-core communication is inevitable, i think.
> > > > >
> > > > > to avoid a high-traffic client on a certain core starving
> > > > > low-traffic connections, we use the `Throttle` attached to each
> > > > > connection. see SocketConnection::maybe_throttle().
> > > > >
> > > > > ---
> > > > >
> > > > > [0] https://lwn.net/Articles/542629/
> > > > > --
> > > > > Regards
> > > > > Kefu Chai
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Yingxin
> > >
> > >
>
> --
> Regards,
> Yingxin



-- 
Regards
Kefu Chai



