On 08/29/2018 02:09 AM, kefu chai wrote:
On Mon, Aug 27, 2018 at 2:31 PM Yingxin Cheng <yingxincheng@xxxxxxxxx> wrote:
On Mon, Aug 27, 2018 at 1:05 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
[snip]
What this really sounds like is an independent TCP connection to each
core. Once you get to that point, you may as well just make an OSD per
core and avoid any of the weird client-visible sharding...
It seems to me like the right model won't be clear until it is actually
working and we have a better handle on how much CPU we will be spending.
If we can drive an NVMe with one core, that would be great, but that seems
like a big leap from where we are now. And it will be pretty dependent
on the workload.
So either we (1) assume that we can always divvy up the storage in units
of 1 core and there is no crossbar in the OSD, or (2) we assume there will
be cases where multiple cores are applied to a single device and we have
one. 1 doesn't seem realistic/likely to me...
Not sure if I get the points (or am just repeating them). While an OSD is
still a "per-device" concept, it seems to me that we always need to
support multi-core OSDs, and thus it is not possible to make an OSD per
core.
i think it's more about how we want to shard the PGs in a way
that is more multi-core friendly than about whether we want to have
multi-core OSDs.
i tend to agree with Sage that whether we want to eliminate the crossbar
really depends on whether the request bouncing between CPU cores is a
bottleneck. if it takes us lots of CPU cycles to
enqueue/dequeue the continuation of a task, then the x-core
communication will be a bottleneck. and that's a good sign, IMO, as it
signifies that we are likely highly optimized already. imagine that
you take 1 hour to get the day's job done, and another hour to
commute between home and workplace. =)
take an NVMe OSD host as an example. we can have:
- 4 NVMe SSDs
- Intel Xeon 6152: 2 threads * 22 cores @ 2.1GHz = 44 threads
- 4 OSDs per SSD = 16 OSDs in total
this is the recommended CPU spec [1] of RHCS. there would be around
2000 PGs hosted by this OSD host. if seastar-osd is able to
out-perform the existing OSD by 2x, we will be able to
drive 8 NVMe SSDs with the same CPU. that will be a mapping from 44
(hardware threads) to 32 (OSDs), i.e., 1.375 : 1. and as always, the
performance depends on the workload and access pattern. for instance,
smaller i/o is more CPU intensive.
we will see whether we can make such a big leap, or whether the x-core
communication becomes a bottleneck, once the seastar-osd is in shape.
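The sizing arithmetic in the example above can be written out as a short calculation. This is only a sketch: the variable names are illustrative, and the 2x speedup is the hypothetical from the message, not a measurement.

```python
# Host sizing arithmetic from the example above (names are illustrative).
threads_per_core = 2
cores = 22
hw_threads = threads_per_core * cores        # 44 hardware threads

osds_per_ssd = 4
ssds_today = 4
osds_today = osds_per_ssd * ssds_today       # 16 OSDs

# Hypothetical: seastar-osd is 2x as efficient, so the same CPU
# could drive twice as many SSDs.
assumed_speedup = 2
ssds_future = ssds_today * assumed_speedup   # 8 SSDs
osds_future = osds_per_ssd * ssds_future     # 32 OSDs

threads_per_osd_today = hw_threads / osds_today    # 2.75
threads_per_osd_future = hw_threads / osds_future  # 1.375
```

So even under the 2x assumption each OSD would still have a bit more than one hardware thread available, not exactly one core.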
--
[1] https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/red_hat_ceph_storage_hardware_guide/index
For small random writes, I think we are primarily limited in bluestore
right now by bstore_kv_sync, though certainly we have significant CPU
usage in other threads on other cores as well. To drive ~22-23K IOPS
takes about 8 2.3GHz xeon cores. On my AMD Ryzen desktop, I can get
about 14K IOPS with 4-5 3.0GHz cores. I think it's going to be pretty
tough to keep an OSD on one core unless we really work to decrease CPU
utilization across the whole stack. It might not be impossible, but at
the least it's going to require careful design all the way down to the
KeyValueDB.
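As a rough illustration of the numbers above, the per-core throughput works out as follows. The helper name is made up for this sketch, and the inputs are just the approximate figures quoted in this message.

```python
# Rough IOPS-per-core figures from the approximate numbers quoted above.
def iops_per_core(iops, cores):
    """Average small-random-write IOPS delivered per CPU core."""
    return iops / cores

xeon_low = iops_per_core(22_000, 8)    # 2750.0 (2.3GHz Xeon cores)
xeon_high = iops_per_core(23_000, 8)   # 2875.0
ryzen_low = iops_per_core(14_000, 5)   # 2800.0 (3.0GHz Ryzen cores)
ryzen_high = iops_per_core(14_000, 4)  # 3500.0
```

In both cases a core is delivering on the order of 3K IOPS today, which is the gap an OSD-per-core design would have to close.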
Mark
I think the real difference is to make a PG-shard per core.
With the seastar framework it's easier to implement user-space coroutines
within each core (i.e. async and lockless) and core-local memory
management (i.e. shared-nothing).
And there are at least two design choices:
a) Continue to use the per-client-conn msgr design and then submit to
the PG-sharded core in seastar-osd;
b) Assign or shard TCP connections to PG-sharded cores. This requires
implementing a new pg-sharded-conn msgr design (a big leap here).
sage
I think 2 or 3 shards per osd is enough, so we may end up with 2 to 3 times
as many connections as before. moreover, we can use the NIC's RX RSS
feature to let the kernel avoid cross-core switches.
Does "2 or 3 shards" mean 2~3 worker-threads per seastar-osd? I
remember the default setting of "osd_op_num_shards_ssd" is 8 and
"osd_op_num_threads_per_shard_ssd" is 2, meaning 16 worker-threads in
the all-flash solution.
i think the idea of osd_op_num_shards_ssd and
osd_op_num_threads_per_shard_ssd is that, for each OSD, we can split
the hosted PGs into 8 groups, and each group can be served by 2
threads. to avoid lock contention between PGs, we split the PGs into
groups. and because SSDs are friendlier to higher queue depths than
HDDs are, we assign more osd shards for SSDs. our focus is to take
full advantage of the throughput of the underlying storage device, and to
minimize the interactions between PGs with less overhead, i think.
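The grouping described above can be sketched like this. It assumes the two defaults quoted in the thread (8 shards, 2 threads per shard) and uses a plain modulo on a hypothetical integer PG seed; the real OSD derives the shard from the PG id with its own hash, which is not reproduced here.

```python
# Defaults as quoted in the thread; treat them as assumptions.
OSD_OP_NUM_SHARDS_SSD = 8
OSD_OP_NUM_THREADS_PER_SHARD_SSD = 2

def shard_of(pg_seed, num_shards=OSD_OP_NUM_SHARDS_SSD):
    """Pick the op-queue shard for a PG (plain modulo, an assumption)."""
    return pg_seed % num_shards

def group_pgs(pg_seeds, num_shards=OSD_OP_NUM_SHARDS_SSD):
    """Split PGs into per-shard groups; each group has its own threads."""
    groups = {shard: [] for shard in range(num_shards)}
    for seed in pg_seeds:
        groups[shard_of(seed, num_shards)].append(seed)
    return groups

worker_threads = OSD_OP_NUM_SHARDS_SSD * OSD_OP_NUM_THREADS_PER_SHARD_SSD
groups = group_pgs(range(128))  # 128 hypothetical PGs on one OSD
```

Ops for PGs in different groups never contend on the same shard queue, and with these defaults the OSD runs 16 worker threads.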
but Haomai's concern is that we might end up with too many connections
because of the sharding policy suggested by him, where different ports
are assigned to different cores. "2 or 3 shards" means sharding the
PGs by ports. and all CPU shards are still listening on the same
port. so the number of reactor threads is always the same as the number
of CPU cores we assign to seastar.
On Fri, Aug 24, 2018 at 6:07 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
this is a summary of the discussion on osd-seastar we had in a meeting this week.
seastar uses a shared-nothing design to take advantage of multi-core
hardware. but there are some inherent problems in the OSD. in seastar-osd,
we will have a sharded osd service listening on a given port on all
configured cores in parallel using SO_REUSEPORT, so the connections
are evenly distributed [0] across all seastar reactors.
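To illustrate the SO_REUSEPORT part only, here is a minimal sketch using the plain socket API, not seastar. Each socket stands in for the listener owned by one reactor; the function name is made up, and it needs a platform that provides SO_REUSEPORT (e.g. Linux 3.9 or later, as described in [0]).

```python
import socket

def open_reuseport_listeners(count, host="127.0.0.1", port=0):
    """Open `count` listening sockets that all share one TCP port.

    Each socket stands in for the listener owned by one reactor (core).
    """
    listeners = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Must be set on every socket before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        s.listen()
        # After the first bind, reuse whatever port the kernel picked.
        port = s.getsockname()[1]
        listeners.append(s)
    return listeners
```

With all listeners on the same port, the kernel chooses which one accepts each incoming connection, so a client has no say in which core it lands on.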
also, in seastar-osd, sharding PGs across different cores looks like an
intuitive design. for instance, we can
- ensure the order of osd ops to maintain a pglog
- have better control of the io queue-depth of the storage device
- maintain a consistent state without extra "locking" of the
underlying ObjectStore and PG instances.
but we cannot force a client to send requests to a single PG, or only to
the PGs which happen to be hosted by the core which accepted the connection
from this client. so i think we can only have a run-to-completion
session for a request chain which targets a certain PG, and
forward the client to whichever PG it wants to talk to. this
cross-core communication is inevitable, i think.
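A toy model of the forwarding described above: a request arrives on whichever core accepted the connection and is handed to the core that owns the target PG. All names here are hypothetical, and the PG-to-core mapping is a plain modulo chosen only for illustration.

```python
from dataclasses import dataclass

NUM_CORES = 4  # illustrative reactor count

@dataclass
class Request:
    pg_seed: int  # hypothetical integer id of the target PG
    op: str

def core_of_pg(pg_seed, num_cores=NUM_CORES):
    """Core that owns a PG (plain modulo, an assumption)."""
    return pg_seed % num_cores

def route(req, accepting_core, num_cores=NUM_CORES):
    """Return ("local" | "forward", owning core) for one request."""
    owner = core_of_pg(req.pg_seed, num_cores)
    hop = "local" if owner == accepting_core else "forward"
    return hop, owner

def forwarded_fraction(requests, accepting_core, num_cores=NUM_CORES):
    """Share of requests that need a cross-core hop."""
    hops = [route(r, accepting_core, num_cores)[0] for r in requests]
    return hops.count("forward") / len(hops)
```

If a client's requests are spread evenly over PGs, about (N-1)/N of them need a hop on an N-core OSD, which is why the cost of that hop matters so much.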
to avoid a high-traffic client starving low-traffic connections on a
certain core, we use the `Throttle` attached to each connection. see
SocketConnection::maybe_throttle().
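The general idea of a per-connection throttle can be sketched as a bounded in-flight byte budget. This is not the actual `Throttle` class or `SocketConnection::maybe_throttle()`; the class and method names below are invented for illustration.

```python
class ConnectionThrottle:
    """Bounded in-flight byte budget for one connection (illustrative)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_flight = 0

    def try_take(self, nbytes):
        """Reserve budget for a message; False means stop reading for now."""
        if self.in_flight + nbytes > self.max_bytes:
            return False
        self.in_flight += nbytes
        return True

    def put(self, nbytes):
        """Return budget once the message has been handled."""
        self.in_flight = max(0, self.in_flight - nbytes)
```

Because every connection has its own budget, a busy client runs out of budget and pauses while other connections on the same core keep being served.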
---
[0] https://lwn.net/Articles/542629/
--
Regards
Kefu Chai
--
Regards,
Yingxin