Re: single-threaded seastar-osd

On 1/8/19 7:31 PM, Radoslaw Zarzynski wrote:

On Tue, Jan 8, 2019 at 12:43 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
I want to know what an OSD means in this context.
Let me start with bringing more context on how the concept was born.
It came just from the observation that vendors tend to deploy multiple
ceph-osd daemons on a single NVMe device in their performance testing.
It's not unusual to see 48 physical cores serving 10 NVMes with 2 OSDs
on each, as in Micron's document [1]. This translates into 2.4
physical cores per ceph-osd.


That's probably understating it to be honest.  We've been able to hit 7-8 Xeon cores on last gen NVMe drives when they are pushed really hard. The idea of reducing complexity in the OSD to help isolate and remove overhead is *very* attractive so I know where you are coming from.



The proposed design explores the following assumption: if the current RADOS
infrastructure was able to withstand the resource (connections, osdmap)
inflation in such scenarios, it likely can absorb several times more.
Ensuring we truly have the extra capacity is a *crucial* requirement.

Personally I perceive the OSD *concept* as a networked ObjectStore instance
exposed over the RADOS protocol.


Back when I was playing around with my petstore code I had similar thoughts.  I hadn't realized how much of the filestore, memstore, and bluestore design was being forced by the objectstore interface.  The more I played with petstore, the more I started thinking about what a simple single-threaded (non-seastar) version of the OSD might look like instead.  I was thinking more along the lines of a toy reference implementation than the thing we'd actually deploy in production (though I had hopes that if it worked out it might scale in the same way you imagine, by throwing lots of them at a single device).  It certainly makes the OSD a lot simpler, and I won't argue there isn't value in that.  If I look at my own motives for wanting to go down this path, though, I'm pretty sure I liked it because it's easy and, at least at small scale, probably provides a lot of "bang for the buck" (so long as the rest of the architecture holds up to more OSDs).
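
Roughly the shape I had in mind -- a toy sketch with made-up names,
nothing like the real ObjectStore interface:

    #include <deque>
    #include <functional>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using buffer = std::vector<char>;

    // Toy in-memory "object store" -- a stand-in, not the real API.
    struct flat_store {
        std::map<std::string, buffer> objects;
        void write(const std::string& oid, buffer b) { objects[oid] = std::move(b); }
        buffer read(const std::string& oid) { return objects[oid]; }
    };

    struct tiny_osd {
        flat_store store;
        std::deque<std::function<void()>> ops;  // ops queued by the messenger

        void submit(std::function<void()> op) { ops.push_back(std::move(op)); }

        // The whole OSD is this loop: pop an op, run it to completion,
        // move on.  Nothing ever runs concurrently with anything else,
        // so there are no locks and no cross-core handoffs.
        void run_once() {
            if (ops.empty()) return;
            auto op = std::move(ops.front());
            ops.pop_front();
            op();
        }
    };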



How should a user
think about it?  How should the user think about the governing process?
No different than in the current deployment scenario where multiple OSDs
span the same physical device. An OSD would no longer be bound to a disk
but rather to a partition.
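
For concreteness, this is roughly how such multi-OSD-per-device setups
are created today (ceph-volume has supported this for a while; exact
flags may differ between releases):

    # carve two OSDs out of one NVMe device via LVM
    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1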

Josh rightly pointed out to me that when you get right down to it, an
OSD as it exists today is a failure domain.  That's still true here, but
these OSDs seem a lot more like storage shards that theoretically exist
as separate failure domains but for all practical purposes act as
groups.
In addition to being the leaf entity of the failure-domain hierarchy, I think
the OSD is also an entity of RADOS name resolution (I see the RADOS resolver
as the component responsible for translating a pool/object name into a tuple,
with an ip and port inside, constituting a straight path to an ObjectStore).
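
To make the resolver's job concrete, a toy sketch (hypothetical types;
std::hash and a modulo stand in for the real rjenkins hash and CRUSH):

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    struct entity_addr { std::string ip; uint16_t port; };

    // Hypothetical stand-in for OSDMap; the hash and placement steps are
    // toy placeholders (real Ceph uses rjenkins + straw2 buckets).
    struct osdmap_view {
        uint32_t pg_num;                     // PGs in the pool (power of two)
        std::vector<entity_addr> osd_addrs;  // indexed by OSD id

        uint32_t object_hash(const std::string& oid) const {
            return static_cast<uint32_t>(std::hash<std::string>{}(oid));
        }
        uint32_t crush_primary(uint64_t pool, uint32_t pg) const {
            return static_cast<uint32_t>((pool + pg) % osd_addrs.size()); // toy
        }
    };

    // One level of resolution, done entirely on the client: no in-OSD
    // crossbar or second lookup once the address is known.
    entity_addr resolve(const osdmap_view& m, uint64_t pool,
                        const std::string& oid) {
        uint32_t ps  = m.object_hash(oid);         // object name -> placement seed
        uint32_t pg  = ps & (m.pg_num - 1);        // seed -> PG id
        uint32_t osd = m.crush_primary(pool, pg);  // PG -> primary OSD (stub)
        return m.osd_addrs[osd];                   // OSD id -> ip:port
    }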

As these concepts are currently glued together, the vendors' strategy of
increasing the number of resolution entities is reflected by exposing the
physical disk partitioning in e.g. `osd tree` output (see the excerpt
below). This has its own functional traits. Surely, a more complex
deployment is a downside. However, aren't such activities supposed to be
hidden by Ansible/Rook/*?
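
For illustration (made-up ids and weights), two leaves that in fact
share one NVMe:

    ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
    -1       3.49219 root default
    -3       3.49219     host node1
     0  nvme 1.74609         osd.0       up  1.00000 1.00000
     1  nvme 1.74609         osd.1       up  1.00000 1.00000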

IE are there good architectural reasons to map failure domains
down to "cores" rather than "disks"?  I think we want this because it's
convenient that each OSD shard would have its own msgr and heartbeat
services and we can avoid cross-core communication.  It might even be
the right decision practically, but I'm not sure that conceptually it
really makes a lot of sense to me.
Conceptually we would still map to an ObjectStore instance, not a "core".
The fact that it can be (and even currently is!) laid down on a block device
that is a derivative of another block device looks like an implementation
detail of our deployment process. I'm afraid that mapping the failure domain
to "disk" was fuzzy even before the NVMe era -- with FileStore consuming a
single HDD + a "partition" of a shared SSD.

I totally agree regarding what's happened with HDDs and SSDs, and to a certain extent there's always been the case of power supplies, motherboards, network cards, etc.  At least with those other things it sort of makes sense as far as the physical world goes.  If an entire server goes down and OSDs == disks, it makes sense that multiple OSDs go down.  Maybe we need to provide some kind of special sauce to show the relationships and failure modes that affect different groups of OSDs.



One of the fundamental benefits I see is keeping the RADOS name resolver
intact. It still consists of one level only: the CRUSH name resolution. No
in-OSD crossbar is necessary. Therefore I expect no desire for a RADOS
extension bypassing the new stage by memoizing the mapping it brings.
That is, in addition to simplifying the crimson-osd design (stripping all
seastar::sharded<...> and seastar::foreign_ptrs -- see the sketch below),
there would be absolutely no modification to the protocol or clients. This
means no need for logic handling backward compatibility.
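
A minimal sketch of what remains (assuming Seastar's app_template API;
osd_service is a made-up placeholder, not real crimson code):

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <iostream>

    struct osd_service {
        seastar::future<> start() {
            // One ObjectStore, one messenger, all pinned to shard 0.
            std::cout << "osd serving on shard 0\n";
            return seastar::make_ready_future<>();
        }
    };

    int main(int argc, char** argv) {
        seastar::app_template app;
        // Launch with `--smp 1`.  With a single reactor there is nothing
        // to shard: no seastar::sharded<osd_service>, no invoke_on_all(),
        // no seastar::foreign_ptr crossing cores.
        return app.run(argc, argv, [] {
            static osd_service osd;  // static: outlives the returned future
            return osd.start();
        });
    }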

It's a fair point.  To also play devil's advocate: If you are storing
cache per OSD and the size of each cache grows with the number of OSDs,
what happens as the number of cores / node grows? Maybe we are ok with
current core counts.  Would we still be ok with 256+ cores in a single
node if the number of caches and the size of each cache grows together?
Well, osdmap uses a dedicated mempool. FWIW, local testing and grepping
ceph-users for mempool dumps (see the excerpt below) suggest the cache
stays in the hundreds-of-KBs range. The (rough!) testing also shows linear
growth with the number of OSDs. Still, even tens of MBs per cache instance
might be acceptable as:
   * the economy class (HDDs) would likely use a single OSD per disk -- no
     regression from what we have right now.
   * the high end already deploys multiple OSDs per device, and memory is
     of rather little concern there -- just like in the already pointed out
     case of powerful enough switches/network infrastructure.
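
For reference, the kind of excerpt I mean (illustrative numbers only):

    $ ceph daemon osd.0 dump_mempools
    {
        "mempool": {
            "by_pool": {
                ...
                "osdmap": {
                    "items": 15873,
                    "bytes": 372736
                },
                ...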


I confess I don't regularly look at the size of the osdmap cache since it's usually dwarfed in my (single-OSD) testing by the pglog and bluestore/rocksdb caches.  How many OSDs were in the map when you checked?  I'd be curious what the numbers would look like with something like 1000 nodes and 200 cores/node.  That's probably not an unreasonable target given we already have options with 56+ threads/socket.
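
Back-of-envelope, using your linear-growth observation: 1000 nodes at 200 OSDs/node puts 200,000 OSDs in every map.  If each cache instance then lands in the tens of MBs, 200 instances per node works out to several GB of RAM per node for osdmap caches alone, before pglog and bluestore/rocksdb take their cut.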



Regards,
Radek

[1] Micron® 9200 MAX NVMe™ SSDs + Red Hat® Ceph Storage 3.0,
Reference Architecture


