Re: newstore direction

Orit Wasserman <owasserm@xxxxxxxxxx> · Thu, 22 Oct 2015 10:51:35 +0200



On Thu, 2015-10-22 at 02:12 +0000, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is that the way basic trade-offs in storage management software architecture are being affected. In the HDD world CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered in the latest generation of NVMe SSDs you rapidly see that that storage itself is generally no longer the bottleneck (speaking about BW, not latency of course) but rather it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP.
> 
> When Sandisk first starting working with Ceph (Dumpling) the design of librados and the OSD lead to the situation that the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!) some of the internal queueing in the OSD (and the associated context switches) have been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly.
> 
> Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code.
> 

+1
It not just reducing context switches but also about removing contention
and data copies and getting better cache utilization.

Scylladb just did this to cassandra (using seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Orit

> I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future.
> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@xxxxxxxxxxx
> 
> -----Original Message-----
> From: Martin Millnert [mailto:martin@xxxxxxxxxxx]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson <mnelson@xxxxxxxxxx>
> Cc: Ric Wheeler <rwheeler@xxxxxxxxxx>; Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: newstore direction
> 
> Adding 2c
> 
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland
> > kvstore/block approach is going to be less work, for everyone I think,
> > than trying to quickly discover, understand, fix, and push upstream
> > patches that sometimes only really benefit us.  I don't know if we've
> > truly hit that that point, but it's tough for me to find flaws with
> > Sage's argument.
> 
> Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread:
> In the networking world, there's been development on memory mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc, and streamlining processor cache management performance. People have gone as far as removing CPU cores from CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic.
> 
> Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency.
> (And really, high performance random IO characteristics approaches the networking, per-packet handling characteristics).  Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case, reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code, to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples dependencies on users i.e.
> waiting for the next distro release before being able to take up the benefits of improvements to the storage code.
> 
> A random google came up with related data on where "doing something way different" /can/ have significant benefits:
> http://phunq.net/pipermail/tux3/2015-April/002147.html
> 
> I (FWIW) certainly agree there is merit to the idea.
> The scientific approach here could perhaps be to simply enumerate all corner cases of "generic FS" that actually are cause for the experienced issues, and assess probability of them being solved (and if so when).
> That *could* improve chances of approaching consensus which wouldn't hurt I suppose?
> 
> BR,
> Martin
> 
> 
> ________________________________
> 
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> NrybXǧv^)޺{.n+z]z{ayʇڙ,jfhzwj:+vwjmzZ+ݢj"!


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html