Adding 2c.

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland
> kvstore/block approach is going to be less work, for everyone I think,
> than trying to quickly discover, understand, fix, and push upstream
> patches that sometimes only really benefit us. I don't know if we've
> truly hit that point, but it's tough for me to find flaws with
> Sage's argument.

Regarding the userland vs. kernel-land aspect of the topic, there are further aspects AFAIK not yet addressed in this thread:

In the networking world, there has been development of memory-mapped userland networking (multiple approaches exist). For very, very specific networking applications, handling packets this way avoids e.g. per-packet context switches and streamlines processor cache usage. People have gone as far as removing CPU cores from the CPU scheduler to dedicate them completely to the networking task at hand (a cache optimization). Various latency/throughput trade-offs (e.g. batching) apply, but at the end of the day it is about keeping the CPU bus busy with "revenue" bus traffic.

Granted, storage IO operations may be so much heavier in cycle counts that context switches never appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high-performance IO, *every* hurdle in the data path adds to the total latency. (And really, high-performance random IO approaches the per-packet handling characteristics of networking.)

Now, I'm not really suggesting memory-mapping a storage device into user space, not at all. But having better control over the data path for a very specific use case reduces the dependency on code that has to work as well as possible for the general case, and allows for very purpose-built code addressing a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.)
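To make the per-operation overhead point concrete, here is a hypothetical little micro-benchmark (a sketch of my own, not Ceph code) that writes the same total payload twice: once as many tiny write() syscalls, once batched into fewer, larger ones. The file name and sizes are arbitrary; the point is only that each user/kernel transition in the data path has a fixed cost that batching amortizes:

```python
import os
import tempfile
import time

def timed_writes(path, n_ops, chunk):
    """Write n_ops chunks to path, returning elapsed wall-clock seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    t0 = time.perf_counter()
    for _ in range(n_ops):
        os.write(fd, chunk)           # one user/kernel transition per call
    os.close(fd)
    return time.perf_counter() - t0

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "bench")
    # Same 256 KiB total payload in both runs:
    small = timed_writes(path, 4096, b"x" * 64)      # 4096 syscalls, 64 B each
    batched = timed_writes(path, 64, b"x" * 4096)    # 64 syscalls, 4 KiB each
    print(f"small: {small:.4f}s  batched: {batched:.4f}s")
```

The batched run issues 64x fewer syscalls for the same data, so the gap between the two timings is roughly the context-switch/syscall tax the userland-networking people are designing out of their data path.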
It also decouples from downstream dependencies, i.e. users no longer have to wait for the next distro release before being able to pick up improvements to the storage code.

A random Google search turned up related data on where "doing something way different" /can/ have significant benefits: http://phunq.net/pipermail/tux3/2015-April/002147.html

I (FWIW) certainly agree there is merit to the idea. The scientific approach here could perhaps be to simply enumerate all the corner cases of a "generic FS" that actually cause the issues experienced, and assess the probability of each being solved (and if so, when). That *could* improve the chances of reaching consensus, which wouldn't hurt, I suppose?

BR,
Martin