On Wed, 16 Dec 2020 10:24:59 PST (-0800), v.mayatskih@xxxxxxxxx wrote:
On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
I was really expecting someone to say that. It does seem kind of silly to build
out the new interface, but not go all the way to a ring buffer. We just didn't
really have any way to justify the extra complexity as our use cases aren't
that high performance. I kind of like to have benchmarks for this sort of
thing, though, and I didn't have anyone who had bothered avoiding the last copy
to compare against.
I worked on something very similar, though performance was one of the
goals. The implementation revolved around lockless ring buffers,
shared memory for zerocopy, multiqueue and error handling. It could be
that every disk storage vendor has to implement something like that in
order to bridge Linux kernel to their own proprietary datapath running
in userspace.
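
For anyone following along who hasn't built one, here is a minimal sketch of a
single-producer/single-consumer lockless ring in C11 atomics. It is
illustration only: it is not the implementation described above and not
anything in dm-user, and every name in it (struct ring, ring_push, ring_pop,
RING_SLOTS) is made up for the example. In a zero-copy design the slots would
presumably carry descriptors pointing into a shared memory region rather than
plain integers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u /* hypothetical size; must be a power of two */

struct ring {
	_Atomic uint32_t head;       /* next slot the producer will write */
	_Atomic uint32_t tail;       /* next slot the consumer will read */
	uint64_t slots[RING_SLOTS];  /* payload; a real ring would hold descriptors */
};

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct ring *r, uint64_t v)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SLOTS)
		return false;

	r->slots[head & (RING_SLOTS - 1)] = v;
	/* Release: the slot write must be visible before the new head is. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct ring *r, uint64_t *out)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;

	*out = r->slots[tail & (RING_SLOTS - 1)];
	/* Release: the slot read must complete before the slot is handed back. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

Neither side takes a lock; each index has exactly one writer, which is why
this only works with one producer and one consumer per ring. Multiqueue would
mean one such ring per queue, and none of the error handling is shown here.
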
OK, good to know. That's kind of the feeling I'd gotten from having chatted to
a handful of people about this, but I don't remember people having actually
gotten all the way to zero-copy. That's how we managed to end up at this
middle-ground ABI style: I thought people were, in practice, punting on
zero copy because the complexity just wasn't worth the performance benefit.
Maybe I'd just been colored by how my projects ended up going, but I've ended
up designing complicated interfaces in the past that allow for zero-copy only
to never get around to actually making that work. I don't know if that's just
because I've had the good fortune to avoid working on anything that ended up
with users, though :).
For our use case I think we actually get better performance out of the
copy-based (and probably more importantly kmalloc-based, but that's an
implementation thing not an ABI thing) approach: essentially we're very
sensitive to memory pressure and expect this first dm-user daemon to mostly be
idle, so we're really worried about avoiding excess memory usage while idle and
less worried about throughput when active. This stream-based interface means
that userspace doesn't need much memory allocated to service a request, which
helps with sleep/wake latencies and/or idle memory usage. That's also why we
have the simple locking scheme: no sense splitting locks if there's no
contention, and we only need a single thread to saturate the storage bandwidth
on these phones.
That said, it does sound like people really do care about the sort of
performance levels where zero copy is relevant in this space. I'll take a shot
at something along those lines, and while it will add a degree of userspace
complexity I'm not sure it'll add much in the way of kernel complexity -- at
least compared to a fast version of this, where we'd need most of that stuff
anyway (obviously the malloc+single lock design is simple, but probably
wouldn't stick around for long). At a bare minimum it'll be interesting to
play around with, but if people are doing it in practice then I'm more
confident that I can put something together that at least serves as a starting
point for further discussion.
I haven't gotten around to writing any code yet, but I had spent a bit of time
thinking about how to put this zero-copy version together and am leaning
towards it being a standalone block device (as opposed to a DM target). I'd
avoided that before as I didn't want to mess around with my own device control
scheme so I'll still try to do the DM thing, but I'm not sure it'll be viable.
That's all speculation now, but it does bring up one interesting question:
IIUC, this version of dm-user handles BIOs before they reach the block
scheduler while a standalone driver would likely handle them after blk-mq. I
don't have direct experience with this, but the last time I ran into people who
had these sorts of performance requirements for userspace drivers they weren't
actually trying to write userspace drivers but were instead trying to write a
userspace scheduler, with the userspace drivers just being the mechanism to
implement that scheduler. This was a decade ago and I'm not sure that's what
people are trying to do in the new blk-mq world, but if it is then it's going
to be a major design consideration. I'm also not entirely sure that we're
really solving the same problem at that point.