Re: RFC: [Restatement] KBUS messaging subsystem

On 22 Aug 2011, at 02:15, Bryan Donlan wrote:

> I think this may well be the core problem here - is KBUS, as proposed,
> a general API lots of people will find useful, or is it something that
> will fit _your_ use case well, but other use cases poorly?

Indeed.

And, by the way, thanks a lot for this email, which gives me lots of
specific items to reply to.

> Designing a good API, of course, is quite difficult, but it _must_ be
> done before integrating anything with upstream Linux, as once
> something is merged it has to be supported for decades, even if it
> turns out to be useless for 99% of use cases.

Indeed.

It's only anecdotal evidence, of course, but we have put quite a lot
of thought and testing into KBUS - many things that look as if they
are going to be simple or easy either aren't, or lead to unfortunate
consequences.

> Some good questions to ask might be:
> * Does this system play nice with namespaces?
> * What limits are in place to prevent resource exhaustion attacks?
> * Can libdbus or other such existing message brokers swap out their
> existing central-routing-process based communications with this new
> system without applications being aware?

I'll punt on namespaces, since I don't know the terminology being used
in the kernel (there seem to be several sorts of namespace, which are
essentially independent?).

KBUS at the moment is definitely not doing enough to manage its
resources, as I've said elsewhere. However, having all of the queues
in the same place, under the same management, means that it is a
relatively simple job to enforce an overall limit on the memory usage
of a particular bus, as well as per-queue and per-message limits. This
will (eventually) get mended whether KBUS ends up in the kernel or
not.

As to higher-level messaging systems using KBUS, I think that's a red
herring. For a start, I can't see why they'd necessarily be interested
- presumably if they felt the need for such a kernel module they'd
already have moved to introduce one (as in binder, for instance). If
they haven't, it's presumably because their design works well enough
(for their own aims) without. And of course they'd lose a certain
amount of control if part of their system were kernel-maintained,
which might also be important. But also, it's a significant
development project in itself to try to produce a system suitable to
act as the underpinnings for another system. One can't just say "it
looks as if it might work", one needs to implement it and test it,
because of all the edge cases one is bound not to have thought of.
That's a lot of work, and almost entirely unrelated to producing a
simple, minimal system.

  (for what it's worth, and despite that, my gut feeling is that
  any useful minimal messaging system could be used as the
  bottom-level for a libdbus or equivalent, but I'm still not
  convinced it would be worth it, and it would not necessarily give a
  better version of the higher level system)

So I'd say if libdbus or whoever *could* use such a system, that would
be nice, but it should not be the primary aim.

> Keep in mind also that the kernel API need not match the
> application-visible API, if you can add a userspace library to
> translate to the API you want.

OK, although that's basically true of all APIs (for instance, the way
I think of KBUS in the privacy of my own head is with the API I use in
the Python library, or how the message queues actually work within
KBUS itself).

If you look at the existing C and C++ APIs, they provide two somewhat
different abstractions. The Javascript APIs we've used in the past
were even further away from the actual kernel APIs (they never
mentioned a particular bus, since that could be inferred from the
message name in that application).

On the other hand, it is incumbent on us to remember that people
programming to the kernel API are users as well. So your implicit
point (as I take it) in this message, that one should use familiar
interfaces in a familiar way, is a good one. That said, if we can
present the user with a simpler interface (as in simpler to program
with) for a relatively small amount of underlying work, then that is
a net gain - the user is less likely to make mistakes, and the overall
amount of code written will be smaller.

I think there is a general principle at work, in that one should
solve difficult problems once, in one place, if at all possible.

> So, for example, instead of numbering
> kbuses, you could define them as a new AF_UNIX protocol, and place
> them in the abstract socket namespace (ie, they'd have names like
> "\0kbus-0").

Indeed. I'd regard that as cosmetic detail - each KBUS is still
identified by a number, but instead of that number being used in a
device name, it's being used in a socket name.

> Doing something like this avoids creating a new
> namespace, and non-embedded devices could place these new primitives
> in a tmpfs or other more visible location. It also makes it very cheap
> (and a non-privileged operation!) to create kbuses.

Hmm.

The current mechanism for creating new KBUS buses as an unprivileged
user is admittedly via an ioctl, but clearly that should be replaced
by something more modern (the received wisdom on how to use things
like debugfs, for instance, seems to have changed greatly even in
KBUS's short life). It's not something that the current KBUS model
requires one to do often, so cheapness is not a great issue.

But unprivileged is good.

> So, let's look at your requirements:
> 
> * Message broadcast API with prefix filtering
> * Deterministic ordering
> * Possible to snoop on all messages being passed through
> * Must not require any kind of central userspace daemon
> * Needs a race-less way of 1) Advertising (and locking) as a replier
> for a particular message type and 2) Detecting when the replier dies
> (and synthesizing error replies in this event)
> 
> Now, to minimize this definition, why not remove prefix filtering from
> the kernel? For low-volume buses, it doesn't hurt to do the filtering
> in userspace (right?). If you want to reduce the volume of messages
> received, do it on a per-bus granularity (and set up lots of buses
> instead). After all, you can always connect to multiple buses if you
> need to listen for multiple message types. For replier registration,
> then, it would be done on a per-bus granularity, not a per-message
> granularity.
> 
> So we now have an API that might (as an example) look like this:
> 
> * Creation of buses - socket(AF_UNIX, SOCK_DGRAM, PROTO_KBUS),
> followed by bind() either to a file or in the abstract namespace
> * Advertising as a replier on a socket - setsockopt(SOL_KBUS,
> KBUS_REPLIER, &one); - returns -EEXIST if a replier is already present
> * Sending/receiving messages - ordinary sendto/recvfrom. If a reply is
> desired, use sendmsg with an ancillary data item indicating a reply is
> desired
> * Notification on replier death (or replier buffer overflow etc):
> empty message with ancillary data attached informing of the error
> condition
> * 64-bit global counter on all messages (or messages where requested
> by the client) to give a deterministic order between messages sent on
> multiple buses (reported via ancillary data)
> * Resource limitation based on memory cgroup or something? Not sure
> what AF_UNIX uses already, but you could probably use the same system.
> * Perhaps support SCM_RIGHTS/SCM_CREDENTIALS transfers as well?

Thanks a lot for concrete call examples - that makes it a lot easier
for me to think things through. I'll try to separate out my comments
into some sort of sensible sequence. Forgive me if I miss something.

Current scheme
--------------
In the current system, message sending looks something like the
following:

1. Sender opens bus 0
2. Sender creates a message with name "Fred"
3. Sender may mark the message as needing a reply.
4. Sender writes the message and sends it (these are currently two
   operations, but as was mentioned upstream, could be combined - we
   just liked them better apart).
5. If the message needs a reply, KBUS checks if the sender has enough
   space in its queue to receive a reply, and if not, rejects it
6. KBUS assigns the next message id for bus 0 to the message
7. KBUS determines who should receive the message. If the message
   needs a reply, and no-one has bound as replier, then the send
   fails. Similarly, if the replier does not have room in their queue,
   the send fails.
8. Otherwise, KBUS copies the message to the queues of everyone who
   should receive it. If the message needs a reply, then the header of
   the particular message that needs a reply will be altered to
   indicate this.

At the recipient end, the sequence is something like:

1. Listener opens bus 0.
2. Listener chooses to receive messages with a particular name,
   possibly as a replier.
3. Listener determines if there is a next message, and if so, its
   length.
4. Listener allocates a buffer to receive the message.
5. Listener reads the message into the buffer.

The recipient is guaranteed to read messages in the order they were
sent in, and to only get the messages they asked for.

Sockety scheme
--------------
In the scheme where we're just replacing the calls with appropriate
"sockety" calls, and not altering message name filtering, this
presumably proceeds in a very similar manner, except that we are using
the equivalent sockety calls.

My first question would be how the recipient is meant to tell the
length of the next message before doing their recvfrom/recvmsg.

I realise (now) that "message data may vary in length" wasn't
mentioned up front as a requirement (although I'd aver that it is
pretty evident from the documentation, and from the API proposed, that
this is meant, else why do we have NEXTMSG returning the length of the
next message?). But then I'd never imagined that someone wouldn't
assume this as a property of a general messaging system (after all,
it is clearly simple to build a fixed length message system on top of
a variable length message system, and rather harder to do the
reverse).

Can one use MSG_PEEK to retrieve just the ancillary data? It's not
clear to me from the recvmsg man page. If message data length was sent
in the ancillary data, then this could work. If one can't do that,
perhaps one could use MSG_PEEK to look at the start of the message
proper (although that feels like a horrible hack). Is there precedent
for this?
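
One combination that might do the job is recv with MSG_PEEK|MSG_TRUNC:
on Linux datagram sockets MSG_TRUNC is documented to make recv return
the *real* length of the next datagram even when the buffer is too
small, and MSG_PEEK leaves the datagram queued. I believe that works
for UDP datagrams; whether AF_UNIX datagram sockets behave the same on
the kernels we care about would need checking. A minimal sketch,
assuming they do:

#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Sketch: find the length of the next datagram without consuming it,
 * then allocate a buffer of the right size and read it for real. */
static void *read_next_datagram(int fd, ssize_t *len_out)
{
    char probe;
    void *buf;
    ssize_t len;

    /* MSG_TRUNC: return the real datagram length even though we only
     * pass a one byte buffer; MSG_PEEK: leave the datagram queued. */
    len = recv(fd, &probe, 1, MSG_PEEK | MSG_TRUNC);
    if (len < 0)
        return NULL;

    buf = malloc(len ? len : 1);
    if (buf == NULL)
        return NULL;

    *len_out = recv(fd, buf, len, 0);       /* now consume it */
    return buf;
}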

Otherwise, we're reduced to a special call of getsockopt, or, worse,
separating all messages into a standard sized header message followed
by the message data - but that way lies insanity.

There's also a decision to be made of what does get put into ancillary
data. At one extreme, all of the message header data would be treated
as ancillary data, which means that the user would need to use
sendmsg/recvmsg all of the time. That's a lot more code complexity,
and a lot more allocations. At the other extreme, we don't use
ancillary data at all, in which case we keep the header more-or-less
as is. There is the whole issue of whether message name and data are
referred to as pointers from the header, or are part of the same
buffer (there's some discussion of this in the KBUS documentation,
where it talks about "pointy" and "entire" messages). But that's a
level of detail for later, if necessary.

Also, if we're using ancillary data, can we use a socket specific
method to identify the message sender, something one can feed straight
back into sendmsg (hmm, maybe not - a quick scan around suggests that
AF_UNIX only has SCM_CREDENTIALS and SCM_RIGHTS - maybe I've not
looked hard enough).

  (KBUS's current sender id is nice and simple, but I'd assumed there
  must be some sockety equivalent we should be using...)
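
For reference, what AF_UNIX does give you is the sender's pid/uid/gid,
via SO_PASSCRED and SCM_CREDENTIALS - which identifies the sending
*process*, but isn't anything one can feed back into sendmsg as a
destination, which rather supports the "maybe not" above. A sketch of
the receiving side:

#define _GNU_SOURCE             /* for struct ucred */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch: report who sent the next datagram.  SO_PASSCRED must be set
 * before the datagram arrives for the credentials to be attached. */
static void report_sender(int fd)
{
    char data[4096];
    char control[CMSG_SPACE(sizeof(struct ucred))];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov,         .msg_iovlen = 1,
        .msg_control = control,  .msg_controllen = sizeof(control),
    };
    struct cmsghdr *c;
    int one = 1;

    setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &one, sizeof(one));

    if (recvmsg(fd, &msg, 0) < 0)
        return;

    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
            struct ucred uc;
            memcpy(&uc, CMSG_DATA(c), sizeof(uc));
            printf("datagram from pid %ld, uid %ld\n",
                   (long)uc.pid, (long)uc.uid);
        }
    }
}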

Regardless, we have an API comparison something like:

========================            =============================
File-oriented                       Socket-oriented
========================            =============================
open                                socket
close                               close
write [and <send> ioctl]            sendmsg or sendto
<nextmsg> ioctl                     not clear - getsockopt? peek?
read                                recvmsg or recvfrom
<bind> ioctl                        setsockopt
<unbind> ioctl                      setsockopt
poll                                poll
========================            =============================

There are also various ioctls on the file-oriented side that would
clearly be replaced by get/setsockopt, and more that should be direct
instructions to KBUS via debugfs or something (i.e., they should never
have been ioctls in the first place, if I'd known what to do instead).

> This is a much simpler kernel API, don't you think?

I think we mean different things by that.

Replacing read/write (which are, let's face it, quite simple to use,
and which just about every C programmer can get mostly right) with
sendmsg/recvmsg (which are some of the most complicated calls to use
in the socket world, and for which the most easily found examples are
about passing file descriptors between processes) does not seem to me
to be simplifying anything.
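
To make that concrete, here is roughly what "send a message and ask
for a reply" looks like if the reply request travels as ancillary
data. The KBUS-specific names (SOL_KBUS, KBUS_WANT_REPLY) are made up
for the sketch - they are proposals at best - but the surrounding
boilerplate is the standard cmsg dance, and it is all boilerplate the
application programmer has to get right:

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Entirely hypothetical values, standing in for whatever the proposed
 * protocol would actually define. */
#define SOL_KBUS         280
#define KBUS_WANT_REPLY  1

/* Sketch: send `len` bytes of message on `fd` (a socket created with
 * the hypothetical socket(AF_UNIX, SOCK_DGRAM, PROTO_KBUS)), flagging
 * via ancillary data that a reply is wanted. */
static ssize_t send_request(int fd, const void *data, size_t len)
{
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    char control[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov,         .msg_iovlen = 1,
        .msg_control = control,  .msg_controllen = sizeof(control),
    };
    struct cmsghdr *c;
    int want_reply = 1;

    memset(control, 0, sizeof(control));
    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_KBUS;               /* hypothetical */
    c->cmsg_type  = KBUS_WANT_REPLY;        /* hypothetical */
    c->cmsg_len   = CMSG_LEN(sizeof(want_reply));
    memcpy(CMSG_DATA(c), &want_reply, sizeof(want_reply));

    return sendmsg(fd, &msg, 0);
}

Compare that with a write plus the <send> ioctl, which is most of what
the current interface asks of the C programmer.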

I must admit I'm also not entirely sure why get/setsockopt calls are
*that* much better than ioctls (they do at least specify a length, and
the number of existing options is smaller, but neither of those seems
an obvious win to me).

Regardless, though, assuming the message length problem can be sorted
out (and that's obviously possible by *some* means), it is clearly
feasible to replace one API with another, and I assume one could move
the innards of the current KBUS to talk to the new interface.

Filtering in userspace
----------------------
You suggest performing message filtering in userspace, by reading all
messages and "throwing away" those which are not of interest. This is
predicated on the idea that the data is low volume. Apart from the
fact that I'm not sure what low volume means (are we contrasting
with network traffic on an STB handling audio/video?), we've tried not
to assume anything much about the amount of traffic over KBUS, or the
number of senders/listeners. Granted I personally wouldn't recommend
sending very large messages (I'm doubtful of the sanity of anything
over a few MB, myself, although KBUS will cope with multiple page
messages - albeit rather slowly), or expecting KBUS to be fast
(whatever "fast" means), but I'm reluctant to put those assumptions
into the design.

The inside of KBUS would indeed be slightly simpler if it did not
perform the message filtering (and this is substantially unoptimised
at the moment). However, if the client receives all of the messages,
that's an awful lot more copying being done. Within the kernel module,
message content is reference counted, but as data goes across the
kernel-to-userspace boundary, all of it gets copied. In the current
system, one can happily send a message knowing that it will not get
sent to recipients who do not care, and thus not worry much about the
cost in CPU, memory and so on. In the non-filtering system, such
concerns would need to be a major issue (spamming many clients with a
single large message that they are going to ignore could be a very big
deal, and would be relatively hard to defend against).
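
Concretely, the non-filtering client ends up doing something like the
following (assuming, purely for the sake of the sketch, that the
message name arrives NUL-terminated at the front of the datagram).
Every message it discards has already been copied out of the kernel by
the time it is thrown away:

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Sketch: receive everything, keep only messages whose name starts
 * with `prefix`, and silently drop the rest - each of which has
 * nonetheless been copied into `buf` from the kernel. */
static ssize_t recv_matching(int fd, const char *prefix,
                             char *buf, size_t buflen)
{
    size_t plen = strlen(prefix);

    for (;;) {
        ssize_t len = recv(fd, buf, buflen, 0);
        if (len <= 0)
            return len;                 /* error, or nothing more */
        if ((size_t)len >= plen && strncmp(buf, prefix, plen) == 0)
            return len;                 /* one we actually wanted */
        /* otherwise: a whole message, copied to userspace just to be
         * thrown away */
    }
}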

You also suggest splitting buses up into a finer granularity, in the
hope that this would cause less userspace filtering to be necessary.
I'm uncomfortable with that suggestion because it is just that, a
suggestion as to how the user might do things. It doesn't address the
problem in a technical manner at all.

I'd also note that there is virtue in having the unsplit buses. In the
current system, it is sensible to say that all messages for one task
will be on bus 0, and messages for another task on bus 1, and the two
cannot interact. One has, if you will, multiple message namespaces. It
makes sense to say that a program will only send messages on bus 0,
without needing to list the messages. In the proposed new system, a
single task will typically need to span multiple buses, and we've lost
a useful distinction.  Thus the original approach is a win for
documentation and pedagogy, if nothing else.

Replier buses
-------------
Putting replier registration on a bus basis. Hmm. So a recipient would
"bind" to a bus as replier *for that bus*. Would all messages on that
bus be seen as requests by the replier? I think it would have to be
so, if only because marking messages as requests *as well* leads to
all sorts of possible confusions.

What if the recipient were monitoring messages as well, so it would
also want to "just receive" the requests? Presumably it would have to
open another connection to the bus to receive requests as ordinary
messages. OK.

Meanwhile, the sender presumably has to indicate that this bus is a
replier bus, with an appropriate setsockopt call. Note that this means
that everyone needs to know beforehand the id of that bus, and we are
getting perilously close to needing some sort of manager of bus
ids/names (this makes me uncomfortable), or having a formalism about
how buses are named (ditto).

So sending now looks more like:

1. Sender opens bus 0

2. If this is to be a replier bus, sender marks it as such, via
   setsockopt. Definitely not via an ioctl.

   Note that we can't check for someone registered as a replier at
   this point, as they might reasonably not have connected to the bus
   yet.

3. Sender creates a message
4. Sender sends the message.

   As before, in the sockety manner.

5. If this is a replier bus, then KBUS checks to see if the sender has
   enough room to receive a reply on it.

6. KBUS assigns the next message id to the message

   In current KBUS, the message id is unique on each bus, and buses
   are isolated from each other. That doesn't work now, because we
   need the recipient to be able to reconstruct message ordering
   across buses. So the new id generation mechanism needs to be
   bus-independent. Using a 64-bit id should probably give us at least
   the id "granularity" that the current 32-bit id does (a tiny
   kernel-side sketch of such a counter follows this list).

7. If this is a replier bus, then KBUS checks to see if there is a
   replier bound to it (thus, leaving it as late as possible, and
   giving the most chance the replier will be there). If no-one has
   bound as replier, then the send fails. Similarly, if the replier
   does not have room in their queue, the send fails.

8. KBUS copies the message to the queues of everyone who has bound to
   receive it.
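
(The bus-independent id in step 6 is at least straightforward: inside
the kernel it could simply be a 64-bit counter shared by all buses,
something like the sketch below. The names are invented for the
illustration.)

/* Sketch (kernel side): one 64-bit id sequence shared by every bus,
 * so that recipients can reconstruct ordering across buses. */
#include <linux/atomic.h>
#include <linux/types.h>

static atomic64_t kbus_global_msg_id = ATOMIC64_INIT(0);

static u64 kbus_next_msg_id(void)
{
    /* Unique and monotonically increasing, with no lock shared
     * between buses. */
    return atomic64_inc_return(&kbus_global_msg_id);
}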
  
At the recipient end, the sequence is then presumably something like
the following (a rough sketch in code follows the list):

1. Listener opens bus 0.
2. Listener possibly chooses to be a replier for that bus, using
   setsockopt. It is an error if there is already a replier for the
   bus.

   Is it an error to bind as a replier on a bus that is not marked as
   such? I can't see that it can be, because otherwise we would have
   to fail with the race condition where:

   a. Listener opens bus
   b. Sender opens bus
   c. Listener binds to bus as replier
   d. Sender tells bus it is a replier bus

   So I think we have to allow a replier bound on a non-replier bus -
   they'd just never get any messages on it. Which means KBUS has to
   make sure not to send ordinary messages to a listener on a bus
   they've bound to as replier.

     (For a world of pain, invent a getsockopt option to check if a
     bus is marked as a replier bus, and wait for it to get set, and
     *then* bind as replier. But I wouldn't want to advise it.)

   That all feels rather messy, and is really one of the sorts of
   reason we went with just marking messages and leaving buses as
   message agnostic transports.

3. Listener determines if there is a next message, and if so, its
   length.

   Again, as said before, I'm not sure how this would be done.

4. Listener allocates the appropriate number of buffers to receive the
   message.

5. Listener reads the message into the buffers.

6. If the message was read via a socket with the "replier" socket
   option set (one assumes the recipient remembers this), then the
   listener needs to send a reply over that same socket.
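
Put into rough code - again using the made-up constant names from the
proposed API (PROTO_KBUS, SOL_KBUS, KBUS_REPLIER), and glossing over
the length problem from earlier - the listener-as-replier side might
look like:

#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>

/* Hypothetical constants, as per the proposed API. */
#define PROTO_KBUS    0
#define SOL_KBUS      280
#define KBUS_REPLIER  1

/* Sketch of the listener-as-replier side: open the bus, claim the
 * (single) replier role, then service one request. */
static int serve_one_request(const char *bus_name)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    size_t namelen = strlen(bus_name);  /* assumed to fit sun_path */
    socklen_t addrlen;
    char buf[4096];
    ssize_t len;
    int one = 1;

    int fd = socket(AF_UNIX, SOCK_DGRAM, PROTO_KBUS);
    if (fd < 0)
        return -1;

    /* Abstract-namespace address: leading NUL, then the name. */
    memcpy(addr.sun_path + 1, bus_name, namelen);
    addrlen = offsetof(struct sockaddr_un, sun_path) + 1 + namelen;
    if (bind(fd, (struct sockaddr *)&addr, addrlen) < 0)
        return -1;

    /* Claim the replier slot for this bus: fails if already taken. */
    if (setsockopt(fd, SOL_KBUS, KBUS_REPLIER, &one, sizeof(one)) < 0)
        return -1;

    /* Read a request (the length problem is glossed over here) and
     * send the reply back over the same socket. */
    len = recv(fd, buf, sizeof(buf), 0);
    if (len < 0)
        return -1;

    /* ... construct a reply in buf ... */
    return send(fd, buf, len, 0);
}

So something like serve_one_request("kbus-0"). Whether the reply
should go back via a plain send, or via sendmsg with the original
request's id attached as ancillary data, is exactly the sort of detail
that would need pinning down.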

Conceptually, this does look like it would work (subject to the
binding nastiness mentioned), and it's clearly more-or-less a dual of
the approach we've already taken. If we were splitting current KBUS
buses into finer granularities, it would be a reasonable approach to
consider.

The problem with splitting related messages over many buses
-----------------------------------------------------------
The trouble is that, whilst the recipient is guaranteed to receive
messages *on a given bus* in the correct order, this is no longer
sufficient, as the order we care about is now split over multiple
buses.

You propose that the recipient should reassemble the message order.
This is clearly possible if they are receiving all messages, but at
the cost of having to keep a list (potentially a very long list) of
message ids received and outstanding, and only "releasing" a message
when all preceding message ids have been encountered. A colleague's
comment on this was that we should not be reimplementing TCP/IP in
userspace. I'd just say that this is a non-trivial problem, and that,
if every recipient has to do it, it becomes a potential burden on the
performance of the whole system (especially if we're talking many
buses); so it should be done in the place that causes the fewest
reimplementations, i.e., in the kernel module. Which puts us back
where we were.
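
For concreteness, the recipient-side reordering is something like the
sketch below: hold on to everything that arrives early, and only hand
a message on once every id before it has been seen. (This only works
if the recipient really does see every id, as the parenthetical below
notes.)

#include <stdint.h>
#include <stdlib.h>

/* Sketch: hold early arrivals back until everything before them has
 * been seen.  A real implementation would want a better structure
 * than a linear list, and a bound on how much it will buffer. */

struct held_msg {
    uint64_t id;
    void *msg;
    struct held_msg *next;
};

static struct held_msg *held = NULL;    /* messages that arrived early */
static uint64_t next_expected = 1;      /* assume ids start at 1 */

static void deliver(void *msg)
{
    (void)msg;                          /* hand the message to the app */
}

static void on_receive(uint64_t id, void *msg)
{
    struct held_msg *p, **pp;

    if (id != next_expected) {          /* too early: remember it */
        p = malloc(sizeof(*p));
        if (p == NULL)
            return;                     /* error handling elided */
        p->id = id;
        p->msg = msg;
        p->next = held;
        held = p;
        return;
    }

    deliver(msg);
    next_expected++;

    /* Release anything held that is now in order. */
    for (pp = &held; *pp != NULL; ) {
        if ((*pp)->id == next_expected) {
            p = *pp;
            *pp = p->next;
            deliver(p->msg);
            free(p);
            next_expected++;
            pp = &held;                 /* restart the scan */
        } else {
            pp = &(*pp)->next;
        }
    }
}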

(Obviously, if the recipient is not getting *all* messages, then this
problem is unsolvable, since it cannot know which message ids are
missing - i.e., if it receives messages 5, 9 and 7, it has no way of
knowing whether it should also have received message 8.)

> In short, API minimalism is key to acceptance in the upstream kernel.
> Try to pare down the core API to the bare minimum to get what you
> need, rather than implementing your final use case directly into the
> kernel using ioctls or whatnot.

Hmm. As I recall, when starting KBUS development we said "what's the
simplest API we can present to the user to do the job", at the same
time as asking "what's the simplest set of functionalities that we
need to provide". So, in a very real sense, we did start by trying to
pare down the core API.

Of course, that same aim led us to reject trying to force sockets to
do the job just because "sockets are used for messaging". Not that
they always are, of course - one doesn't classically communicate with
DSPs over sockets, for instance.

> Thanks,
> Bryan

Thanks again,
Tibs


