Re: fedora messaging

Michal Novotny <clime@xxxxxxxxxx> · Thu, 16 Aug 2018 12:53:29 +0200

On Thu, Aug 16, 2018 at 11:43 AM Jeremy Cline <jeremy@xxxxxxxxxx> wrote:
On 08/15/2018 01:53 PM, Michal Novotny wrote:
>> 1. Make catching accidental schema changes as a publisher easy.

> 

> So we can just solve this by registering the schema with the publisher
> first, before any content gets published. Based on the schema, the
> publisher instance may check whether the content intended to be sent
> conforms to it, which could catch some bugs before the content
> is actually sent. If we require this to be done on the publisher side, then
> there is actually no reason to send the schema alongside the content,
> because the check has already been done, so the consumer already knows
> the message is alright when it is received. What should be sent, however,
> is a schema ID, e.g. just a natural number. The schema ID may then be
> used to version the schema, which would be available somewhere publicly,
> e.g. in the service docs, the same way GitHub/GitLab/etc. publish the
> structures of their webhook messages. It would basically be part of the
> public API of a service.

Yes, the schema needs a unique identifier. This is currently provided by

the full Python path of the class, but could be done a different way, as

you point out.
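
To make the publisher-side check concrete, here is a minimal sketch in Python. It assumes a hypothetical in-process registry (SCHEMAS) keyed by a numeric schema ID and a hypothetical build_message() helper. Neither is part of fedora-messaging, and a real deployment would more likely use a JSON Schema validator than the simple type check shown here.

```python
# Hypothetical registry: schema ID -> required fields and their types.
SCHEMAS = {
    1: {"owner": str, "package": str},
    2: {"owner": str, "package": str, "status": str},
}


def validate(schema_id, body):
    """Raise ValueError if the body does not match the registered schema."""
    if schema_id not in SCHEMAS:
        raise ValueError(f"unknown schema ID: {schema_id}")
    schema = SCHEMAS[schema_id]
    missing = sorted(set(schema) - set(body))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field, expected_type in schema.items():
        if not isinstance(body[field], expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")


def build_message(schema_id, body):
    """Validate before sending; only the schema ID travels with the body."""
    validate(schema_id, body)
    return {"schema_id": schema_id, "body": body}
```

Under these assumptions, a body that lacks a required field is rejected before anything reaches the broker, and the consumer only needs the schema ID to look up the published structure.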

> 

>> 2. Make catching mis-behaving publishers on the consuming side easy.

> 

> By checking against the schema on the publisher side, this
> shouldn't be necessary. If someone somehow bypasses the
> publisher check, at worst the message won't be parsable,
> depending on how the message is being parsed. If someone
> really wants to make sure the message is what it is supposed
> to be, they can integrate the schema published on the service
> site into the parsing logic, but I don't think that's a necessary
> thing to do (I personally wouldn't do it in my code).

> 

So most of the work in fedora-messaging is to make this as painless as
possible. Checking the message on both the publisher side and consumer
side is helpful for a few reasons. For example, what if the publisher
changes the schema and fails to change the unique identifier for it
(however that's being defined)?
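
As a rough illustration of that failure mode, the sketch below shows a consumer-side check. The schema identifier "example.build.end.v1", the EXPECTED_FIELDS table, and the consume() function are all hypothetical names made up for this example. It only checks that the required keys are present.

```python
import json

# Hypothetical: fields this consumer relies on, per schema identifier.
EXPECTED_FIELDS = {"example.build.end.v1": {"build_id", "owner", "status"}}


def consume(raw):
    """Return the message body, or None if it does not match its declared schema."""
    message = json.loads(raw)
    expected = EXPECTED_FIELDS.get(message.get("schema"))
    body = message.get("body")
    if expected is None or not isinstance(body, dict) or not expected <= set(body):
        # Reject here instead of failing later with a KeyError deep in the handler.
        return None
    return body
```

If a publisher renames "owner" to "user" but keeps announcing the old identifier, this consumer drops the message at the boundary rather than misbehaving later.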

>> 3. Make changing the schema a painless process for publishers and
>>    consumers.

> 

> I think the only way to do this is to send both content types
> simultaneously for some time, each message being marked with its schema ID.
> It would be good if the consumer always specified which schema ID it wants
> to consume. If there is a higher schema ID available in the message, a
> warning could be printed, maybe even to syslog, so that consumers get the
> information. At the same time, it should be communicated on the service
> site or by other means available. I don't think it is possible to make it
> any better than this.

It's not the only way to do it, but it is one way to do it. It's all a

matter of complexity and where you want to put that complexity. You can,

for example, have a scheme where the routing key (topic) includes the schema

identity. With this approach, apps need to publish both for a while, and

then drop the old topic at some point. That's fine.
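
The versioned-topic approach can be sketched as follows. The topic name "example.build.end", the two body layouts, and the helper names are assumptions made for illustration. Actually sending the messages to a broker is left out.

```python
def routing_key(topic, version):
    """Embed the schema version in the topic, e.g. 'example.build.end.v2'."""
    return f"{topic}.v{version}"


def body_v1(build):
    # Old, nested layout kept alive during the transition.
    return {"build": {"id": build["id"], "user": build["owner"]}}


def body_v2(build):
    # New, flat layout.
    return {"build_id": build["id"], "owner": build["owner"]}


RENDERERS = {1: body_v1, 2: body_v2}


def messages_for(topic, build, versions):
    """Return one (routing key, body) pair per schema version still published."""
    return [(routing_key(topic, v), RENDERERS[v](build)) for v in versions]
```

During the transition the publisher would call messages_for(..., versions=[1, 2]); dropping the old topic later means passing [2] only.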

You can run an intermediate service that knows the various schemas and

publishes every version. That's fine, too.

You can do what we opted to do and produce a schema, wrap it in a

high-level API, and then change it without breaking that high-level API.

This, of course, requires distributing that high-level API, thus the

packaging. If this _isn't_ the route we go, the rule basically becomes

"no changing your message schema ever, just make a new type".

It all comes to the same thing in the end, it's just a matter of where

you deal with compatibility changes. What I'm advocating for is not

making the wire format (some JSON dictionary) the public API because

that has worked very poorly for fedmsg. If the wire format were something
like a protocol buffer, which has some of these ideas built in, it would be
more reasonable to use directly (although in some cases higher-level APIs
can still be useful).
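
A minimal sketch of what "not making the wire format the public API" could look like is given below. The BuildMessage class and both body layouts are hypothetical. The point is only that consumers read properties, so the layout underneath can change.

```python
class BuildMessage:
    """Hypothetical wrapper; consumers read properties, never the raw body."""

    def __init__(self, body):
        self._body = body

    @property
    def owner(self):
        if "owner" in self._body:
            # Newer, flat wire format.
            return self._body["owner"]
        # Older, nested wire format.
        return self._body["build"]["user"]

    @property
    def build_id(self):
        if "build_id" in self._body:
            return self._body["build_id"]
        return self._body["build"]["id"]
```

A consumer written against owner and build_id keeps working whether the message still in the queue uses the old or the new layout, which is the compatibility work moved into the distributed class.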

> 

> I fail to see the point of packaging the schemas.
> If the message content is in JSON, then after receiving the message,
> I would like to be able to just call json.loads(msg) and work with the
> resulting structure as I am used to.
>
> Actually, what I would do in Python is make it a munch and then work
> with it. Needing to install some additional package and instantiate some
> high-level objects just seems clumsy to me in comparison.

It's certainly more work, yes. What you get in exchange is freedom to

change your wire format. Maybe that's not something you need or want.

The point of packaging the schema is to distribute the Python API

without doing something crazy like pickle.

And for the record, I'm all in favor of running a PyPI mirror and

deploying our apps to OpenShift with s2i, thus skipping RPM entirely.

I'm fine with automatically converting them to RPM with a tool, too.

The Python packaging is trivial (5 minutes, there's a template, then

just upload it to PyPI).

> 

> In other programming languages, this procedure would be pretty much the
> same, I believe, as they all probably provide some JSON implementation.

> 

> You mentioned:

> 

>> In the current proposal, consumers don't interact with the JSON at all,

>> but with a higher-level Python API that gives publishers flexibility

>> when altering their on-the-wire format.

> 

> Yes, but with the current proposal, if I change the on-the-wire API, I need
> to make a new version of the schema, package it, somehow get it to
> consumers, and make them use the version that correctly parses
> the new on-the-wire format and translates it to what the consumers
> are used to consuming? That seems like something very difficult to get
> done.

Yes, you need to do that. If you don't do that, your alternative is a

flag day where you update the producer and consumers at the exact same

time and make sure no messages linger in queues. It's what we do now and

it doesn't work.

Well, or you send both messages simultaneously for some time.

>> The big problem is that right now the majority of messages are not

>> formatted in a way that makes sense and really need to be changed to be

>> simple, flat structures that contain the information services need and

>> nothing they don't. I'd like to get those fixed in a way that doesn't

>> require massive coordinated changes in apps.

> 

> In Copr, for example, we take this as an opportunity to change our
> format. If the messaging framework supports format deprecation,
> we might go that way as well to avoid a sudden change. But we don't
> currently have many (or maybe any) consumers, so I am not sure it is
> necessary for us.

> 

> I am not familiar with protocol buffers, but to me they
> seem useful mainly if you want to send the content in a compact
> binary form to save as much space as possible. If we send content
> that can already be interpreted as JSON, then building
> higher-level classes and objects on top of that seems unnecessary.

> 

> I think we could really just take that already existing generic framework
> you were talking about (RabbitMQ?) and just make sure that we can
> check the content against message schemas on the producer side (which is
> great for catching little bugs) and that we know how a message format can
> get deprecated (e.g. by having the messaging framework add a
> "deprecated_by: <topic>" field to each message, which should somehow log
> warnings on the consumer side). The framework could also automatically
> transform the messages into some language-native structures:
> in Python, munches would probably be the sexiest ones.

> 

> The whole "let's package schemas" thing seems like something
> we would typically do (because we are packagers), but not something
> that would solve the actual problems you have mentioned. Rather, it
> makes them more difficult to deal with, if I am correct.
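
For reference, the "deprecated_by" idea quoted above could look roughly like the sketch below on the consumer side. The message layout (a dict with "topic", "body", and an optional "deprecated_by" key), the logger name, and the unwrap() helper are assumptions for illustration, not an existing framework feature.

```python
import logging

log = logging.getLogger("example.consumer")


def unwrap(message):
    """Return the body, warning if the publisher marked the topic as deprecated."""
    replacement = message.get("deprecated_by")
    if replacement is not None:
        log.warning(
            "topic %s is deprecated, switch to %s", message["topic"], replacement
        )
    return message["body"]
```

The consumer keeps receiving the old format, and the warning shows up wherever its logging is configured to go (syslog, for instance).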

As many people will tell you, I am not a big believer in the "let's turn

everything into RPMs manually" idea. I'm fine, as I mentioned, with just

a Python package.

However, I think you're underestimating the power of a high-level API.

Consider, for example, message notifications (FMN and whatever its

successor looks like).

There needs to be a consistent way to take that message, whatever its

wire format, and turn it into a human-readable message. There needs to

be a consistent way to extract the users associated with a message.

There needs to be a way to extract what packages are affected by the

message.
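
As a sketch of what such a consistent interface might look like, consider the classes below. The Message base class, the BuildFinished subclass, the body fields, and the notification() function are illustrative names for this example only and should not be read as the actual fedora-messaging API.

```python
class Message:
    """Hypothetical base class; each project subclasses it for its own messages."""

    def __init__(self, body):
        self.body = body

    @property
    def summary(self):
        raise NotImplementedError

    @property
    def usernames(self):
        return []

    @property
    def packages(self):
        return []


class BuildFinished(Message):
    @property
    def summary(self):
        return "{owner}'s build of {package} finished: {status}".format(**self.body)

    @property
    def usernames(self):
        return [self.body["owner"]]

    @property
    def packages(self):
        return [self.body["package"]]


def notification(message):
    """Build a notification from any message type without inspecting its body."""
    return {
        "text": message.summary,
        "recipients": sorted(message.usernames),
        "packages": sorted(message.packages),
    }
```

A notification service written this way never looks inside the body, so it does not need per-project parsing logic.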

The current solution is fedmsg-meta-fedora-infrastructure, a central
Python module where schemas are poorly encoded by a series of if/else
statements. It also regularly breaks when message schemas change. For
example, I have 2200 emails from the notifications system about how some
Copr and Bodhi messages are unparsable. No one remembers to update the
package, and it ultimately means their messages are dropped or reach
users as incomprehensible JSON.

Yup, on behalf of Copr, I am sorry for that. This was caused by some bugs in
our code, but these things would be caught by the publisher validation in
the new framework. By the way, we would also like to have validators like
"NEVRA" available, maybe in a library, or maybe we can implement them
ourselves. In one instance, we weren't sending the release (I think) and it
broke the fedmsg-meta service. That service is kind of sensitive.
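
A NEVRA validator of that kind might look like the sketch below. It assumes the usual name-[epoch:]version-release.arch layout and only checks the shape of the string. The parse_nevra() name is made up for this example, and it does not verify that the package actually exists.

```python
import re

# name-[epoch:]version-release.arch; version and release may not contain "-".
NEVRA_RE = re.compile(
    r"(?P<name>.+)-(?:(?P<epoch>\d+):)?(?P<version>[^-:]+)"
    r"-(?P<release>[^-]+)\.(?P<arch>[^.-]+)"
)


def parse_nevra(value):
    """Return the NEVRA components as a dict, or None if the string is malformed."""
    match = NEVRA_RE.fullmatch(value)
    return match.groupdict() if match else None
```

A publisher-side check could refuse to send a message whose NEVRA field makes parse_nevra() return None, which would catch a string with the release left out, such as "hello-2.10.x86_64".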

With the current approach, you can just implement a __str__ method on a

class you keep in the same Git repo you use for your project. You can

write documentation on the classes so users can see what messages your

projects send. You can release it whenever you see fit, not when whoever

maintains fedmsg-meta has time to make a release.

It seems like your main objection is the Python package. Personally, I

think making a Python package is a trivial amount of work for the

benefit of being able to define an arbitrary Python API to work with

your messages, but maybe that's not a widely-shared sentiment. If it's

not and we decide the only thing we really want in addition to the

message is a human-readable string, maybe we could include that in the

message in a standard way.

That might also be a way.

Things like i18n notifications might no longer be as easy, though.

-- 

Jeremy Cline

XMPP: jeremy@xxxxxxxxxx

IRC:  jcline

List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx/message/7IWY6VJ44MKY7HWCQ7BAVFRWRYNR7FIQ/