On 08/15/2018 01:53 PM, Michal Novotny wrote:
>> 1. Make catching accidental schema changes as a publisher easy.
>
> So we can just solve this by registering the schema with the
> publisher first, before any content gets published. Based on the
> schema, the publisher instance may check that the content intended
> to be sent conforms to it, which could catch some bugs before the
> content is actually sent. If we require this to be done on the
> publisher side, then there is actually no reason to send the schema
> alongside the content, because the check has already been done, so
> the consumer already knows the message is alright when it is
> received. What should be sent, however, is a schema ID, e.g. just a
> natural number. The schema ID may then be used to version the
> schema, which would be available somewhere publicly, e.g. in the
> service docs, the same way GitHub/GitLab/etc. publish the structures
> of their webhook messages. It would basically be part of the public
> API of a service.

Yes, the schema needs a unique identifier. This is currently provided
by the full Python path of the class, but it could be done a different
way, as you point out.

>> 2. Make catching mis-behaving publishers on the consuming side easy.
>
> By checking against the schema on the publisher side, this shouldn't
> be necessary. If someone somehow bypasses the publisher check, at
> worst the message won't be parsable, depending on how the message is
> being parsed. If someone really wants to make sure the message is
> what it is supposed to be, he/she can integrate the schema published
> on the service site into the parsing logic, but I don't think that's
> a necessary thing to do (I personally wouldn't do it in my code).

So most of the work in fedora-messaging is to make this as painless as
possible. Checking the message on both the publisher side and the
consumer side is helpful for a few reasons. For example, what if the
publisher changes the schema and fails to change the unique identifier
for it (however that's being defined)?
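As a rough sketch of what that publisher-side check can look like with
the jsonschema library (the schema, topic, and field names here are
invented for illustration, not Copr's real ones):

    import json

    import jsonschema

    # Hypothetical schema for a "build finished" message; "id" doubles
    # as the public schema identifier/version from the proposal above.
    BUILD_END_SCHEMA = {
        "id": "copr.build.end.v1",
        "type": "object",
        "required": ["build_id", "status"],
        "properties": {
            "build_id": {"type": "integer"},
            "status": {"type": "string"},
        },
    }

    def publish(channel, topic, body):
        """Refuse to publish anything that doesn't match the schema."""
        # Raises jsonschema.exceptions.ValidationError before the
        # message ever hits the wire.
        jsonschema.validate(body, BUILD_END_SCHEMA)
        channel.basic_publish(exchange="amq.topic", routing_key=topic,
                              body=json.dumps(body))

A consumer running the same validate() call on receipt is what catches
that case: the body changed, the identifier didn't, and the message
gets rejected instead of silently mis-parsed.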
>> 3. Make changing the schema a painless process for publishers and
>> consumers.
>
> I think the only way to do this is to send both content types
> simultaneously for some time, each message being marked with its
> schema ID. It would be good if the consumer always specified what
> schema ID it wants to consume. If there is a higher schema ID
> available in the message, a warning could be printed, maybe even to
> syslog, so that consumers get the information. At the same time it
> should be communicated on the service site or by other means
> available. I don't think it is possible to make it any better than
> this.

It's not the only way to do it, but it is one way to do it. It's all a
matter of complexity and where you want to put that complexity.

You can, for example, have a scheme where the routing key (topic)
includes the schema identity. With this approach, apps need to publish
both for a while, and then drop the old topic at some point. That's
fine. You can run an intermediate service that knows the various
schema and publishes every version. That's fine, too. You can do what
we opted to do and produce a schema, wrap it in a high-level API, and
then change it without breaking that high-level API. This, of course,
requires distributing that high-level API, thus the packaging. If this
_isn't_ the route we go, the rule basically becomes "no changing your
message schema ever, just make a new type".
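To make the routing-key option concrete, here's a rough sketch of a
publisher sending both versions during the transition window. The
topics and both wire formats are invented for the example:

    import json

    # Hypothetical topics: the schema version lives in the routing
    # key, so consumers just bind to the version they understand.
    TOPIC_V1 = "org.fedoraproject.copr.build.end.v1"
    TOPIC_V2 = "org.fedoraproject.copr.build.end.v2"

    def publish_build_end(channel, build_id, status):
        v1_body = {"build": build_id, "status": status}    # old format
        v2_body = {"build_id": build_id, "state": status}  # new format
        for topic, body in ((TOPIC_V1, v1_body), (TOPIC_V2, v2_body)):
            channel.basic_publish(exchange="amq.topic",
                                  routing_key=topic,
                                  body=json.dumps(body))

Once all consumers have moved their bindings over to the v2 topic, the
v1 publish just gets deleted.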
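And a sketch of the high-level API option. The class is hypothetical,
but it shows the idea: a property papers over a wire-format change, so
code built on the class keeps working:

    class BuildEndMessage:
        """Wraps the raw JSON body; consumers use the properties and
        never touch the dict, so the dict is free to change."""

        def __init__(self, body):
            self._body = body

        @property
        def build_id(self):
            # Suppose v2 of the wire format renamed "build" to
            # "build_id"; the property absorbs the rename, so
            # consumers need no flag day.
            if "build_id" in self._body:
                return self._body["build_id"]
            return self._body["build"]

        def __str__(self):
            # One obvious place for the human-readable text that
            # something like FMN would send out.
            return "Copr build {} finished".format(self.build_id)

Distributing a class like that to consumers is what the packaging talk
below is about.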
It all comes to the same thing in the end; it's just a matter of where
you deal with compatibility changes. What I'm advocating for is not
making the wire format (some JSON dictionary) the public API, because
that has worked very poorly for fedmsg. If the wire format were
something like a protocol buffer, it would have some of these ideas
built in and would be more reasonable to use directly (although in
some cases higher-level APIs can be useful).

> I fail to see the point of packaging the schemas. If the message
> content is in JSON, then after receiving the message, I would like
> to be able to just call json.loads(msg) and work with the resulting
> structure as I am used to.
>
> Actually, what I would do in Python is that I would make it a munch
> and then work with it. Needing to install some additional package
> and instantiate some high-level objects just seems clumsy to me in
> comparison.

It's certainly more work, yes. What you get in exchange is the freedom
to change your wire format. Maybe that's not something you need or
want. The point of packaging the schema is to distribute the Python
API without doing something crazy like pickle. And for the record, I'm
all in favor of running a PyPI mirror and deploying our apps to
OpenShift with s2i, thus skipping RPM entirely. I'm fine with
automatically converting them to RPM with a tool, too. The Python
packaging is trivial (5 minutes, there's a template, then just upload
it to PyPI).

> In other programming languages, this procedure would be pretty much
> the same, I believe, as they all probably provide some JSON
> implementation.
>
> You mentioned:
>
>> In the current proposal, consumers don't interact with the JSON at
>> all, but with a higher-level Python API that gives publishers
>> flexibility when altering their on-the-wire format.
>
> Yes, but with the current proposal, if I change the on-the-wire API,
> I need to make a new version of the schema, package it, somehow get
> it to consumers, and make them use the correct version that
> correctly parses the new on-the-wire format and translates it to
> what the consumers are used to consuming? That seems like something
> very difficult to get done.

Yes, you need to do that. If you don't do that, your alternative is a
flag day where you update the producer and consumers at the exact same
time and make sure no messages linger in queues. It's what we do now,
and it doesn't work.

>> The big problem is that right now the majority of messages are not
>> formatted in a way that makes sense and really need to be changed
>> to be simple, flat structures that contain the information services
>> need and nothing they don't. I'd like to get those fixed in a way
>> that doesn't require massive coordinated changes in apps.
>
> In Copr, for example, we take this as an opportunity to change our
> format. If the messaging framework supports format deprecation, we
> might go that way as well to avoid a sudden change. But we don't
> currently have many (or maybe any) consumers, so I am not sure it is
> necessary for us.
>
> I am not familiar with protocol buffers, but that thing seems rather
> useful to me if you want to send the content in a compact binary
> form to save as much space as possible. If we will send content that
> can already be interpreted as JSON, then building higher-level
> classes and objects on top of it seems unnecessary.
>
> I think we could really just take that already existing generic
> framework you were talking about (RabbitMQ?) and just make sure we
> can check the content against message schemas on the producer side
> (which is great for catching little bugs), and that we know how a
> message format can get deprecated (e.g. by the messaging framework
> adding a "deprecated_by: <topic>" field into each message, and
> somehow logging warnings on the consumer side). The framework could
> also automatically transform the messages into some language-native
> structures: in Python, munches would probably be the most sexy ones.
>
> The whole "let's package schemas" thing seems like something we
> would typically do (because we are packagers), but not as something
> that would solve the actual problems you have mentioned. Rather, it
> makes them more difficult to deal with, if I am correct.

As many people will tell you, I am not a big believer in the "let's
turn everything into RPMs manually" idea. I'm fine, as I mentioned,
with just a Python package. However, I think you're underestimating
the power of a high-level API.

Consider, for example, message notifications (FMN and whatever its
successor looks like). There needs to be a consistent way to take that
message, whatever its wire format, and turn it into a human-readable
message. There needs to be a consistent way to extract the users
associated with a message. There needs to be a way to extract what
packages are affected by the message.

The current solution is fedmsg-meta-fedora-infrastructure, a central
Python module where schema are poorly encoded by a series of if/else
statements. It also regularly breaks when message schema change. For
example, I have 2200 emails from the notifications system about how
some Copr and Bodhi messages are unparsable. No one remembers to
update the package, and it ultimately means their messages are dropped
or reach users as incomprehensible JSON.

With the current approach, you can just implement a __str__ method on
a class you keep in the same Git repo you use for your project, much
like the sketch above. You can write documentation on the classes so
users can see what messages your projects send. You can release it
whenever you see fit, not when whoever maintains fedmsg-meta has time
to make a release.

It seems like your main objection is the Python package. Personally, I
think making a Python package is a trivial amount of work for the
benefit of being able to define an arbitrary Python API to work with
your messages, but maybe that's not a widely-shared sentiment. If it's
not, and we decide the only thing we really want in addition to the
message is a human-readable string, maybe we could include that in the
message in a standard way. Things like i18n of notifications might no
longer be as easy, though.

--
Jeremy Cline
XMPP: jeremy@xxxxxxxxxx
IRC: jcline