Currently we distribute the database alongside the library code. We have long known this was not ideal over the long term, since the database is updated much more frequently than the code needs to be, and people also wish to be able to consume database updates without updating the code. As such, I think it is time to look at splitting the database off into a separate package and git repository from the code, which we can then distribute and release on separate timelines. In fact, for the database we'd probably just want to make daily automated tar.gz snapshots available from git, rather than doing manual releases.

In the new git repository I think we'd need to have the following pieces of the current codebase:

 - data/
 - tests/isodata/
 - tests/test-isodetect.c

When doing an automated release, we'd want to run some tests to make sure the database is syntactically valid & can be successfully loaded, as well as the existing ISO unit tests.

I think we also need to consider what we need to do to future proof ourselves, because once we start distributing the database separately from the code, we are making a much stronger commitment to supporting the current database format long term. From that POV, I think we need to consider a few things:

 - Is XML the format we want to use long term? We already ditched XML for the PCI & USB ID databases, in favour of directly loading the native data sources, because XML was just too damn slow to parse. I'm concerned that as we increase the size of the database we might find this becoming a more general problem. So should we do experiments to see if something like JSON or YAML is faster to load data from? (See the benchmark sketch below.) If we want to use a different format, should we do it exclusively or in parallel, eg should we drop XML support if we switch to JSON, or should we keep XML support and automatically generate a JSON version of the database?

 - Do we need to mark a database schema version in some manner? eg any time we add new attributes or elements to the schema, we should increment the version number. That would allow the library to detect what version of the data it is receiving. Even though old libraries should be fine accepting new database versions, and new libraries should be fine accepting old database versions, experience has told me we should have some kind of versioning information as a "get out of jail free" card.

 - Should we restructure the database? eg, we have a single data/oses/fedora.xml file that contains the data for every Fedora release. This is already 200kb in size and will grow forever. If we split up the files so that there is only ever one entity (os, hypervisor, device, etc) in each XML file, each file will be smaller in size. This would also let us potentially do database minimization, eg we could provide a download that contains /all/ OSes, and another download that contains only non-end-of-life OSes.

 - Should we formalize the specification so that we can officially support other library implementations? While libosinfo is accessible from many languages via GObject introspection, some projects are still loath to consume Python libraries backed by native code, eg OpenStack would really prefer to be able to just pip install a pure Python implementation. Currently the libosinfo library includes some implicit business logic about how you load the database and deal with overrides from different files, eg if you have the same OS ID defined in multiple XML files, which one "wins", and which paths are supposed to be considered when loading files (a sketch of one possible precedence rule follows below).
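On the format question in the first point above, the experiment could start out very simple: time how long it takes to parse equivalent dumps of the same data in each format. The sketch below is only illustrative - the fedora.xml / fedora.json file names are hypothetical stand-ins for whatever converted copies of the database we generate, and a YAML variant could be added the same way via PyYAML:

  #!/usr/bin/env python
  # Rough load-time comparison between formats. The input file names are
  # hypothetical; they stand for equivalent dumps of the same records.
  import json
  import timeit
  import xml.etree.ElementTree as ET

  def load_xml():
      ET.parse("fedora.xml")

  def load_json():
      with open("fedora.json") as fh:
          json.load(fh)

  for name, func in (("xml", load_xml), ("json", load_json)):
      # Best of 3 runs of 100 parses each, to smooth out noise.
      secs = min(timeit.repeat(func, number=100, repeat=3))
      print("%-4s %.3f sec per 100 loads" % (name, secs))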
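And to give a flavour of what a formal spec would have to pin down for the override behaviour, here is a minimal sketch in Python. It assumes a "later directory wins" policy and hypothetical database locations; it is not a statement of what libosinfo actually does today - the real rules currently only live in the C code, which is precisely the problem:

  #!/usr/bin/env python
  # Illustrative only: assumes a "later directory wins" override policy and
  # hypothetical database paths, not libosinfo's actual behaviour.
  import glob
  import os
  import xml.etree.ElementTree as ET

  def load_os_entities(db_dirs):
      """Load <os> entities from each directory, in increasing priority order."""
      oses = {}
      for db_dir in db_dirs:
          for path in sorted(glob.glob(os.path.join(db_dir, "oses", "*.xml"))):
              root = ET.parse(path).getroot()
              for os_elem in root.findall("os"):
                  # If the same OS ID appears again in a later directory,
                  # the later definition replaces the earlier one.
                  oses[os_elem.get("id")] = os_elem
      return oses

  # Hypothetical example: system database first, then local admin overrides.
  db = load_os_entities(["/usr/share/libosinfo/db", "/etc/libosinfo/db"])
  print("loaded %d OS entries" % len(db))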
   Still on the specification point: in the future it may also need to cover how to download live updates over the net. The library also has logic about how you detect ISO images & install trees from the media data and how to generate kickstart files, etc, none of which is formally specified or documented.

 - How do we provide notifications when updates are available? eg, we don't want 1000's of clients checking the libosinfo website daily to download a new database if it hasn't changed since they last checked. Can we efficiently provide info about database updates so people can check and avoid downloading if nothing has changed? I have thought about perhaps adding a DNS TXT record that records the SHA256 checksum of the database, so clients can do a simple DNS lookup to check for update availability. This is nice and scalable thanks to DNS server caching & TTLs, avoiding hitting the webserver most of the time.

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|