Ping: does anyone have any thoughts on these general ideas? I'm not
suggesting we necessarily do everything at once, but I'd like us to at
least figure out a direction to move forward in.

On Wed, Jul 22, 2015 at 11:46:23AM +0100, Daniel P. Berrange wrote:
> Currently we distribute the database alongside the library code. We have
> long known this was not ideal over the long term, since the database is
> updated much more frequently than the code needs to be, and also people
> wish to be able to consume database updates without updating the code.
>
> As such, I think it is time to look at splitting the database off into
> a separate package and git repository from the code, which we can then
> distribute and release on separate timelines. In fact for the database
> we'd probably just want to make available daily automated tar.gzs from
> git, rather than doing manual releases.
>
> In the new git repository I think we'd need to have the following pieces
> of the current codebase:
>
>  - data/
>  - tests/isodata/
>  - tests/test-isodetect.c
>
> When doing an automated release, we'd want to run some tests to make
> sure the database is syntactically valid and can be successfully loaded,
> as well as the existing ISO unit tests.
>
> I think we also need to consider what we need to do to future-proof
> ourselves, because once we start distributing the database separately
> from the code, we are making a much stronger commitment to supporting
> the current database format long term. From that POV, I think we need
> to consider a few things:
>
>  - Is XML the format we want to use long term?
>
>    We already ditched XML for the PCI & USB ID databases, in favour of
>    directly loading the native data sources, because XML was just too
>    damn slow to parse. I'm concerned that as we increase the size of
>    the database we might find this becoming a more general problem.
>
>    So should we do experiments to see if something like JSON or YAML
>    is faster to load data from?
>
>    If we want to use a different format, should we do it exclusively
>    or in parallel?
>
>    eg should we drop XML support if we switch to JSON, or should
>    we keep XML support and automatically generate a JSON version
>    of the database?
>
>  - Do we need to mark a database schema version in some manner?
>
>    eg any time we add new attributes or elements to the schema we
>    should increment the version number. That would allow the library
>    to detect what version of the data it is receiving. Even though old
>    libraries should be fine accepting new database versions, and
>    new libraries should be fine accepting old database versions,
>    experience has told me we should have some kind of versioning
>    information as a "get out of jail free" card.
>
>  - Should we restructure the database?
>
>    eg, we have a single data/oses/fedora.xml file that contains
>    the data for every Fedora release. This is already 200kb in
>    size and will grow forever. If we split up all the files
>    so there is only ever one entity (os, hypervisor, device, etc)
>    in each XML file, each file will be smaller in size. This would
>    also let us potentially do database minimization. eg we could
>    provide a download that contains /all/ OSes, and another download
>    that contains only non-end-of-life OSes.
>
>  - Should we formalize the specification so that we can officially
>    support other library implementations?
>
>    While libosinfo is accessible from many languages via GObject
>    introspection, some projects are still loath to consume python
>    libraries backed by native code. eg openstack would really
>    prefer to be able to just pip install a pure python impl.
>
>    Currently the libosinfo library includes some implicit business
>    logic about how you load the database, and how to deal with
>    overrides from different files. eg if you have the same OS ID
>    defined in multiple XML files, which one "wins"? Also which paths
>    are supposed to be considered when loading files.
>    In the future, possibly also how to download live updates over the
>    net. It also has logic about how you detect ISO images & install
>    trees from the media data, how to generate kickstart files, etc,
>    none of which is formally specified or documented.
>
>  - How do we provide notifications when updates are available?
>
>    eg, we don't want thousands of clients checking the libosinfo
>    website daily to download a new database if it hasn't changed since
>    they last checked. Can we efficiently provide info about database
>    updates, so people can check and avoid downloading if it hasn't
>    changed? I have thought about perhaps adding a DNS TXT record that
>    records the SHA256 checksum of the database, so clients can do a
>    simple DNS lookup to check for update availability. This is nice and
>    scalable thanks to DNS server caching & TTLs, avoiding hitting the
>    webserver most of the time.
>
> Regards,
> Daniel
> --
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
>
> _______________________________________________
> Libosinfo mailing list
> Libosinfo@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/libosinfo

Regards,
Daniel
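
To make the DNS TXT idea from the last bullet a bit more concrete, here
is a rough Python sketch of what the client-side check might look like.
None of this is an agreed design: the record name, the "sha256=<hex>"
payload format and the fetch_txt callable are all assumptions made up
for illustration. The actual DNS query is left to the caller, since the
Python standard library has no TXT resolver.

```python
import hashlib
from typing import Callable, Optional

# Hypothetical record name; no such record is published today.
TXT_RECORD_NAME = "db.libosinfo.example"


def sha256_of_file(path: str) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so a large database archive is not held in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_txt_checksum(txt: str) -> Optional[str]:
    """Extract the digest from an assumed 'sha256=<hex>' TXT payload."""
    key, sep, value = txt.strip().partition("=")
    value = value.strip().lower()
    if not sep or key.strip().lower() != "sha256":
        return None
    if len(value) != 64 or any(c not in "0123456789abcdef" for c in value):
        return None
    return value


def update_available(local_db: str, fetch_txt: Callable[[str], str]) -> bool:
    """Return True if the published checksum differs from the local file's.

    `fetch_txt` is a caller-supplied function that performs the real DNS
    TXT lookup. A malformed record is treated as "no update", so clients
    do not hit the webserver on the strength of garbage data.
    """
    published = parse_txt_checksum(fetch_txt(TXT_RECORD_NAME))
    if published is None:
        return False
    return published != sha256_of_file(local_db)
```

A client would only go on to download the tar.gz when update_available()
returns True, so in the common "nothing changed" case the only traffic
is a DNS lookup that is usually answered from a resolver cache.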