Hi Daniel,

On Wed, Jul 22, 2015 at 11:46 AM, Daniel P. Berrange <berrange@xxxxxxxxxx> wrote:
> Currently we distribute the database alongside the library code. We have
> long known this was not ideal over the long term, since the database is
> updated much more frequently than the code needs to be, and people also
> wish to be able to consume database updates without updating the code.
>
> As such, I think it is time to look at splitting the database off into
> a separate package and git repository from the code, which we can then
> distribute and release on separate timelines. In fact, for the database
> we'd probably just want to make daily automated tar.gzs available from
> git, rather than doing manual releases.

That would be nice indeed; I have always wished for such a system.

> When doing an automated release, we'd want to run some tests to make
> sure the database is syntactically valid & can be successfully loaded,
> as well as the existing ISO unit tests.

Indeed.

> I think we also need to consider what we need to do to future-proof
> ourselves, because once we start distributing the database separately
> from the code, we are making a much stronger commitment to supporting
> the current database format long term. From that POV, I think we need
> to consider a few things:
>
> - Is XML the format we want to use long term?
>
>   We already ditched XML for the PCI & USB ID databases, in favour of
>   directly loading the native data sources, because XML was just too
>   damn slow to parse. I'm concerned that as we increase the size of
>   the database we might find this becoming a more general problem.
>
>   So should we do experiments to see if something like JSON or YAML
>   is faster to load data from?

I was thinking we could have our own binary format that the database is
transformed into on loading, with a caching mechanism in place so the
transformation is only done once per installation per version.
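To make the caching idea concrete, here is a minimal sketch of what I have in mind, assuming we key each cache entry on a SHA256 of its source XML file, so a stale entry can never be reused after the XML changes. The cache directory, the pickle-based binary format, and the trivial "parsed data" structure are all illustrative placeholders, not a proposal for the real on-disk format:

```python
import hashlib
import os
import pickle
import xml.etree.ElementTree as ET

CACHE_DIR = "/tmp/osinfo-cache"  # illustrative location only

def file_sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def load_with_cache(xml_path):
    """Return parsed data for xml_path, reusing the binary cache entry
    if the XML file has not changed since the entry was generated."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    digest = file_sha256(xml_path)
    cache_path = os.path.join(CACHE_DIR, digest + ".bin")

    if os.path.exists(cache_path):
        # Cache hit: the XML is unchanged, skip parsing entirely.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Cache miss: parse the XML and store a binary representation.
    root = ET.parse(xml_path).getroot()
    data = {"tag": root.tag, "attrib": dict(root.attrib)}
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

Since the key is derived from the file contents, the 1-1 correspondence between XML and cache files falls out naturally: a new version of an XML file simply produces a new cache entry.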
There will be a 1-1 correspondence between XML and generated files, so the
cache is always used if the corresponding XML file has not changed in a
new version.

> - Do we need to mark a database schema version in some manner?
>
>   eg any time we add new attributes or elements to the schema we
>   should increment the version number. That would allow the library
>   to detect what version of data it is receiving. Even though old
>   libraries should be fine accepting new database versions, and
>   new libraries should be fine accepting old database versions,
>   experience has told me we should have some kind of versioning
>   information as a "get out of jail free" card.

Yeah, although I'd do this as the last part of this mega project.

> - Should we restructure the database?
>
>   eg, we have a single data/oses/fedora.xml file that contains
>   the data for every Fedora release. This is already 200kb in
>   size and will grow forever. If we split up all the files
>   so there is only ever one entity (os, hypervisor, device, etc)
>   in each XML file, each file will be smaller in size. This would
>   also let us potentially do database minimization. eg we could
>   provide a download that contains /all/ OSes, and another download
>   that contains only non-end-of-life OSes.

Or we could simply put end-of-life OSes into separate XML files? Having a
separate XML file for each OS entry would imply a very large number of
files, and I/O performance at load time might become an issue.

> - Should we formalize the specification so that we can officially
>   support other library implementations?
>
>   While libosinfo is accessible from many languages via GObject
>   introspection, some projects are still loath to consume Python
>   libraries backed by native code. eg OpenStack would really
>   prefer to be able to just pip install a pure Python impl.

Sure, but I'd also keep this as a very low-priority nice-to-have item.
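On the minimization / end-of-life split discussed above, the partitioning itself would be a small one-off script along these lines. Note this is only a sketch: the `<eol-date>` element name and the sample document shape are assumptions for illustration, to be adjusted to whatever the schema actually uses:

```python
import xml.etree.ElementTree as ET
from datetime import date

def partition_by_eol(xml_text, today=None):
    """Split <os> entries from a combined file into supported vs
    end-of-life id lists, based on an assumed <eol-date> child element.
    Entries with no <eol-date> are treated as still supported."""
    today = today or date.today()
    root = ET.fromstring(xml_text)
    supported, eol = [], []
    for os_elem in root.findall("os"):
        eol_text = os_elem.findtext("eol-date")
        if eol_text and date.fromisoformat(eol_text) < today:
            eol.append(os_elem.get("id"))
        else:
            supported.append(os_elem.get("id"))
    return supported, eol
```

Run at release time, the two resulting id lists could drive which entity files go into the "all OSes" download versus the trimmed non-EOL one.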
> - How do we provide notifications when updates are available?
>
>   eg, we don't want 1000s of clients checking the libosinfo website
>   daily to download a new database if it hasn't changed since they
>   last checked. Can we efficiently provide info about database updates
>   so people can check and avoid downloading if it hasn't changed? I
>   have thought about perhaps adding a DNS TXT record that records
>   the SHA256 checksum of the database, so clients can do a simple
>   DNS lookup to check for update availability. This is nice and
>   scalable thanks to DNS server caching & TTLs, avoiding hitting the
>   webserver most of the time.

If it would work, that sounds great!

-- 
Regards,

Zeeshan Ali (Khattak)
________________________________________
Befriend GNOME: http://www.gnome.org/friends/

_______________________________________________
Libosinfo mailing list
Libosinfo@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libosinfo
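P.S. The client side of the DNS TXT checksum scheme would reduce to something like the sketch below. The Python standard library has no TXT lookup, so the actual record fetch (e.g. via a resolver library) is deliberately left out; what is shown is only the decision logic, comparing the advertised checksum against the checksum of the locally installed database archive:

```python
import hashlib

def database_checksum(tarball_bytes):
    """SHA256 hex digest of the database archive, matching what the
    TXT record would publish."""
    return hashlib.sha256(tarball_bytes).hexdigest()

def update_available(published_checksum, local_tarball_bytes):
    """True when the checksum advertised via DNS differs from the
    locally installed archive, i.e. a fresh download is worthwhile."""
    return published_checksum != database_checksum(local_tarball_bytes)
```

Since the client only hits DNS (cached, with TTLs) for the comparison, the webserver is contacted solely when `update_available` returns True.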