Hi Daniel,

On Wed, Jul 22, 2015 at 11:46 AM, Daniel P. Berrange <berrange@xxxxxxxxxx> wrote:
> Currently we distribute the database alongside the library code. We have
> long known this was not ideal over the long term, since the database is
> updated much more frequently than the code needs to be, and people also
> wish to be able to consume database updates without updating the code.
>
> As such, I think it is time to look at splitting the database off into
> a separate package and git repository from the code, which we can then
> distribute and release on separate timelines. In fact, for the database
> we'd probably just want to make daily automated tar.gzs available from
> git, rather than doing manual releases.

That would be nice indeed; I have always wished for such a system.

> When doing an automated release, we'd want to run some tests to make
> sure the database is syntactically valid & can be successfully loaded,
> as well as the existing ISO unit tests.

Indeed.

> I think we also need to consider what we need to do to future-proof
> ourselves, because once we start distributing the database separately
> from the code, we are making a much stronger commitment to supporting
> the current database format long term. From that POV, I think we need
> to consider a few things:
>
> - Is XML the format we want to use long term?
>
>   We already ditched XML for the PCI & USB ID databases, in favour of
>   directly loading the native data sources, because XML was just too
>   damn slow to parse. I'm concerned that as we increase the size of
>   the database we might find this becoming a more general problem.
>
>   So should we do experiments to see if something like JSON or YAML
>   is faster to load data from?

I was thinking we could have our own binary format that the database is
transformed into on loading, with a caching mechanism in place so the
transformation is only done once per installation per version.
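To make the caching idea concrete, here is a minimal sketch of what I have in mind, assuming we key each cache entry on a SHA256 of its source XML file, so a stale entry can never be reused after the XML changes. The cache directory, the pickle-based binary format, and the trivial "parsed data" structure are all illustrative placeholders, not a proposal for the real on-disk format:

```python
import hashlib
import os
import pickle
import xml.etree.ElementTree as ET

CACHE_DIR = "/tmp/osinfo-cache"  # illustrative location only

def file_sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def load_with_cache(xml_path):
    """Return parsed data for xml_path, reusing the binary cache entry
    if the XML file has not changed since the entry was generated."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    digest = file_sha256(xml_path)
    cache_path = os.path.join(CACHE_DIR, digest + ".bin")

    if os.path.exists(cache_path):
        # Cache hit: the XML is unchanged, skip parsing entirely.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Cache miss: parse the XML and store a binary representation.
    root = ET.parse(xml_path).getroot()
    data = {"tag": root.tag, "attrib": dict(root.attrib)}
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

Since the key is derived from the file contents, the 1-1 correspondence between XML and cache files falls out naturally: a new version of an XML file simply produces a new cache entry.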
There will be a 1-1 correspondence between XML and generated files, so the
cache is always used if the corresponding XML file has not changed in a
new version.

> - Do we need to mark a database schema version in some manner?
>
>   eg any time we add new attributes or elements to the schema we
>   should increment the version number. That would allow the library
>   to detect what version of data it is receiving. Even though old
>   libraries should be fine accepting new database versions, and
>   new libraries should be fine accepting old database versions,
>   experience has told me we should have some kind of versioning
>   information as a "get out of jail free" card.

Yeah, although I'd do this as the last part of this mega project.

> - Should we restructure the database?
>
>   eg, we have a single data/oses/fedora.xml file that contains
>   the data for every Fedora release. This is already 200kb in
>   size and will grow forever. If we split up all the files
>   so there is only ever one entity (os, hypervisor, device, etc)
>   in each XML file, each file will be smaller in size. This would
>   also let us potentially do database minimization. eg we could
>   provide a download that contains /all/ OSes, and another download
>   that contains only non-end-of-life OSes.

Or we could simply put end-of-life OSes into separate XML files? Having a
separate XML file for each OS entry would imply a very large number of
files, and I/O performance at load time might become an issue.

> - Should we formalize the specification so that we can officially
>   support other library implementations?
>
>   While libosinfo is accessible from many languages via GObject
>   introspection, some projects are still loath to consume Python
>   libraries backed by native code. eg OpenStack would really
>   prefer to be able to just pip install a pure Python impl.

Sure, but I'd also keep this as a very low-priority nice-to-have item.
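On the minimization / end-of-life split discussed above, the partitioning itself would be a small one-off script along these lines. Note this is only a sketch: the `<eol-date>` element name and the sample document shape are assumptions for illustration, to be adjusted to whatever the schema actually uses:

```python
import xml.etree.ElementTree as ET
from datetime import date

def partition_by_eol(xml_text, today=None):
    """Split <os> entries from a combined file into supported vs
    end-of-life id lists, based on an assumed <eol-date> child element.
    Entries with no <eol-date> are treated as still supported."""
    today = today or date.today()
    root = ET.fromstring(xml_text)
    supported, eol = [], []
    for os_elem in root.findall("os"):
        eol_text = os_elem.findtext("eol-date")
        if eol_text and date.fromisoformat(eol_text) < today:
            eol.append(os_elem.get("id"))
        else:
            supported.append(os_elem.get("id"))
    return supported, eol
```

Run at release time, the two resulting id lists could drive which entity files go into the "all OSes" download versus the trimmed non-EOL one.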
> - How do we provide notifications when updates are available?
>
>   eg, we don't want 1000s of clients checking the libosinfo website
>   daily to download a new database if it hasn't changed since they
>   last checked. Can we efficiently provide info about database updates
>   so people can check and avoid downloading if it hasn't changed? I
>   have thought about perhaps adding a DNS TXT record that records
>   the SHA256 checksum of the database, so clients can do a simple
>   DNS lookup to check for update availability. This is nice and
>   scalable thanks to DNS server caching & TTLs, avoiding hitting the
>   webserver most of the time.

If it would work, that sounds great!

-- 
Regards,

Zeeshan Ali (Khattak)
________________________________________
Befriend GNOME: http://www.gnome.org/friends/

_______________________________________________
Libosinfo mailing list
Libosinfo@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libosinfo
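P.S. The client side of the DNS TXT checksum scheme would reduce to something like the sketch below. The Python standard library has no TXT lookup, so the actual record fetch (e.g. via a resolver library) is deliberately left out; what is shown is only the decision logic, comparing the advertised checksum against the checksum of the locally installed database archive:

```python
import hashlib

def database_checksum(tarball_bytes):
    """SHA256 hex digest of the database archive, matching what the
    TXT record would publish."""
    return hashlib.sha256(tarball_bytes).hexdigest()

def update_available(published_checksum, local_tarball_bytes):
    """True when the checksum advertised via DNS differs from the
    locally installed archive, i.e. a fresh download is worthwhile."""
    return published_checksum != database_checksum(local_tarball_bytes)
```

Since the client only hits DNS (cached, with TTLs) for the comparison, the webserver is contacted solely when `update_available` returns True.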