Re: the license of diskprediction's pre-trained models and more

John Spray <jspray@xxxxxxxxxx> · Mon, 1 Oct 2018 12:35:57 +0100

On Mon, Oct 1, 2018 at 12:00 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
>
> hi guys,
>
> i noticed that Shengjing raised his concern regarding the .joblib
> files introduced along with the diskprediction plugin[0,1]. these data
> files are released into the public domain. and because the source of
> these files is not released at this moment, he argued that "this is
> still not free". i agree to some degree: it's arguable that they're
> not free in the sense of "free software", or compliant with the
> DFSG[2] to be specific, but i believe the license is valid per se.

In my non-legally-trained opinion, when distributing public domain
files the key distinction is whether these files are considered
software (i.e. having source code) or just data.

If we consider these files to be software, then it's correct to say
that a public domain binary is non-free.  If we consider them data,
then a public domain binary is just a piece of data (analogous to
distributing a .jpeg file but not the photographer's original .raw
file).  I would lean toward the second view: machine learning
datasets are not source code, as they're numeric data rather than
computer instructions.

Still, it would certainly be a good thing to have the original
training data available, to avoid any possible ambiguity arising from
differing interpretations, and to make it obvious how others can
recreate models from alternative source data.

John

> i am wondering if we could move further by providing users with the
> pre-labeled SMART dataset for all combinations of SMART attributes
> listed in config.json, plus the script and documentation for
> training on them, provided only commodity hardware and free software
> are required to process the dataset. that way they are accessible to
> the public. and could these datasets be DFSG-free in this way? see
> tesseract-ocr[3] as an example.
>
> i know there are some discussions[4] regarding freedom versus
> machine learning models. but in our case, i think it's much simpler,
> because, unlike the datasets used for image/speech recognition,
> SMART attributes are much smaller in scale/size than video/audio
> sequences, nor are they likely to contain user data. i think it's
> even an opportunity for our users to train on the dataset or label a
> good/bad disk, and to transition from user to contributor by
> contributing to the dataset.
>
> what do you think?
>
> cheers,
>
> --
> [0] https://github.com/ceph/ceph/pull/22239
> [1] https://github.com/ceph/ceph/pull/24104
> [2] https://www.debian.org/social_contract#guidelines
> [3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699609 and
> https://github.com/tesseract-ocr/langdata
> [4] https://lwn.net/Articles/760142/
>
> --
> Regards
> Kefu Chai