Re: the license of diskprediction's pre-trained models and more

kefu chai <tchaikov@xxxxxxxxx> · Mon, 1 Oct 2018 23:59:36 +0800

On Mon, Oct 1, 2018 at 7:36 PM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Mon, Oct 1, 2018 at 12:00 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
> >
> > hi guys,
> >
> > i noticed that Shengjing raised a concern regarding the .joblib
> > files introduced along with the diskprediction plugin[0,1]. these data
> > files are released into the public domain. and because the source of
> > these files is not released at this moment, he argued that "this is
> > still not free". i agree to some degree: it's arguable that they're
> > not free in the sense of "free software", or compliant with the
> > DFSG[2] to be specific, but i believe the license is valid per se.
>
> In my non-legally-trained opinion, when distributing public domain
> files the key distinction is whether these files are considered
> software (i.e. having source code) or just data.
>
> If we consider these files to be software, then it's correct to say
> that a public domain binary is non-free.  If we consider them data,
> then a public domain binary is just a piece of data (analogous to
> distributing a .jpeg file but not the photographer's original .raw
> file).  I would lean toward the second view -- in my view, machine
> learning datasets are not source code, as they're numeric data rather
> than computer instructions.

yeah, i expect it's an arguable opinion that a pre-trained model is part
of the software and should come with its source, if one claims it's free
software. but please note, a JPEG can be edited with GIMP, and the
modified JPEG file is still viewable in the general sense. this does
not apply to statistical data: a well-trained professional cannot easily
modify it in a manner that preserves its clustering (in the sense of
pattern recognition) performance. actually, i don't think it's
general practice to modify machine learning models manually.
people tend to re-train the model using a new/modified dataset or a new
clustering approach.

>
> Still, it would certainly be a good thing to have the original
> training data available, to avoid any possible ambiguity arising from
> differing interpretations, and to make it obvious how others can
> recreate models from alternative source data.

i think the value of the dataset, tooling and documentation is to ensure
that the user has the freedom to create/modify the models.
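
to make "create/modify the models" concrete, here is a rough sketch of
what re-training from a labeled dataset could look like. it is not the
diskprediction training code (that is not shown in this thread): the
attribute values, the labels, and the nearest-centroid method are all
assumptions made up for illustration, and pickle stands in for writing
a .joblib file so the example needs nothing beyond the standard library.

```python
import pickle
from statistics import mean

# Hypothetical labeled samples: (SMART-like attribute values, label).
# The attributes and numbers are invented for illustration only.
SAMPLES = [
    ((0.0, 0.0, 31.0), "good"),
    ((1.0, 0.0, 35.0), "good"),
    ((0.0, 1.0, 33.0), "good"),
    ((40.0, 12.0, 45.0), "bad"),
    ((55.0, 20.0, 48.0), "bad"),
    ((38.0, 9.0, 50.0), "bad"),
]


def train(samples):
    """Return {label: centroid} computed from labeled feature vectors."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: distance(model[label]))


# Train, serialize, and restore the model (pickle is an assumed stand-in
# for the .joblib serialization used by the real plugin).
model = train(SAMPLES)
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

with the dataset and a script like this in hand, a user who wants a
different model edits SAMPLES (or the training method) and re-runs
train(), rather than patching the serialized file by hand.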

>
> John
>
>
>
> > i am wondering if we could go further by providing users with the
> > pre-labeled SMART dataset for all combinations of SMART attributes
> > listed in config.json, plus the script and documentation for
> > training the models, provided only commodity hardware and free
> > software are required to process the dataset, so they are accessible
> > to the public. could these datasets be DFSG-free this way? see
> > tesseract-ocr[3] as an example.
> >
> > i know there are some discussions[4] regarding freedom versus
> > machine learning models. but in our case, i think it's much
> > simpler, because, unlike the datasets used for image/speech recognition,
> > the scale/size of SMART attributes is much smaller than video/audio
> > sequences, and they are unlikely to contain user data. i think it's
> > even an opportunity for our users to train on the dataset or label
> > good/bad disks, and to transition from user to contributor by
> > contributing to the dataset.
> >
> > what do you think?
> >
> > cheers,
> >
> > --
> > [0] https://github.com/ceph/ceph/pull/22239
> > [1] https://github.com/ceph/ceph/pull/24104
> > [2] https://www.debian.org/social_contract#guidelines
> > [3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699609 and
> > https://github.com/tesseract-ocr/langdata
> > [4] https://lwn.net/Articles/760142/
> >
> > --
> > Regards
> > Kefu Chai

-- 
Regards
Kefu Chai