On Mon, Oct 1, 2018 at 7:36 PM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Mon, Oct 1, 2018 at 12:00 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
> >
> > hi guys,
> >
> > i noticed that Shengjing raised his concern regarding the .joblib
> > files introduced along with the diskprediction plugin[0,1]. these data
> > files are released into the public domain. and because the source of
> > these files is not released at this moment, he argued that "this is
> > still not free". i agree, to some degree: it's arguable that they're
> > not free in the sense of "free software", or not compliant with the
> > DFSG[2] to be specific, but i believe the license is valid per se.
>
> In my non-legally-trained opinion, when distributing public domain
> files the key distinction is whether these files are considered
> software (i.e. having source code) or just data.
>
> If we consider these files to be software, then it's correct to say
> that a public domain binary is non-free. If we consider them data,
> then a public domain binary is just a piece of data (analogous to
> distributing a .jpeg file but not the photographer's original .raw
> file). I would lean toward the second view -- in my view, machine
> learning datasets are not source code, as they're numeric data rather
> than computer instructions.

yeah, i expect one could argue that a pre-trained model is part of the
software, and so should come with its source if one claims it's free
software. but please note that a JPEG can be edited with GIMP, and the
modified JPEG file is still viewable in the general sense. this does
not apply to statistical data: even a well-trained professional cannot
easily modify it in a manner that preserves its clustering (in the
sense of pattern recognition) performance. actually, i don't think it's
general practice to modify machine learning models manually. people
tend to re-train the model using a new/modified dataset or a new
clustering approach.
>
> Still, it would certainly be a good thing to have the original
> training data available, to avoid any possible ambiguity arising from
> differing interpretations, and to make it obvious how others can
> recreate models from alternative source data.

i think the value of the dataset, tooling and documentation is that
they ensure the user has the freedom to create/modify the models.

>
> John
>
> >
> > i am wondering if we could move further by providing users with the
> > pre-labeled SMART dataset for all the combinations of SMART
> > attributes listed in config.json, plus the script and documentation
> > for training on them, as long as only commodity hardware and free
> > software are required to process the dataset. that way they are
> > accessible to the public. and could these datasets be DFSG-free this
> > way? see tesseract-ocr[3] as an example.
> >
> > i know there are some discussions[4] regarding freedom versus
> > machine learning models. but in our case, i think it's much
> > simpler, because, unlike the datasets used by image/speech
> > recognition, the scale/size of SMART attributes is much smaller than
> > that of video/audio sequences, nor are they likely to contain user
> > data. i think it's even an opportunity for our users to train on the
> > dataset or label a good/bad disk, and to transition from user to
> > contributor by contributing to the dataset.
> >
> > what do you think?
> >
> > cheers,
> >
> > --
> > [0] https://github.com/ceph/ceph/pull/22239
> > [1] https://github.com/ceph/ceph/pull/24104
> > [2] https://www.debian.org/social_contract#guidelines
> > [3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699609 and
> > https://github.com/tesseract-ocr/langdata
> > [4] https://lwn.net/Articles/760142/
> >
> > --
> > Regards
> > Kefu Chai

--
Regards
Kefu Chai