the license of diskprediction's pre-trained models and more

kefu chai <tchaikov@xxxxxxxxx> · Mon, 1 Oct 2018 18:59:45 +0800

hi guys,

i noticed that Shengjing raised a concern regarding the .joblib
files introduced along with the diskprediction plugin[0,1]. these data
files are released into the public domain. and because the source of
these files is not released at this moment, he argued that "this is
still not free". i agree to some degree: it's arguable that they're
not free in the sense of "free software", or, to be specific, not
compliant with the DFSG[2]. but i believe the license is valid per se.

i am wondering if we could go further by providing users with the
pre-labeled SMART dataset for every combination of SMART attributes
listed in config.json, plus the script and documentation for training
the models, provided that only commodity hardware and free software
are required to process the dataset, so they are accessible to the
public. could they be DFSG-free in this way? see tesseract-ocr[3] as
an example.

i know there are some discussions[4] regarding freedom versus machine
learning models. but in our case, i think it's much simpler because,
unlike the datasets used for image/speech recognition, SMART
attributes are much smaller in scale/size than video/audio sequences,
and they are unlikely to contain user data. i think it's even an
opportunity for our users to train on the dataset or label a good/bad
disk, and to transition from user to contributor by contributing to
the dataset.

what do you think?

cheers,

--
[0] https://github.com/ceph/ceph/pull/22239
[1] https://github.com/ceph/ceph/pull/24104
[2] https://www.debian.org/social_contract#guidelines
[3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699609 and
https://github.com/tesseract-ocr/langdata
[4] https://lwn.net/Articles/760142/

-- 
Regards
Kefu Chai