On Mon, Feb 26, 2024 at 6:32 PM Tim Flink <tflink@xxxxxxxxxxxxxxxxx> wrote:
> 1. Are pre-trained weights considered to be normal non-code content/data or do they require special handling?

My thought is that they should be considered "content" for Fedora packaging purposes. The legal docs say:

"For purposes of Fedora license classification, “content” means any material that is not clearly code, documentation, fonts or firmware. Here are some examples of content:

  * graphic image files
  * audio files
  * nonfunctional data sets
  * AppStream metainfo.xml files
  * standards documents
  * certain files relating to functionality and management of markup languages, including XML schema files and resource resolution files, XSL files, SGML declaration files, and ancillary informal documentation accompanying such files"

First, aren't trained model weights a kind of "nonfunctional data set"? (We don't define what "nonfunctional" means -- I'm pretty sure we copied that phrase from the old wiki documentation -- and frankly I'm not sure what it means, but I think it goes to the nonexecutable nature of the data. Weights don't function by themselves.)

Second, it seems to me that pretrained model weights are "not clearly code, documentation, fonts or firmware".

One of the purposes of the content category is to allow relaxed license criteria for noncode/non-documentation things needed by Fedora packages. Note though that the current relaxed criteria only extend to two features:

"  * The license may restrict or prohibit modification
  * The license may say that it does not cover patents or grant any patent licenses"

(the latter being a reference to CC0)

However, there is a reason why I felt it was important to bump this issue to FESCo. I thought FESCo might wish to take a position that pretrained weights, being the result of a training process on some training data, are analogous to object code (even if not "code" for Fedora license classification purposes).
It sounds like they don't want to take a position on this.

This topic relates very closely to certain current issues of interest in the wider world, for example the Open Source Initiative's effort to define "Open Source AI" (see: https://discuss.opensource.org/). There is definitely some sentiment among some participants in that effort that, for a so-called "AI system" to be "open source", training data must be "open", largely because it is thought that this is necessary for users to exercise rights of modification.

I don't think that debate is dispositive of the Fedora question. If a Fedora package contains pretrained weights, it is not necessarily an assertion that such a package is "open source" in a precise sense, any more than Fedora is asserting that firmware packages are "open source". It is true that Fedora cannot reasonably claim to be 100% FOSS if it packages stuff like firmware, or "content" under licenses that prohibit modification. You might say that holders of those viewpoints in the OSI effort are adopting a view that model weights are "code", if you map things to Fedora license approval concepts.

Anyway, I'm struggling to see a justification for not classifying pretrained weights as "content". I am not sure it is of much practical significance though.

> 2. If an upstream offers pre-trained weights and indicates that those weights are available under a license which is acceptable for non-code content in Fedora, can those pre-trained weights be included in Fedora packages?

This is what I thought ought to be a FESCo question. If FESCo doesn't actually care and sees this as a Fedora legal question, then this question is really equivalent to the first question, isn't it? If it's "content", and it's under a license acceptable for "content", then as far as Fedora legal is concerned it can be included in Fedora packages.

> 3. Extending question 2, is it considered sufficient for an upstream to have a license on pre-trained weights or would a packager/reviewer need to verify that the data used to train those weights is acceptable?

So this is where I think we should initially be a little cautious and look at these things on a case-by-case basis, perhaps until we get more experience with handling this topic. Maybe there could be circumstances where given what is disclosed, or not disclosed, about how a model was trained, we might want to not package the pretrained weights in Fedora. I think that is unlikely, but not impossible.

> 4. Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are
> a. For a specific model?
> b. For a user-defined model which may or may not exist at the time of packaging?
>
> I can provide examples of any of these situations if that would be helpful.

Can you elaborate on 4a/4b with examples?

Richard

--
legal mailing list -- legal@xxxxxxxxxxxxxxxxxxxxxxx