Re: AI/ML Model and Pre-Trained Weight Packaging in Fedora

Richard Fontana <rfontana@xxxxxxxxxx> · Mon, 26 Feb 2024 21:06:35 -0500

On Mon, Feb 26, 2024 at 6:32 PM Tim Flink <tflink@xxxxxxxxxxxxxxxxx> wrote:

> 1. Are pre-trained weights considered to be normal non-code content/data or do they require special handling?

My thought is that they should be considered "content" for Fedora
packaging purposes. The legal docs say:

"For purposes of Fedora license classification, “content” means any
material that is not clearly code, documentation, fonts or firmware.
Here are some examples of content:
- graphic image files
- audio files
- nonfunctional data sets
- AppStream metainfo.xml files
- standards documents
- certain files relating to functionality and management of markup
  languages, including XML schema files and resource resolution files,
  XSL files, SGML declaration files, and ancillary informal
  documentation accompanying such files"

First, aren't trained model weights a kind of "nonfunctional data
set"? (We don't define what "nonfunctional" means -- I'm pretty sure
we copied that phrase from the old wiki documentation -- and frankly
I'm not sure what it means, but I think it goes to the nonexecutable
nature of the data. Weights don't function by themselves.)

Second, it seems to me that pretrained model weights are "not clearly
code, documentation, fonts or firmware". One of the purposes of the
content category is to allow relaxed license criteria for
noncode/non-documentation things needed by Fedora packages. Note
though that the current relaxed criteria extend to only two features:
- "The license may restrict or prohibit modification"
- "The license may say that it does not cover patents or grant any patent
  licenses" (the latter being a reference to CC0)

However, there is a reason why I felt it was important to bump this
issue to FESCo. I thought FESCo might wish to take a position that
pretrained weights, being the result of a training process on some
training data, are analogous to object code (even if not "code" for
Fedora license classification purposes). It sounds like they don't
want to take a position on this.

This topic relates very closely to certain current issues of interest
in the wider world, for example the Open Source Initiative's effort to
define "Open Source AI" (see: https://discuss.opensource.org/). There
is definitely some sentiment among some participants in that effort
that, for a so-called "AI system" to be "open source", training data
must be "open", largely because it is thought that this is necessary
for users to exercise rights of modification. I don't think that
debate is dispositive of the Fedora question. If a Fedora package
contains pretrained weights, it is not necessarily an assertion that
such a package is "open source" in a precise sense, any more than
Fedora is asserting that firmware packages are "open source". It is
true that Fedora cannot reasonably claim to be 100% FOSS if it
packages stuff like firmware, or "content" under licenses that
prohibit modification.

You might say that holders of those viewpoints in the OSI effort are
adopting a view that model weights are "code", if you map things to
Fedora license approval concepts.

Anyway, I'm struggling to see a justification for not classifying
pretrained weights as "content". I am not sure it is of much practical
significance though.

> 2. If an upstream offers pre-trained weights and indicates that those weights are available under a license which is acceptable for non-code content in Fedora, can those pre-trained weights be included in Fedora packages?

This is what I thought ought to be a FESCo question. If FESCo doesn't
actually care and sees this as a Fedora legal question, then this
question is really equivalent to the first question, isn't it? If it's
"content", and it's under a license acceptable for "content", then as
far as Fedora legal is concerned it can be included in Fedora
packages.

> 3. Extending question 2, is it considered sufficient for an upstream to have a license on pre-trained weights or would a packager/reviewer need to verify that the data used to train those weights is acceptable?

So this is where I think we should initially be a little cautious and
look at these things on a case-by-case basis, perhaps until we get
more experience with handling this topic. Maybe there could be
circumstances where given what is disclosed, or not disclosed, about
how a model was trained, we might want to not package the pretrained
weights in Fedora. I think that is unlikely, but not impossible.

> 4. Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are
>     a. For a specific model?
>     b. For a user-defined model which may or may not exist at the time of packaging?
>
>
>
> I can provide examples of any of these situations if that would be helpful.

Can you elaborate on 4a/4b with examples?

Richard