Re: AI/ML Model and Pre-Trained Weight Packaging in Fedora

Tim Flink <tflink@xxxxxxxxxxxxxxxxx> · Fri, 1 Mar 2024 15:38:22 -0700

On 3/1/24 15:32, Tim Flink wrote:

On 3/1/24 14:54, Neal Gompa wrote:
On Fri, Mar 1, 2024 at 4:52 PM Tim Flink <tflink@xxxxxxxxxxxxxxxxx> 
wrote:

On 2/28/24 19:03, Richard Fontana wrote:
On Tue, Feb 27, 2024 at 5:58 PM Tim Flink <tflink@xxxxxxxxxxxxxxxxx> 
wrote:

On 2/26/24 19:06, Richard Fontana wrote:

<snip>

4. Is it acceptable to package code which downloads pre-trained 
weights from a non-Fedora source upon first use post-installation 
by a user if that model and its associated weights are
       a. For a specific model?

What do you mean by "upon first use post-installation"? Does that mean
I install the package, and the first time I launch it or whatever, it
automatically downloads some set of pre-trained weights, or is this
something that would be controlled by the user? The example you gave
suggests the latter but I wasn't sure if I was misunderstanding.

Once the package is installed, pre-trained weights would be downloaded 
if and only if code written to use a specific model with pre-trained 
weights is run. In the cases I'm aware of, code that would cause the 
weights to be downloaded is not directly part of the packaged 
libraries and anything that could trigger the downloading of 
pre-trained weights would have to be written by a user or contained 
in a separate package. If a specific model with pre-trained weights 
is not used and not executed by another library/application, the 
weights will not be downloaded. With the ViT example, the vitb16 
weights would be downloaded when that code (not included in the 
package) is run but the vitb32 weights would not be downloaded unless 
the example was changed or something else specified a pre-trained ViT 
model with the vitb32 weights. Similarly, the weights for other 
models (googlenet, as an example) would not be downloaded unless code 
that uses that specific model in its pre-trained form is executed 
post-installation.

The implementations that I'm familiar with will check for downloaded 
weights as the code is initialized. When done in this way, the 
download is transparent to the user and unless code using these 
models/weights is written in such a way that the user has a choice, there 
is not much a user could do to change the download URL or prevent the 
weights from being downloaded. The only ways I can think of offhand 
would be to modify the underlying libraries to override the 
hard-coded URLs or maybe put identically named files in the cache 
location but that would end up being dependent on model 
implementation. For the specific libraries I used as examples, I 
don't know what the local download folder is off the top of my head, 
nor do I know if they do any verification of downloads so putting 
files into the cached location may not work if they don't match the 
intended file contents.
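
A sketch of why pre-placing a file in the cache may or may not work, 
depending on verification. This assumes a SHA-256 check purely for 
illustration; whether the libraries mentioned above verify their 
downloads is not established here. The names get_weights and 
fake_fetch, the file name, and the byte strings are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def get_weights(cache_file: Path, expected_sha256: str,
                fetch: Callable[[], bytes]) -> bytes:
    """Use the cached file if it verifies; otherwise download and cache."""
    if cache_file.exists():
        data = cache_file.read_bytes()
        if sha256_hex(data) == expected_sha256:
            return data  # cached file matches the expected contents
        # Hash mismatch: the cached file is ignored and replaced below.
    data = fetch()
    if sha256_hex(data) != expected_sha256:
        raise ValueError("downloaded weights failed verification")
    cache_file.write_bytes(data)
    return data


official = b"official weights"
expected = sha256_hex(official)
fetch_calls = []


def fake_fetch() -> bytes:
    # Stand-in for the network download; counts how often it is used.
    fetch_calls.append(1)
    return official


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pth"

    # A user drops a different file with the same name into the cache.
    path.write_bytes(b"substitute weights")
    result = get_weights(path, expected, fake_fetch)
    replaced = path.read_bytes() == official

    # The cache now holds a verified file, so no further download happens.
    get_weights(path, expected, fake_fetch)
```

With a check like this, an identically named substitute is discarded 
and the original download happens anyway. Without the check, the 
substitute would simply be loaded.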

This is just my opinion but I doubt that many people writing code 
that uses pre-trained models are going to go out of their way to help 
users avoid downloading pre-trained weights. I know that for code 
that I've written using pre-trained models, it might be able to 
execute without the pre-trained weights but the output would just be 
noise in that situation. I would have a hard time justifying the work 
needed to make those downloads optional since it would make the code 
useless for what it was intended to do.

It may also be worth noting that some models with pre-trained weights 
are almost useless without those weights. For some (mostly older) 
models, it's feasible to train a model from scratch but for many of 
the recent models, it's just not feasible. As an example, the weights 
for Meta's Llama 2 took 3.3 million hours of GPU time to train [1] 
at a cost in the millions of USD, ignoring what it would take to 
obtain enough data to train a model that large.
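
For a sense of scale, the arithmetic can be sketched as below. The GPU 
hour figure is the one quoted above; the hourly rates are assumptions 
chosen only for illustration and are not actual prices from any 
provider.

```python
GPU_HOURS = 3_300_000  # GPU time quoted above for Llama 2

# Assumed rates in USD per GPU-hour; illustrative, not real pricing.
ASSUMED_RATES_USD = [1.0, 2.0, 4.0]


def training_cost(gpu_hours: int, usd_per_gpu_hour: float) -> float:
    """Compute cost only; data collection and staff are not included."""
    return gpu_hours * usd_per_gpu_hour


estimates = {rate: training_cost(GPU_HOURS, rate) for rate in ASSUMED_RATES_USD}
for rate, cost in estimates.items():
    print(f"${rate:.2f}/GPU-hour -> ${cost:,.0f}")
```

Even at the lowest assumed rate the compute alone lands in the millions 
of USD.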

Apologies for my verbosity but I hope that I answered your question 
and the extra bits weren't entirely useless.

This sounds like it falls in the same bucket as pip, snapd, gem, and
other similar "package manager" functionality.

Yeah, the capabilities do overlap but in my opinion, the intended uses 
are different and that may be worth noting.

pip, as an example, is intended to allow users to install python packages 
sourced from outside Fedora repos. I don't believe that software which 
used pip after installation with no direct user interaction would be 
allowed in Fedora.

The pre-trained models that I'm familiar with, however, download things 
transparently to the user with no warning outside of a log message when 
the weights are first downloaded.

As an example, I wrote some code called openqa_classifier [1] to test 
the possibility of identifying OpenQA [2] test failures as duplicates of 
a long running issue. The code was written only to run an experiment so 
I wouldn't package it in its current form but for the sake of argument, 
let's say that I packaged it in its current form. Only one of the 
experiments is relevant here - the one that looks at whether existing, 
more sophisticated pre-trained models can outperform a simple custom 
model trained from scratch.

If you installed openqa_classifier (pretending that the data was already 
available and that I created a sane entry point for the cli) and ran 
'openqa_classifier train torch', that command would almost immediately 
download pre-trained weights from a URL that's hardcoded in the 
torchvision module if those weights didn't already exist locally. The 
only user-facing indication that this had happened would be a few lines 
in the cli output and some new files on disk.

Extending this example to the not-hardcoded-in-packaged-code variety, 
running 'openqa_classifier train huggingface' would almost immediately 
download model specifications from huggingface.co and whatever 
pre-trained weights those specifications currently point to.

An example of the pre-trained models that code uses is https://huggingface.co/microsoft/swinv2-large-patch4-window12-192-22k
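
The difference between the two varieties can be sketched roughly as 
follows. This is an illustration under stated assumptions, not the 
torchvision or Hugging Face implementation: the placeholder URLs, the 
"weights_url" field, and the names resolve_hardcoded, resolve_via_hub 
and fake_fetch_spec are hypothetical, and the remote hub is simulated 
so the sketch runs offline.

```python
from typing import Callable, Dict, List

# Variant 1: URL fixed in packaged code (placeholder, not a real URL).
HARDCODED_URL = "https://example.invalid/weights/model-v1.pth"


def resolve_hardcoded() -> str:
    """The location can be known by reading the packaged source."""
    return HARDCODED_URL


def resolve_via_hub(model_id: str,
                    fetch_spec: Callable[[str], Dict[str, str]]) -> str:
    """The location is whatever the remote spec says at call time."""
    spec = fetch_spec(model_id)
    return spec["weights_url"]


# Stand-in for a remote hub whose spec changes between two runs.
hub_responses: List[Dict[str, str]] = [
    {"weights_url": "https://example.invalid/hub/rev-a/weights.bin"},
    {"weights_url": "https://example.invalid/hub/rev-b/weights.bin"},
]


def fake_fetch_spec(model_id: str) -> Dict[str, str]:
    # Each call returns the next simulated hub response.
    return hub_responses.pop(0)


model_id = "microsoft/swinv2-large-patch4-window12-192-22k"
first_run = resolve_via_hub(model_id, fake_fetch_spec)
second_run = resolve_via_hub(model_id, fake_fetch_spec)
```

In the hard-coded case a reviewer can see the download location in the 
package source. In the hub case the same installed code can end up 
fetching different files on different days.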

Tim

I'm not arguing against including code which could download pre-trained 
weights but I do want to be reasonably sure that I've explained all this 
correctly.

Tim

[1] https://pagure.io/fedora-qa/openqa_classifier
[2] https://openqa.fedoraproject.org/