On Sat, Nov 12, 2022 at 10:21:35AM -0800, Thiago Macieira wrote:
> Not exactly. That's what this file is there for. It allows the algorithm to
> read the current batch file, add 1, then echo back. If the load succeeds,
> the batch exists; if not, then the algorithm should simply go back to 0.

This sounds to me like there's a special order in which those batches
should be executed? I thought they're simply collections of test
sequences which can be run in any order...

> First, there's the question of the ability to see into /lib/firmware. I'm not a
> kernel dev but I'm told that request_firmware() only operates on the root
> container's filesystem view. We're expecting that the application may get
> deployed as a container (with full privileges so it can write to /sys, sure),
> so it won't be able to see the host system's /lib to know what files are
> available. It could "guess" at the file names, based on the current processor's
> family/model/stepping and a natural number, but that's sub-optimal.

It is not about seeing - you simply give it the filename -
request_firmware* does the "seeing". Either the file's there or it
isn't.

> Unless the driver were allowed to load any file named by the application, from
> its own view of the filesystem, permitting the firmware files being distributed
> inside the container.

There's a reason I wrote:

"There will be no requirement on the naming - only on the filename
length and it should be in that directory /lib/firmware/intel/ifs_0/"

Of course the driver should load only from that directory.

> Second, for electrical reasons, we expect that certain processor generations
> will need a timeout between tests before testing can be done again on a given
> core, whether the same batch or the next one. This time out can be in the
> order of many minutes, which is longer than any hyperscaler is willing to
> allocate for a system self-test hogging a core or the whole system, just
> waiting. For example, let's say that the timeout is 15 minutes and there are 4
> batches: this means the whole testing procedure takes one hour, even though
> the actual downtime for each core was less than 1 second. This is lost
> revenue.

All that doesn't matter - if the CPU *must* wait 15 minutes between
batches, then that should be enforced by the driver and not relied upon
by userspace to DTRT.

> Instead, they wish the next available maintenance window to simply resume
> testing at the point where the last one stopped. These windows need not be
> scheduled; they can also be opportunistic, when the orchestrator determines
> the machine or a subset of one is going to be idle. That's what the algorithm
> in the pull request above implements: if the current_batch's result was
> "untested", it is attempted again, otherwise it tries the next one, rolling
> back to 0 if the loading failed. This removes the need to know anything about
> the timeout on the current processor or even whether there is one, or how many
> batches there are.

This all has nothing to do with whether you give it a number or
a filename. How you glue your testing around it together is a userspace
issue - all the kernel driver needs to be able to do is load the
sequence and execute it.

Echoing filenames into sysfs is no different from echoing numbers into
it - the former is simpler.

If the CPU says it cannot execute the sequence currently, you have to
think about how you retry that sequence. How you specify it doesn't
matter.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
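
P.S. To illustrate the last point, here is a rough userspace sketch of the
resume/retry logic quoted above. It is only a sketch under assumptions: the
names run_window, load and run are hypothetical, load() and run() stand in
for whatever the tool writes to and reads from sysfs, and the status strings
are taken from the quoted description. The identifiers can be numbers or
filenames; the logic does not change.

```python
from typing import Callable, Hashable, Sequence, Tuple


def run_window(
    ids: Sequence[Hashable],
    index: int,
    last_status: str,
    load: Callable[[Hashable], bool],
    run: Callable[[], str],
) -> Tuple[int, str]:
    """Run one test sequence in this maintenance window.

    Returns the (index, status) pair to persist for the next window.
    load() and run() are hypothetical callbacks wrapping the sysfs writes.
    """
    if not ids:
        raise ValueError("no test sequences configured")

    # Retry the same sequence if the CPU declined to run it last time;
    # otherwise move on to the next one.
    if last_status != "untested":
        index += 1

    # Roll back to the first sequence when we run off the end of the
    # list or the load fails.
    if index >= len(ids) or not load(ids[index]):
        index = 0
        if not load(ids[0]):
            raise RuntimeError("no loadable test sequence")

    return index, run()
```

Calling it with ids = [0, 1, 2, 3] or with ids = ["a.scan", "b.scan"] (made-up
names) takes exactly the same path, so the retry policy stays a userspace
matter either way.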