On 2022-09-22 at 20:35:08, Scheffenegger, Richard wrote:
> Also, at least for ZIP (not so much for TAR), objects residing in
> different subdirectories can be stored in any order - and only need to
> be referenced properly in the central directory. Thus whenever a
> subthread has completed the reading of a (sufficiently small) object
> to be in (git program) memory, it should be sent immediately to the
> ZIP writer thread. The result would be that small and hot files (which
> can be read in quickly) end up at the beginning of the zip file, but
> the parallel threads can already, in parallel, read-in larger and
> colder object - the absolute wait time within the worker thread
> reading those objects may be slightly higher, but as many objects are
> read in in parallel, the absolute time to create the archive would be
> minimized.

Maybe they can technically be stored in any order, but people don't want
git archive to produce non-deterministic archives. I'm one of the folks
responsible for the service at GitHub that serves archives (which uses
git archive under the hood) and people become very unhappy when the
archives are not bit-for-bit identical, even though neither Git nor
GitHub guarantee that. That's because people want to use those archives
with cryptographic hashes like SHA-256, and if the file changes, the
hash breaks. (We tell them to generate a tarball as part of the release
process and upload it as a release asset instead.)

What Git does implicitly guarantee is that the result is deterministic:
that is, given the same repository and the same version of Git, that the
archive is identical. The encoding may change across versions, but not
within a version.

I feel like it would be very difficult to achieve the speedups you want
and still produce a deterministic archive.

-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
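
To illustrate the determinism concern raised above, here is a minimal sketch using Python's standard zipfile module rather than git archive itself. The helper names (build_zip, sha256_hex), the sample file names, and the fixed timestamp are hypothetical and chosen only for this demonstration. It suggests that two ZIP files holding the same entries in a different order are both valid but are not bit-for-bit identical, so their SHA-256 hashes differ.

```python
import hashlib
import io
import zipfile

# Fixed timestamp (an assumption for this demo) so that only the entry
# order differs between the archives built below.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)


def build_zip(entries):
    """Write (name, data) pairs into an in-memory ZIP, in the order given."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in entries:
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            zf.writestr(info, data)
    return buf.getvalue()


def sha256_hex(blob):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(blob).hexdigest()


# Hypothetical contents: one small "hot" file and one larger "cold" file.
files = [("a/small.txt", b"hot"), ("b/large.bin", b"cold" * 1024)]

in_order = build_zip(files)
repeat = build_zip(files)
# Simulates reader threads handing objects to the writer in completion order.
completion_order = build_zip(list(reversed(files)))

print("same order, same hash:      ", sha256_hex(in_order) == sha256_hex(repeat))
print("different order, same hash: ", sha256_hex(in_order) == sha256_hex(completion_order))
```

Both archives unpack to the same files. Only the byte layout, and therefore the hash, changes, which is the property that people checking downloaded archives against a recorded SHA-256 depend on.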