RE: git --archive

Hi Junio,

>> Unless I’m mistaken, the procedure to create a zip archive reads like a recursive collection of all relevant objects, and then writing them out sequentially, in a single thread.
>>
>> Is this assessment correct?
>>
>> I was wondering if a highly concurrent fetching phase could be 
>> optionally added…
>
> The details matter here, I think.  Enumerating and slurping down the contents to be archived out of the repository/object store to the core can indeed be made parallel, but the end result product being a zip archive or a tarball, which is fairly a serialized output format, there is only so much you can hold in core, and it is not clear what your plan is to do this without filling all the memory.

"core" (presumably referring to the OS kernel memory for IO caching) is not the only cache in play here. 

As mentioned, the use case is repositories living on storage systems on the order of 500 TB of filesystem capacity - not uncommon at hyperscalers and in large commercial-scale development environments.

Those external filesystems (the filesystems do NOT live in "core") tend to have some more or less sophisticated form of tiering, destaging cold (infrequently accessed) data to high-latency devices. Amazon Glacier is an extreme example: on-demand tape storage, where a robot has to fetch a cartridge and load it into a drive before the data stored there can be read.

In a more realistic scenario, you may have 100 TB on SSD and a few PB on spinning rust, externally connected behind the primary tier.

An initial phase that simply fetches as many objects and sub-trees as possible in parallel (ideally issuing the IOs to the objects in a non-sequential order, so that they and their metadata are not evicted immediately after the first fetch) would heat up those external caches - and to some extent the "core" filesystem page cache too, though that is of minor concern. When the tree is then walked in the current fully sequential, recursive way, the hot metadata and some hot data would dramatically cut the accumulated latency.
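To make this concrete, here is a minimal sketch of such a best-effort warming phase (in Go for brevity; git itself would of course do this in C against the object store rather than plain files - the paths, worker count, and function name below are placeholders, not anything that exists today):

// Warming phase sketch: read every candidate object once, in shuffled
// order and with bounded concurrency, purely to pull the data and its
// metadata into the external tiers' caches. Errors are ignored because
// warming is strictly best effort.
package warmup

import (
	"io"
	"math/rand"
	"os"
	"sync"
)

func warmCaches(objectPaths []string, workers int) {
	// Shuffle so the IOs do not arrive in directory order; this spreads
	// the load across tiers and avoids a strictly sequential scan
	// evicting freshly promoted blocks right behind itself.
	rand.Shuffle(len(objectPaths), func(i, j int) {
		objectPaths[i], objectPaths[j] = objectPaths[j], objectPaths[i]
	})

	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				if f, err := os.Open(p); err == nil {
					io.Copy(io.Discard, f) // read and discard
					f.Close()
				}
			}
		}()
	}
	for _, p := range objectPaths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
}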


A second phase, with the tree fetched in parallel and sent out serially, would work even better.
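For tar, where the output order is fixed, that second phase could look roughly like the sketch below: workers read objects in parallel while a single writer drains the results strictly in traversal order, and the bounded queue caps how many object contents sit in memory at once - which I think addresses the "filling all the memory" concern. Again, all names are placeholders:

// Ordered pipeline sketch: paths are read concurrently, but emit() is
// called in the original input order. "window" (roughly) bounds the
// number of in-flight objects held in memory.
package pipeline

import "os"

func fetchOrdered(paths []string, window int, emit func(data []byte)) {
	results := make(chan chan []byte, window) // ordered queue of pending reads

	go func() {
		for _, p := range paths {
			ch := make(chan []byte, 1)
			results <- ch // blocks once "window" reads are outstanding
			go func(p string, ch chan []byte) {
				data, _ := os.ReadFile(p) // error handling elided in this sketch
				ch <- data
			}(p, ch)
		}
		close(results)
	}()

	// Single writer: a slow, cold object at the head of the queue only
	// stalls the output, not the parallel reads queued behind it.
	for ch := range results {
		emit(<-ch)
	}
}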

Also, at least for ZIP (not so much for TAR), objects residing in different subdirectories can be stored in any order - they only need to be referenced properly in the central directory. So whenever a worker thread has finished reading a (sufficiently small) object into (git process) memory, it can be handed to the ZIP writer thread immediately. The result is that small and hot files (which read in quickly) end up at the beginning of the zip file, while the parallel threads are already reading in larger and colder objects. The wait time within an individual worker thread reading those objects may be slightly higher, but because many objects are read in parallel, the total time to create the archive is minimized.
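As an illustration, here is roughly what that writer arrangement could look like on top of Go's archive/zip, whose Close() emits the central directory with the offsets at which entries actually landed; entry names and paths are placeholders and error handling is mostly elided:

// Unordered zip sketch: workers read objects in parallel and append
// each one to the archive as soon as it is fully in memory, so the
// entry order is completion order, not traversal order. The central
// directory written by Close() makes any such order valid.
package zipout

import (
	"archive/zip"
	"os"
	"sync"
)

func archiveUnordered(out *os.File, paths []string, workers int) error {
	zw := zip.NewWriter(out)
	var mu sync.Mutex // the zip stream itself stays single-writer
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				data, err := os.ReadFile(p) // the slow, parallel part
				if err != nil {
					continue
				}
				mu.Lock() // the fast, serialized append
				if w, err := zw.Create(p); err == nil {
					w.Write(data)
				}
				mu.Unlock()
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return zw.Close() // writes the central directory
}

(In a real implementation the deflate compression would presumably happen outside the lock as well - e.g. by handing pre-compressed data to the writer - so that only the raw append is serialized.)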

In short, with ZIP there is no real need for the recursive traversal of the trees and objects to deliver them in sequence at all. (Besides, the sequence may be determined by the underlying filesystem, which is not necessarily guaranteed to provide a trivially sorted list of any kind - only that the ordering of files within a directory is stable.)

Hope this clarifies the background of this request.

Best regards,
  Richard




