Re: git --archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 24.09.22 um 13:34 schrieb Scheffenegger, Richard:
>
>> If I/O latency instead of CPU usage is the limiting factor and
>> prefetching would help then starting git grep or git archive in the
>> background might work.  If the order of visited blobs needs to be
>> randomized then perhaps something like this would be better:
>>
>> git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file
>> --batch >/dev/null
>
> Isn't the 2nd git, receiving input from stdin, running
> single-threaded?

Yes.

> Maybe
>
> Git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4
> - splitted ; for i in $(ls splitted????) ; do "git cat-file --batch >
> /dev/null &"; done; rm -f splitted????
>
> To parallelize the reading of the objects?

Sure, but in a repository with 100000 files you'd end up with 1000
parallel processes, which may be a few too many.  Splitting the list
into similar-sized parts based on a given degree of parallelism is
probably more practical.

It could be done by relying on the randomness of the object IDs and
partitioning by a sub-string.  Or perhaps using pseudo-random numbers
is sufficient:

   git ls-tree -r HEAD |
   awk '{print $3}' |
   sort |
   awk -v pieces=8 -v prefix=file '
      {
         piece = int(rand() * pieces)
         filename = prefix piece
         print $0 > filename
      }'

So how much does such a warmup help in your case?

>> No idea how to randomize the order of tree object visits.
>
> To heat up data caches, the order of objects visited is not relevant,
> the order or IOs issued to the actual object is relevant.

What's the difference?

NB: When I wrote "tree objects" I meant the type of objects from Git's
object store (made up of packs and loose files) that represent
sub-directories, and with "visit" I meant reading them to traverse the
hierarchy of Git blobs and trees.

Here's an idea after all: Using "git ls-tree" without "-r" and handling
recursing in the prefetch script would allow traversing trees in a
different order and even in parallel.  Not sure how to limit parallelism
to a sane degree.

René




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux