RE: Parallelism for submodule update

"Zitzmann, Christian" <Christian.Zitzmann@xxxxxxxxxxx> · Fri, 13 Jan 2023 10:49:34 +0000

Hello Randall,
Yes, I guess it is quite common that the hard disk is much faster than the network services.
With the scalar strategy (e.g. blobless clones), the checkout phase is no longer mainly hard-disk activity; it also fetches the missing sources from the remote, so it puts a heavy load on the network services.
With parallelism we would gain a lot of performance here in particular, because network and hard disk would be utilized at the same time.
This could even be a general strategy (without blobless clones): do fetch and checkout together, but both in a parallel scheme.

Currently it's like this:

Multithreaded (mainly network utilization, only a small amount of data: commits and trees)
	Thread1: Fetch submodule1                       -> NETWORK
	Thread2: Fetch submodule2                       -> NETWORK
	...
	Thread<x>: Fetch Submodule<n>               -> NETWORK

Sequential (alternating harddisk and network utilization)
	Loop1
		Try to Checkout Submodule1 commit                                      -> HARDDISK
		Fetch missing objects (e.g. Blobs - big amount of Data)        -> NETWORK
		Checkout Submodule commit                                                    -> HARDDISK
	Loop2
		Try to Checkout Submodule2 commit                                      -> HARDDISK
		Fetch missing objects (e.g. Blobs - big amount of Data)        -> NETWORK
		Checkout Submodule commit                                                    -> HARDDISK
	...
	Loop<n>
		Try to Checkout Submodule<n> commit                                  -> HARDDISK
		Fetch missing objects (e.g. Blobs - big amount of Data)        -> NETWORK
		Checkout Submodule commit                                                    -> HARDDISK

Here the network accesses in the sequential part have significant waiting times (e.g. for the name service), while local resources are barely utilized.

The proposal is to change it to a fully parallel flow!

Multithreading (both network and harddisk are utilized all the time)
	Thread1:
		Fetch submodule1 (blobless)               -> NETWORK
		Try to Checkout Submodule commit   -> HARDDISK
		Fetch missing objects                            -> NETWORK
		Checkout Submodule commit              -> HARDDISK
	Thread2:
		Fetch submodule2 (blobless)                -> NETWORK
		Try to Checkout Submodule commit   -> HARDDISK
		Fetch missing objects                            -> NETWORK
		Checkout Submodule commit...           -> HARDDISK
	...

	Thread<x>
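
To make the difference between the two flows concrete, here is a minimal Python sketch. It is only a simulation and not git code: the step functions are hypothetical stand-ins that sleep instead of touching network or disk, the "try to checkout" and "checkout" steps are collapsed into one, and all function names are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STEP_DELAY = 0.01  # seconds; placeholder for real network/disk latency


def fetch_commits_and_trees(name):
    time.sleep(STEP_DELAY)
    return (name, "fetch commits+trees", "NETWORK")


def fetch_missing_blobs(name):
    time.sleep(STEP_DELAY)
    return (name, "fetch missing blobs", "NETWORK")


def checkout(name):
    time.sleep(STEP_DELAY)
    return (name, "checkout", "HARDDISK")


def update_current(submodules, jobs):
    """Current flow: fetch in parallel, then check out one by one."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        log = list(pool.map(fetch_commits_and_trees, submodules))
    for name in submodules:  # sequential part
        log.append(fetch_missing_blobs(name))
        log.append(checkout(name))
    return log


def update_one(name):
    """All steps for one submodule, run inside a single worker."""
    return [
        fetch_commits_and_trees(name),
        fetch_missing_blobs(name),
        checkout(name),
    ]


def update_proposed(submodules, jobs):
    """Proposed flow: each worker fetches and checks out its own submodule."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        per_submodule = list(pool.map(update_one, submodules))
    return [step for steps in per_submodule for step in steps]


if __name__ == "__main__":
    subs = [f"submodule{i}" for i in range(1, 9)]
    for flow in (update_current, update_proposed):
        start = time.perf_counter()
        flow(subs, jobs=4)
        print(f"{flow.__name__}: {time.perf_counter() - start:.2f}s")
```

Both flows perform the same steps per submodule; only the scheduling differs. The timings printed by the script reflect the artificial sleeps and say nothing about real repositories.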

The only negative effect I see: with very slow hard disks, or disks that suffer significantly from parallel access, overall performance could also suffer.
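
Because of this, and as Randall suggests, the parallel checkout would best be opt-in. A minimal Python sketch of such a switch follows. Note that the key `submodule.checkoutJobs` is purely hypothetical and does not exist in git today (only `submodule.fetchJobs` does); the function name and the dict-based config are assumptions for illustration as well.

```python
def checkout_jobs(config):
    """Return the worker count for the checkout phase; 1 means sequential."""
    # Git treats section and variable names case-insensitively.
    normalized = {key.lower(): value for key, value in config.items()}
    # Hypothetical key: a missing key keeps today's sequential checkout.
    raw = normalized.get("submodule.checkoutjobs", "1")
    # In this sketch, values below 1 also fall back to sequential.
    return max(1, int(raw))


if __name__ == "__main__":
    print(checkout_jobs({}))
    print(checkout_jobs({"submodule.checkoutJobs": "4"}))
```

With this default, platforms where parallel checkout is counterproductive see no change in behavior unless the switch is set explicitly.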

In general, with the partial clone approach but even with the full clone approach, network and hard disk would be utilized in parallel, and performance can therefore increase.

Best regards

Christian

-----Original Message-----
From: rsbecker@xxxxxxxxxxxxx <rsbecker@xxxxxxxxxxxxx> 
Sent: Monday, January 2, 2023 17:54
To: Zitzmann, Christian <Christian.Zitzmann@xxxxxxxxxxx>; git@xxxxxxxxxxxxxxx
Subject: RE: Parallelism for submodule update 

>-----Original Message-----
>From: <Christian.Zitzmann@xxxxxxxxxxx>
On January 2, 2023 11:45 AM Christian Zitzmann wrote:
>We have been using git for many years, and we also use submodules heavily.
>
>When updating the submodules, only the fetching part is done in 
>parallel (with config submodule.fetchjobs or --jobs) but the checkout 
>is done sequentially
>
>What I’ve noticed when cloning with
>- scalar clone --full-clone --recurse-submodules <URL> or
>- git clone --filter=blob:none --also-filter-submodules 
>--recurse-submodules <URL>
>
>We lose performance, as the fetch of the blobs is done in the 
>sequential checkout part, instead of in the parallel part.
>
>Furthermore, the utilization - without partial clone - of network and 
>harddisk is not always good, as first the network is utilized (fetch) 
>and then the harddisk
>(checkout)
>
>As the checkout part is local to the submodule (no shared resources to 
>block), it would be great if we could move the checkout into the parallelized part.
>E.g. by doing fetch and checkout (with blob fetching) in one step with e.g.
>run_processes_parallel_tr2
>
>I expect that this significantly improves the performance, especially 
>when using partial clones.
>
>Do you think this is possible? Am I missing anything in my thoughts?

Since this is a platform-specific request, if it happens, it should be a configuration switch that defaults to off. On my platform, the file system itself is fairly fast, but the name service traversals and resolutions (what happens in the name service) are a performance problem. Doing the checkout/switch in parallel would actually be counterproductive in my case. So I would keep it off, but I get that other platforms could benefit.

Regards,
Randall

--
Brief whoami: NonStop&UNIX developer since approximately
UNIX(421664400)
NonStop(211288444200000000)
-- In real life, I talk too much.