Re: [GSoC][RFC] Proposal: Make pack access code thread-safe

Christian Couder <christian.couder@xxxxxxxxx> · Mon, 8 Apr 2019 00:52:08 +0200

Hi Matheus

On Sun, Apr 7, 2019 at 10:48 PM Matheus Tavares Bernardino
<matheus.bernardino@xxxxxx> wrote:
>
> This is my proposal for GSoC with the subject "Make pack access code
> thread-safe".

Thanks!

> I'm late in schedule but I would like to ask for your
> comments on it. Any feedback will be highly appreciated.
>
> The "rendered" version can be seen here:
> https://docs.google.com/document/d/1QXT3iiI5zjwusplcZNf6IbYc04-9diziVKdOGkTHeIU/edit?usp=sharing

Thanks for the link!

> Besides administrative questions and contributions to FLOSS projects, at
> FLUSP, I’ve been mentoring people who want to start contributing to the
> Linux Kernel and now, to Git, as well.

Nice! Do you have links about that?

> # The Project
>
> As direct as possible, the goal with this project is to make more of
> Git’s codebase thread-safe, so that we can improve parallelism in
> various commands. The motivation behind this are the complaints from
> developers experiencing slow Git commands when working with large
> repositories[1], such as chromium and Android. And since nowadays, most
> personal computers have multi-core CPUs, it is a natural step trying to
> improve parallel support so that we can better use the available resources.
>
> With this in mind, pack access code is a good target for improvement,
> since it’s used by many Git commands (e.g., checkout, grep, blame, diff,
> log, etc.). This section of the codebase is still sequential and has
> many global states, which should be protected before we can work to
> improve parallelism.

I think it's better if global state can be made local or perhaps
removed, rather than protected (though of course that's not always
possible).

> ## The Pack Access Code
>
> To better describe what the pack access code is, we must talk about
> Git’s object storing (in a simplified way):

Maybe s/storing/storage/

> Besides what are called loose objects,

s/loose object/loose object files/

> Git has a very optimized mechanism to compactly store
> objects (blobs, trees, commits, etc.) in packfiles[2]. These files are
> created by[3]:
>
> 1. listing objects;
> 2. sorting the list with some good heuristics;
> 3. traversing the list with a sliding window to find similar objects in
> the window, in order to do delta decomposing;
> 4. compress the objects with zlib and write them to the packfile.
>
> What we are calling pack access code in this document, is the set of
> functions responsible for retrieving the objects stored at the
> packfiles. This process consists, roughly speaking, in three parts:
>
> 1. Locate and read the blob from packfile, using the index file;
> 2. If the blob is a delta, locate and read the base object to apply the
> delta on top of it;
> 3. Once the full content is read, decompress it (using zlib inflate).
>
> Note: There is a delta cache for the second step so that if another
> delta depends on the same base object, it is already in memory. This
> cache is global; also, the sliding windows, are global per packfile.

Yeah, but the sliding windows are used only when creating pack files,
not when reading them, right?

> If these steps were thread-safe, the ability to perform the delta
> reconstruction (together with the delta cache lookup) and zlib inflation
> in parallel could bring a good speedup. At git-blame, for example,
> 24%[4] of the time is spent in the call stack originated at
> read_object_file_extended. Not only this but once we have this big
> section of the codebase thread-safe, we can work to parallelize even
> more work at higher levels of the call stack. Therefore, with this
> project, we aim to make room for many future optimizations in many Git
> commands.

Nice.

> # Plan
>
> I will probably be working mainly with packfile.c, sha1-file.c,
> object-store.h, object.c and pack.h, however, I may also need to tackle
> other files. I will be focusing on the following three pack access call
> chains, found in git-grep and/or git-blame:
>
> read_object_file → repo_read_object_file → read_object_file_extended →
> read_object → oid_object_info_extended → find_pack_entry →
> fill_pack_entry → find_pack_entry_one → bsearch_pack and
> nth_packed_object_offset
>
> oid_object_info → oid_object_info_extended → <same as previous>
>
> read_object_with_reference → read_object_file → <same as previous>
>
> Ideally, at the end of the project, it will be possible to call
> read_object_file, oid_object_info and read_object_with_reference with
> thread-safety, so that these operations can be, latter, performed in
> parallel.
>
> Here are some threads on Git’s mailing list where I started discussing
> my project:
>
> * https://public-inbox.org/git/CAHd-oW7onvn4ugEjXzAX_OSVEfCboH3-FnGR00dU8iaoc+b8=Q@xxxxxxxxxxxxxx/
> * https://public-inbox.org/git/20190402005245.4983-1-matheus.bernardino@xxxxxx/#t
>
> And also, a previous attempt to make part of the pack access code
> thread-safe which I may use as a base:
>
> * https://public-inbox.org/git/20140212015727.1D63A403D3@xxxxxxxxxxxxxxxxxxxxxxxxx/#Z30builtin:gc.c

Nice.

> # Points to work on
>
> * Investigate pack access call chains and look for non-thread-safe
> operations on then.
> * Protect packfile.c read-and-write global variables, such as
> pack_open_windows, pack_open_fds and etc., using mutexes.

Do you want to work on making both packfile reading and packfile
writing thread safe? Or just packfile reading?

If some variables are used for both reading and writing packfiles, do
you plan to protect them only when they are used for reading?

The rest of your proposal looks very good to me. Please make sure you
upload this or an updated version soon to the GSoC web site.

Thanks,
Christian.