On Sun, Apr 7, 2019 at 7:52 PM Christian Couder <christian.couder@xxxxxxxxx> wrote: > > Hi Matheus > > On Sun, Apr 7, 2019 at 10:48 PM Matheus Tavares Bernardino > <matheus.bernardino@xxxxxx> wrote: > > > > This is my proposal for GSoC with the subject "Make pack access code > > thread-safe". > > Thanks! > > > I'm late in schedule but I would like to ask for your > > comments on it. Any feedback will be highly appreciated. > > > > The "rendered" version can be seen here: > > https://docs.google.com/document/d/1QXT3iiI5zjwusplcZNf6IbYc04-9diziVKdOGkTHeIU/edit?usp=sharing > > Thanks for the link! > > > Besides administrative questions and contributions to FLOSS projects, at > > FLUSP, I’ve been mentoring people who want to start contributing to the > > Linux Kernel and now, to Git, as well. > > Nice! Do you have links about that? Unfortunately not :( Maybe just the mentoring slides (e.g. https://flusp.ime.usp.br/materials/Kernel_Primeiros_Passos.pdf). But they are all in Portuguese, so I don't know wether it would be valuable to add them here... > > # The Project > > > > As direct as possible, the goal with this project is to make more of > > Git’s codebase thread-safe, so that we can improve parallelism in > > various commands. The motivation behind this are the complaints from > > developers experiencing slow Git commands when working with large > > repositories[1], such as chromium and Android. And since nowadays, most > > personal computers have multi-core CPUs, it is a natural step trying to > > improve parallel support so that we can better use the available resources. > > > > With this in mind, pack access code is a good target for improvement, > > since it’s used by many Git commands (e.g., checkout, grep, blame, diff, > > log, etc.). This section of the codebase is still sequential and has > > many global states, which should be protected before we can work to > > improve parallelism. > > I think it's better if global state can be made local or perhaps > removed, rather than protected (though of course that's not always > possible). Indeed! I just added this to the docs version. Thanks > > ## The Pack Access Code > > > > To better describe what the pack access code is, we must talk about > > Git’s object storing (in a simplified way): > > Maybe s/storing/storage/ Thanks. Already changed. > > Besides what are called loose objects, > > s/loose object/loose object files/ Done, thanks! > > Git has a very optimized mechanism to compactly store > > objects (blobs, trees, commits, etc.) in packfiles[2]. These files are > > created by[3]: > > > > 1. listing objects; > > 2. sorting the list with some good heuristics; > > 3. traversing the list with a sliding window to find similar objects in > > the window, in order to do delta decomposing; > > 4. compress the objects with zlib and write them to the packfile. > > > > What we are calling pack access code in this document, is the set of > > functions responsible for retrieving the objects stored at the > > packfiles. This process consists, roughly speaking, in three parts: > > > > 1. Locate and read the blob from packfile, using the index file; > > 2. If the blob is a delta, locate and read the base object to apply the > > delta on top of it; > > 3. Once the full content is read, decompress it (using zlib inflate). > > > > Note: There is a delta cache for the second step so that if another > > delta depends on the same base object, it is already in memory. This > > cache is global; also, the sliding windows, are global per packfile. > > Yeah, but the sliding windows are used only when creating pack files, > not when reading them, right? > > > If these steps were thread-safe, the ability to perform the delta > > reconstruction (together with the delta cache lookup) and zlib inflation > > in parallel could bring a good speedup. At git-blame, for example, > > 24%[4] of the time is spent in the call stack originated at > > read_object_file_extended. Not only this but once we have this big > > section of the codebase thread-safe, we can work to parallelize even > > more work at higher levels of the call stack. Therefore, with this > > project, we aim to make room for many future optimizations in many Git > > commands. > > Nice. > > > # Plan > > > > I will probably be working mainly with packfile.c, sha1-file.c, > > object-store.h, object.c and pack.h, however, I may also need to tackle > > other files. I will be focusing on the following three pack access call > > chains, found in git-grep and/or git-blame: > > > > read_object_file → repo_read_object_file → read_object_file_extended → > > read_object → oid_object_info_extended → find_pack_entry → > > fill_pack_entry → find_pack_entry_one → bsearch_pack and > > nth_packed_object_offset > > > > oid_object_info → oid_object_info_extended → <same as previous> > > > > read_object_with_reference → read_object_file → <same as previous> > > > > Ideally, at the end of the project, it will be possible to call > > read_object_file, oid_object_info and read_object_with_reference with > > thread-safety, so that these operations can be, latter, performed in > > parallel. > > > > Here are some threads on Git’s mailing list where I started discussing > > my project: > > > > * https://public-inbox.org/git/CAHd-oW7onvn4ugEjXzAX_OSVEfCboH3-FnGR00dU8iaoc+b8=Q@xxxxxxxxxxxxxx/ > > * https://public-inbox.org/git/20190402005245.4983-1-matheus.bernardino@xxxxxx/#t > > > > And also, a previous attempt to make part of the pack access code > > thread-safe which I may use as a base: > > > > * https://public-inbox.org/git/20140212015727.1D63A403D3@xxxxxxxxxxxxxxxxxxxxxxxxx/#Z30builtin:gc.c > > Nice. > > > # Points to work on > > > > * Investigate pack access call chains and look for non-thread-safe > > operations on then. > > * Protect packfile.c read-and-write global variables, such as > > pack_open_windows, pack_open_fds and etc., using mutexes. > > Do you want to work on making both packfile reading and packfile > writing thread safe? Or just packfile reading? I plan to work on packfile reading, only. > If some variables are used for both reading and writing packfiles, do > you plan to protect them only when they are used for reading? Hm, I haven't thought of that before. But indeed, if they are used for both, I think I should protect them in both cases. > The rest of your proposal looks very good to me. Please make sure you > upload this or an updated version soon to the GSoC web site. Thanks, Christian. I will work on the final points today and submit it. > Thanks, > Christian.