Re: [PATCH 1/1] Initial draft proposal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Shubham,

On Sun, Apr 10, 2022 at 12:13:50AM +0530, Shubham Mishra wrote:
> Initial draft proposal for reachability bitmap

Thanks so much for your interest in this project and the time you've
spent putting together this proposal. Sorry that it has taken me a few
days to get back to you. Let's take a look...

(I should mention that this shouldn't be reviewed to be picked up by
Junio, so it would be fine to just send the contents of "proposal.txt"
as an email to the list directly instead of a patch. But it's fine to
review it either way, just stating this in case other readers aren't
sure).

> About Me
>
> I am Shubham, I am currently working as a Software Engineer at
> Microsoft India. I am a 2021 graduate from Delhi University. I am
> passionate about core engineering and backend technologies. I love to
> see beyond all abstractions and how things really work under the hood.
> So, I can work from their roots and make things better. I feel
> engineering is all about the tradeoffs that we make and I am trying to
> learn them to become a better Engineer.
>
> I am passionate about open source technologies and have quite a good
> amount of contribution to them, I participated in GSoC 2020 with
> [KDE](https://summerofcode.withgoogle.com/archive/2020/projects/6473982317953024),
> did Internship with Linux Foundation -
> [HDV](https://wiki.lfnetworking.org/display/LN/HDV), Season of KDE -
> [report](https://community.kde.org/SoK/2020/StatusReport/Shubham), and
> I am doing voluntary contributions to VSperf, BoostC++ and some other
> open-source projects.

Great, that all sounds like terrific experience! Typically GSoC only
allows accepted contributors to participate once, but an exception has
been made for participants in the 2020 and 2021 cycles, so you would be
eligible here under those rules.

   https://developers.google.com/open-source/gsoc/faq#what_are_the_eligibility_requirements_for_participation

> Project Abstract
>
> During repository clones, the Git server needs to find out all the
> objects which clients do not have and need to be sent to the client.
>
> To make the process faster, Git uses bitmaps to quickly find all the
> related objects from an object. Bitmap approach is a performance
> optimization over the legacy "Counting Objects" - the process in which
> the git server used to iterate through the graph from branch tips to
> the beginning of history to list down all objects that need to be
> sent.
>
> bitmap made reachability faster but uncompressed bitmaps can cost a
> lot of extra storage. Git uses a C ported version of
> [EWAHBoolArray](https://github.com/lemire/EWAHBoolArray) to compress
> bitmaps which get stored in the ".bitmap" file with the same prefix
> "sha" as ".pack" and ".idx".

Right; there are more than 8.5M objects in my (very outdated copy of
linux.git), so it would be impractical to store an 8.5M-bit array for
each selected commit.

> The aim of the project is to design a performance test suite as well
> as do the necessary changes to improve bitmap performance by trying
> out a new compression scheme that can make read operations along with
> other common operations like intersect, union and negate faster.

Well stated. Like I mentioned in another similar proposal, I think
validating our assumption that the performance of bitmap decompression
is worth pursuing is a good first step. We may find that using a
different compression scheme:

  - makes it faster to read bitmaps (by taking less time to decompress
    them)
  - makes it faster to write them (by taking less time to compress the
    bit-arrays)
  - use less memory for one, the other, or both of the above

, alternatively, we may find that switching EWAH for something else
doesn't make a meaningful dent in any of those areas, in which case we
can find something else to work on, too ;-).

> Me & Git:
>
> ## Microproject:
>
> I worked on the microproject "Avoid pipes in git related commands in
> test scripts", the patches for it has been merged to master now
>
>   * https://public-inbox.org/git/20220224054720.23996-3-shivam828787@xxxxxxxxx/
>   * https://public-inbox.org/git/20220224054720.23996-3-shivam828787@xxxxxxxxx/
>
> I run a pattern matching grep to find all git commands on LHS of pipes
> and fix all of them from file t001-t050.
>
> As an outcome of this process, I got to learn the code review process
> at git work, which is quite cool and different from other
> organization's I contributed to before.
>
> I learned about building source code, running tests, using email to
> send patches, communicating with reviewers and sending the next patch
> version process.

Terrific; I was glad to see that you were able to work on modernizing
those test scripts by removing git invocations on the left-hand side of
a pipe. Hopefully that gave you some good hands-on experience with our
development process, which it sounds like it did.

> ## Current understanding:
>
> - I have gone through git internals, and I well understood about the
>   pack files as well as the difference between git objects (tree, blob,
>   commit).
>
> - I have gone through some documentations - "MyFirstObjectWalk", etc.
>   it was a good hands-on to get some glimpse of general object related
>   tasks.
>
> - I understand how bitmap works in general, I have got some idea how
>   EWAH compression works and also I have gone through the research paper
>   on roaring run.
>
> - I played with commands of pack-object - "git pack-objects dir
>   --progress < obj_lists.txt" and read the code of related files
>   "pack-bitmap.c" and parts of "pack-object.c"

Great; if you are looking for more areas to dig in, I'd recommend
reading:

    - pack-bitmap-writer.c
    - midx.c (the bitmap-related parts)
    - and spending extra time with pack-objects.c (as you mention you
      already have, but this is the best place to look for how
      pack-objects uses bitmaps to accelerate its work).

There are other parts of the code that use bitmaps, too, for some more
straightforward tasks. See for e.g., the parts of builtin/rev-list.c
that match a search for "bitmap".

> Execution plan:
>
> I am interested in keeping my primary focus on "building a performance
> suite and improving bitmaps performance by finding a better
> compression scheme" project and if I finish this early or even after
> the GSoC timeline, I will be happy to contribute to other tasks too.

I think that's a good strategy. Like I mentioned in the other proposal,
too, the goal behind having a few different sub-projects to pick from is
that it gives you some additional flexibility to choose what is most
interesting to you. It definitely isn't meant to suggest that we should
do all of the sub-projects within, since I think any one of them done
right could easily fill up the entire GSoC contribution period.

> [...] I do not have enough knowledge yet to figure out how
> compatible croaring is with the current .bitmap format. We might need
> to make changes in the current .bitmap format accordingly.

Good summary in the parts I elided with the [...] in the quoted section
above.

I think that we will likely want to introduce a new bitmap format. We
*could* use the existing format, and indicate to older readers that we
are using a newer format by incrementing the version identifier and
otherwise leaving the format unchanged.

But I would like to get away from the existing .bitmap format, since the
way we read the hash-cache extension makes it difficult to add things
like a variable-width table of contents. So it would be nice to make the
.bitmap format compatible with the new-ish chunk-format API that Stolee
worked on a year or two ago.

Of course, this will be a not-insignificant project in and of itself. It
also carries some notable downside that JGit won't be able to read the
newer format of bitmaps. We could consider expanding the GSoC project to
help JGit add support for that, but I don't think it's strictly necessary
for you to take that on, since JGit should gracefully ignore the unknown
version number.

> Steps I will be following to accomplish the task-
>
> 1\. Get a better understanding of bitmap related functionalities/
> codebase, EWAH, roaring+run techniques.
>
> 2\. try to build out an initial draft version implementing only
> minimal required core changes, I will try to get a review on it from a
> wider audience (including mentors).
>
> 3\. make changes according to the comment and repeat the review
> process.
>
> 4\. build a performance suite for the changes I have done in the above
> steps.

It may be worth making this occur after the first step in your proposal,
since it removes bias from when you design the performance tests and
keeps us honest about whether or not our changes are worth making.

(Please don't worry about the possibility of replacing EWAH compression
not helping bitmap performance; if it doesn't, there is no shortage of
other things that we can work on in this area ;-)).

> 6\. Until this time, I will also get a good understanding of the
> bitmap related projects, so if we will be able to make good progress
> on roaring+run. I can start picking other subprojects too like 'table
> of contents' for the .bitmap file where past work -
> [https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/](https://lore.kernel.org/git/YNuiM8TR5evSeNsN@nand.local/)
> can act as a good reference to me or/and 'append-only bitmap
> generation' subproject.

I think it makes sense to pick one or the other, since both will be
fairly involved. Like I said earlier, I think we'll have to investigate
a new .bitmap format for the "table of contents" project, so we may want
to dedicate some time to sketching out what that would look like.

> Timeline
>
> I am available to dedicate around 30-35 hours every week to the project.

That seems reasonable to me.

> Community bonding periods -
>
> 1.  Exploring code base mostly related to bitmaps
> 2.  research on other bitmap compression techniques
> 3.  reading technical documents
> 4.  interacting with mentors to understand them and project in more detail.
>
> 12 June - 25 July:
>
> 1.  Writing the first version with the new compression technique
> 2.  Get the initial version reviewed by reviewers and make changes accordingly.
> 3.  write performance test for these changes
>
> 25 July - 4 sept
>
> 1.  if tests result well, extending the above functionality to completely move to a new technique.
> 2.  writing more performance tests.
> 3.  start picking up other tasks if time left
>
> Sept 5 - Sept 12
>
> 1.  I will make sure to get all changes merged before this week including tests
> 2.  if not, make a decision with mentior on extending the project

This looks reasonable-ish to me, though I would suggest a couple of
changes:

  - Dedicate some time to investigating whether or not a new .bitmap
    format is required up front, adjusting the rest of the schedule
    accordingly if it is.

  - Move the writing the performance tests up much earlier, since I
    think it makes sense to start there, rather than to wait until the
    end to start evaluating our changes (and it removes some degree of
    bias, like I mentioned earlier).

I'm not too attached to the particular dates, TBH, and much more
interested in the general pace and priority of things.

> Blog
>
> I want to make blog writing a habit so I planned to publish biweekly
> blogs at [https://shubham828.github.io/](https://shubham828.github.io/)
> during this GSoC and after that. This is something I started during my
> last GSoC too but unfortunately couldn't continue post-gsoc. This GSoC
> gives me another opportunity to become a regular blogger.

Wonderful, I'll look forward to reading your blog posts in this area if
we end up working on it.

> After GSoC
>
> I would love to be a regular contributor of Git. After GSoC, I can
> pick any left out subprojects of bitmap reachability and I would also
> be happy to extend my knowledge beyond bitmaps to learn and contribute
> to other parts of git too.

That's great!

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux