On Thu, Oct 3, 2013 at 1:43 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes:
>
>> The use case is
>>
>> tar -xzf bigproject.tar.gz
>> cd bigproject
>> git init
>> git add .
>> # git grep or something
>
> Two obvious thoughts, and a half.
>
> (1) This particular invocation of "git add" can easily detect that
>     it is run in a repository with no $GIT_INDEX_FILE yet, which is
>     the most typical case for a big initial import. It could even
>     ask if the current branch is unborn if you wanted to make the
>     heuristic more specific to this use case. Perhaps it would
>     make sense to automatically plug the bulk import machinery in
>     such a case without an option?

Yeah! I did not even think of that. Something like the first sketch
at the end of this mail, perhaps.

> (2) Imagine performing a dry-run of update_files_in_cache() using a
>     different diff-files callback that is similar to the
>     update_callback() but that uses the lstat(2) data to see how
>     big an import this really is, instead of calling
>     add_file_to_index(), before actually registering the data to
>     the object database. If you benchmark to see how expensive it
>     is, you may find that such a scheme might be a workable
>     auto-tuning mechanism to trigger this. Even if it were
>     moderately expensive, when combined with the heuristics above
>     for (1), it might be a worthwhile thing to do only when it is
>     likely to be an initial import.

We do a lot of lstat(2) calls nowadays to refresh the index, so it's
likely reasonably cheap, but I doubt people do a mass update of
existing files very often. Adding a large number of new files (even
when .git/index exists) may be a better indication of an import, and
we already have that information from fill_directory(). A dry-run
sizing callback could look like the second sketch below.

For the no-.git/index case, packing with bulk-checkin probably
produces a reasonably good pack, because freshly imported files don't
delta well anyway: there are no previous versions to delta against.
They could delta against other files, but I don't think we'd get good
compression out of that.

For the case where .git/index exists, we may interfere with "git gc
--auto". We create a less optimal pack, but it's still a pack and may
push the next auto-gc further away.

> (3) Is it always a good idea to send everything to a packfile on a
>     large addition, or are you often better off importing the
>     initial fileset as loose objects? If the latter, then the
>     option name "--bulk" may give users a wrong hint "if you are
>     doing a bulk-import, you are better off using this option".

Hard question. Fewer files are definitely a good thing, for example
when you "rm -rf" the whole thing :-) Disk usage is another: on
gdb-7.3.1, "du -sh" reports that .git with loose objects takes 57M,
while the packed one takes 29M. Disk access is also slightly faster
on the packed .git, at least for "git grep --cached": 0.71s vs 0.83s.

> This is a very logical extension to what was started at 568508e7
> (bulk-checkin: replace fast-import based implementation,
> 2011-10-28), and I like it. I suspect "--bulk=<threshold>" might
> be a better alternative than setting the threshold unconditionally
> to zero, though.

The threshold may be better in the form of a config setting, because
I wouldn't want to specify it on every invocation (third sketch
below). But does one really want to keep some small files around in
loose format? I don't see it, because my goal is to keep a clean
.git, but maybe the loose format has some advantages.
--
Duy
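
P.S. A minimal sketch of the heuristic in (1), against the internal
API as I understand it; looks_like_initial_import() is a hypothetical
helper, not something that exists in the tree:

    #include "cache.h"
    #include "dir.h"

    /*
     * Treat "no index file yet, and HEAD does not resolve" as a
     * likely initial import; cmd_add() could then call
     * plug_bulk_checkin() on its own, without any option.
     */
    static int looks_like_initial_import(void)
    {
            unsigned char sha1[20];

            if (file_exists(get_index_file()))
                    return 0;  /* an index already exists */
            if (!get_sha1("HEAD", sha1))
                    return 0;  /* HEAD resolves; not an unborn branch */
            return 1;
    }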
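
P.P.S. The dry-run sizing idea from (2) would be a diff-files
callback shaped like update_callback() in builtin/add.c, except that
it only adds up lstat(2) sizes instead of calling
add_file_to_index(). Again just a sketch, not the actual patch;
sizing_callback() and the off_t accumulator are made up:

    #include "cache.h"
    #include "diff.h"
    #include "diffcore.h"

    static void sizing_callback(struct diff_queue_struct *q,
                                struct diff_options *opt, void *cbdata)
    {
            off_t *total = cbdata;
            int i;

            for (i = 0; i < q->nr; i++) {
                    const char *path = q->queue[i]->one->path;
                    struct stat st;

                    /* total up worktree sizes; skip unreadable paths */
                    if (!lstat(path, &st) && S_ISREG(st.st_mode))
                            *total += st.st_size;
            }
    }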
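
P.P.P.S. For the threshold-as-config idea, the wiring could hang off
the existing add_config() in builtin/add.c. "add.bulkthreshold" is a
made-up key name just for illustration; git_config_ulong() is the
existing parser for size-style values like "1m":

    #include "cache.h"

    static unsigned long bulk_threshold; /* 0 would mean "pack everything" */

    static int add_config(const char *var, const char *value, void *cb)
    {
            if (!strcmp(var, "add.bulkthreshold")) {
                    bulk_threshold = git_config_ulong(var, value);
                    return 0;
            }
            return git_default_config(var, value, cb);
    }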