On Thu, Oct 3, 2013 at 1:43 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes:
>
>> The use case is
>>
>> tar -xzf bigproject.tar.gz
>> cd bigproject
>> git init
>> git add .
>> # git grep or something
>
> Two obvious thoughts, and a half.
>
> (1) This particular invocation of "git add" can easily detect that
>     it is run in a repository with no $GIT_INDEX_FILE yet, which is
>     the most typical case for a big initial import. It could even
>     ask if the current branch is unborn if you wanted to make the
>     heuristic more specific to this use case. Perhaps it would
>     make sense to automatically plug the bulk import machinery in
>     such a case without an option?

Yeah! I did not even think of that. Something like the first sketch
at the end of this mail, perhaps.

> (2) Imagine performing a dry-run of update_files_in_cache() using a
>     different diff-files callback that is similar to the
>     update_callback() but that uses the lstat(2) data to see how
>     big an import this really is, instead of calling
>     add_file_to_index(), before actually registering the data to
>     the object database. If you benchmark to see how expensive it
>     is, you may find that such a scheme might be a workable
>     auto-tuning mechanism to trigger this. Even if it were
>     moderately expensive, when combined with the heuristics above
>     for (1), it might be a worthwhile thing to do only when it is
>     likely to be an initial import.

We do a lot of lstat(2) calls nowadays to refresh the index, so it's
likely reasonably cheap, but I doubt people do a mass update of
existing files very often. Adding a large number of new files (even
when .git/index exists) may be a better indication of an import, and
we already have that information from fill_directory(). A dry-run
sizing callback could look like the second sketch below.

For the no-.git/index case, packing with bulk-checkin probably
produces a reasonably good pack, because freshly imported files don't
delta well anyway: there are no previous versions to delta against.
They could delta against other files, but I don't think we'd get good
compression out of that.

For the case where .git/index exists, we may interfere with "git gc
--auto". We create a less optimal pack, but it's still a pack and may
push the next auto-gc further away.

> (3) Is it always a good idea to send everything to a packfile on a
>     large addition, or are you often better off importing the
>     initial fileset as loose objects? If the latter, then the
>     option name "--bulk" may give users a wrong hint "if you are
>     doing a bulk-import, you are better off using this option".

Hard question. Fewer files are definitely a good thing, for example
when you "rm -rf" the whole thing :-) Disk usage is another: on
gdb-7.3.1, "du -sh" reports that .git with loose objects takes 57M,
while the packed one takes 29M. Disk access is also slightly faster
on the packed .git, at least for "git grep --cached": 0.71s vs 0.83s.

> This is a very logical extension to what was started at 568508e7
> (bulk-checkin: replace fast-import based implementation,
> 2011-10-28), and I like it. I suspect "--bulk=<threshold>" might
> be a better alternative than setting the threshold unconditionally
> to zero, though.

The threshold may be better in the form of a config setting, because
I wouldn't want to specify it on every invocation (third sketch
below). But does one really want to keep some small files around in
loose format? I don't see it, because my goal is to keep a clean
.git, but maybe the loose format has some advantages.
--
Duy
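
P.S. A minimal sketch of the heuristic in (1), against the internal
API as I understand it; looks_like_initial_import() is a hypothetical
helper, not something that exists in the tree:

    #include "cache.h"
    #include "dir.h"

    /*
     * Treat "no index file yet, and HEAD does not resolve" as a
     * likely initial import; cmd_add() could then call
     * plug_bulk_checkin() on its own, without any option.
     */
    static int looks_like_initial_import(void)
    {
            unsigned char sha1[20];

            if (file_exists(get_index_file()))
                    return 0;  /* an index already exists */
            if (!get_sha1("HEAD", sha1))
                    return 0;  /* HEAD resolves; not an unborn branch */
            return 1;
    }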
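
P.P.S. The dry-run sizing idea from (2) would be a diff-files
callback shaped like update_callback() in builtin/add.c, except that
it only adds up lstat(2) sizes instead of calling
add_file_to_index(). Again just a sketch, not the actual patch;
sizing_callback() and the off_t accumulator are made up:

    #include "cache.h"
    #include "diff.h"
    #include "diffcore.h"

    static void sizing_callback(struct diff_queue_struct *q,
                                struct diff_options *opt, void *cbdata)
    {
            off_t *total = cbdata;
            int i;

            for (i = 0; i < q->nr; i++) {
                    const char *path = q->queue[i]->one->path;
                    struct stat st;

                    /* total up worktree sizes; skip unreadable paths */
                    if (!lstat(path, &st) && S_ISREG(st.st_mode))
                            *total += st.st_size;
            }
    }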
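
P.P.P.S. For the threshold-as-config idea, the wiring could hang off
the existing add_config() in builtin/add.c. "add.bulkthreshold" is a
made-up key name just for illustration; git_config_ulong() is the
existing parser for size-style values like "1m":

    #include "cache.h"

    static unsigned long bulk_threshold; /* 0 would mean "pack everything" */

    static int add_config(const char *var, const char *value, void *cb)
    {
            if (!strcmp(var, "add.bulkthreshold")) {
                    bulk_threshold = git_config_ulong(var, value);
                    return 0;
            }
            return git_default_config(var, value, cb);
    }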