Re: your mail

On 2020-06-24 at 00:38:39, shejan shuza wrote:
> Hi, I have a repository with about 55GB of contents, with binary files
> that are less than 100MB each (so no LFS mode) from a project which
> has almost filled up an entire hard drive. I am trying to add all of
> the contents to a git repo and push it to GitHub but every time I do
> 
> git add .
> 
> in the folder with my contents after initializing and setting my
> remote, git starts caching all the files to .git/objects, making the
> .git folder grow in size rapidly. All the files are binaries, so git
> cannot stage changes between versions anyway, and there is no reason
> to cache versions.

What you're experiencing is normal; storing files in the .git directory
is how Git keeps track of them.  It can't rely on the copies in your
working tree because you can modify those files at any time, and if you
did so, relying on them would corrupt the repository.
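
If you'd like to see that mechanism directly, a tiny experiment (purely
optional; the file name is just a placeholder) is to write a file into
the object database by hand and read it back:

    git hash-object -w somefile.bin   # stores a blob under .git/objects and prints its ID
    git cat-file -p <that-ID>         # prints the blob back out of the object database

Once an object is in .git/objects, Git no longer depends on the copy in
your working tree.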

Also, note that Git can and does deltify changes between revisions once
the data is packed, regardless of whether the file is binary, but how
well it does so depends on your data.  For example, if it's compressed,
it likely doesn't deltify very well, so storing things like compressed
images or zip files using deflate is generally going to result in a
bloated repository.
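
If you want to see how well your particular data packs, you can force a
repack and compare sizes; these are standard commands, though the exact
flags worth using depend on the repository:

    git repack -ad          # repack all objects into packs, computing deltas where possible
    git count-objects -vH   # report loose and packed sizes in human-readable units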

However, if you don't need versioning for these files, then you don't
need a Git repository.  Git is a tool for versioning files, not a
general storage mechanism.  You may find that a cloud storage bucket or
some other artifact storage service meets your needs better.
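
For example, if you only need the data mirrored somewhere durable, a
plain sync into object storage is usually enough; the bucket name below
is just a placeholder, and this assumes you have the AWS CLI configured
(rclone works similarly for other providers):

    aws s3 sync . s3://my-project-archive   # one-way copy of the directory into a bucket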

I will also tell you from a practical point of view that almost nobody
(including you) will want to host a 55 GB repository filled with binary
blobs.  Repacking such repositories is usually very expensive,
requiring extensive CPU and memory usage for a prolonged time, with
little useful benefit.

> Is there any way, such as editing the git attributes or changing
> something about how files are staged in the git repository, to only
> just add indexes or references to files in the repository rather than
> cache them into the .git folder, while also being able to push all the
> data to GitHub?

This is how Git LFS and similar tools, like git-annex, work.  Git LFS
will still create copies of the objects in your .git directory, though,
at least until they're pushed to the server, at which point they can be
pruned, so it shares this limitation with plain Git.  I'm less familiar
with git-annex, but it is also a popular choice.
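
For reference, the usual Git LFS setup looks roughly like this; the
"*.bin" pattern is only an example, and exactly which local copies
git lfs prune removes depends on its retention settings:

    git lfs install            # enable the LFS filters for this user
    git lfs track "*.bin"      # adjust the pattern to match your binaries
    git add .gitattributes
    git add .
    git commit -m "Add binary assets"
    git push origin main       # or whatever your branch is called
    git lfs prune              # afterwards, drop pushed, no-longer-needed local LFS objects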

However, as mentioned, it sounds like you don't need versioning at all,
so unless you do, Git with Git LFS will be no more suitable for this
than plain Git.  If that's the case, I encourage you to explore
alternate solutions.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
