Re: [PATCH v2 4/4] git-p4: add support for large file systems

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 03 Sep 2015, at 22:03, Luke Diamand <luke@xxxxxxxxxxx> wrote:

> On 03/09/15 17:35, larsxschneider@xxxxxxxxx wrote:
>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>> 
>> Perforce repositories can contain large (binary) files. Migrating these
>> repositories to Git generates very large local clones. External storage
>> systems such as LFS [1] or git-annex [2] try to address this problem.
>> 
>> Add a generic mechanism to detect large files based on extension,
>> uncompressed size, and/or compressed size. Add LFS as example
>> implementation.
> 
> Can you split this into "add mechanism for large file support" and then "add LFS implementation".
> 
> It will make it easier to review if nothing else.
Fun fact: I had it that way first and then decided against it. I split it this way:

commit 1: generic impl + docs
commits 2: lfs impl + TC

OK?

> 
> Some other comments inline.
> 
> Thanks!
> Luke
> 
>> 
>> [1] https://git-lfs.github.com/
>> [2] http://www.git-annex.org/
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
>> ---
>>  Documentation/git-p4.txt |  21 ++++
>>  git-p4.py                | 126 +++++++++++++++++++++--
>>  t/t9823-git-p4-lfs.sh    | 263 +++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 401 insertions(+), 9 deletions(-)
>>  create mode 100755 t/t9823-git-p4-lfs.sh
>> 
>> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
>> index 82aa5d6..eac5bad 100644
>> --- a/Documentation/git-p4.txt
>> +++ b/Documentation/git-p4.txt
>> @@ -510,6 +510,27 @@ git-p4.useClientSpec::
>>  	option '--use-client-spec'.  See the "CLIENT SPEC" section above.
>>  	This variable is a boolean, not the name of a p4 client.
>> 
>> +git-p4.largeFileSystem::
>> +	Specify the system that is used used for large (binary) files. Only
>> +	"LFS" [1] is supported right now. Download and install the Git LFS
>> +	command line extension to use this option.
>> +	[1] https://git-lfs.github.com/
>> +
>> +git-p4.largeFileExtensions::
>> +	All files matching a file extension in the list will be processed
>> +	by the large file system. Do not prefix the extensions with '.'.
>> +
>> +git-p4.largeFileThreshold::
>> +	All files with an uncompressed size exceeding the threshold will be
>> +	processed by the large file system. By default the threshold is
>> +	defined in bytes. Add the suffix k, m, or g to change the unit.
>> +
>> +git-p4.largeFileCompressedThreshold::
>> +	All files with an compressed size exceeding the threshold will be
> 
> s/an compressed/a compressed/
ok

> 
>> +	processed by the large file system. This option might significantly
>> +	slow down your clone/sync process. By default the threshold is
>> +	defined in bytes. Add the suffix k, m, or g to change the unit.
>> +
>>  Submit variables
>>  ~~~~~~~~~~~~~~~~
>>  git-p4.detectRenames::
>> diff --git a/git-p4.py b/git-p4.py
>> index 4d78e1c..cde75a5 100755
>> --- a/git-p4.py
>> +++ b/git-p4.py
>> @@ -22,6 +22,8 @@ import platform
>>  import re
>>  import shutil
>>  import stat
>> +import zipfile
>> +import zlib
>> 
>>  try:
>>      from subprocess import CalledProcessError
>> @@ -922,6 +924,51 @@ def wildcard_present(path):
>>      m = re.search("[*#@%]", path)
>>      return m is not None
>> 
>> +def largeFileSystem():
>> +    try:
>> +        return getattr(sys.modules[__name__], gitConfig('git-p4.largeFileSystem'))
>> +    except AttributeError as e:
>> +        die('Large file system not supported: %s' % gitConfig('git-p4.largeFileSystem'))
>> +
>> +class LFS:
> 
> Docstrings?
OK.

> 
> 
>> +    @staticmethod
>> +    def description():
>> +        return 'LFS (see https://git-lfs.github.com/)'
>> +
>> +    @staticmethod
>> +    def attributeFilter():
>> +        return 'lfs'
>> +
>> +    @staticmethod
>> +    def generatePointer(cloneDestination, relPath, contents):
>> +        # Write P4 content to temp file
>> +        p4ContentTempFile = tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False)
> 
> delete=False, doesn't that mean that if anything goes wrong, we leave large files lying around? Perhaps that doesn't matter, I don't know.
delete=True means the file is deleted as soon as it is closed
source: https://docs.python.org/2/library/tempfile.html
I don’t close the file (sounds like a bug?!) so in theory delete=True would work. However, I am moving the file a few lines below.

> 
>> +        for d in contents:
>> +            p4ContentTempFile.write(d)
>> +        p4ContentTempFile.flush()
> 
> These seems like behaviour that could live in the base class or the main git-p4 code. It doesn't look LFS-specific.
ok

> 
>> +
>> +        # Generate LFS pointer file based on P4 content
>> +        lfsProcess = subprocess.Popen(
>> +            ['git', 'lfs', 'pointer', '--file=' + p4ContentTempFile.name],
>> +            stdout=subprocess.PIPE
>> +        )
>> +        lfsPointerFile = lfsProcess.stdout.read()
>> +        if lfsProcess.wait():
>> +            os.remove(p4ContentTempFile.name)
>> +            die('git-lfs command failed. Did you install the extension?')
>> +        contents = [i+'\n' for i in lfsPointerFile.split('\n')[2:][:-1]]
>> +
>> +        # Write P4 content to LFS
>> +        oid = contents[1].split(' ')[1].split(':')[1][:-1]
>> +        oidPath = os.path.join(cloneDestination, '.git', 'lfs', 'objects', oid[:2], oid[2:4])
>> +        if not os.path.isdir(oidPath):
>> +            os.makedirs(oidPath)
>> +        shutil.move(p4ContentTempFile.name, os.path.join(oidPath, oid))
> 
> This also does not look LFS-specific.
ok

Thank you! One thing I missed to noticed in the commit message:
Right now the LFS uploads need to be executed manually. I will automate this with git-p4 in v3.

Do you want to test this feature against a real backend? In that case you would need a LFS enabled GitHub account. If you don’t have one, maybe _Jeff King_ can help?

Thanks,
Lars


> 
>> +
>> +        # LFS Spec states that pointer files should not have the executable bit set.
>> +        gitMode = '100644'
>> +        return (gitMode, contents)
>> +
>>  class Command:
>>      def __init__(self):
>>          self.usage = "usage: %prog [options]"
>> @@ -2038,6 +2085,7 @@ class P4Sync(Command, P4UserMap):
>>          self.clientSpecDirs = None
>>          self.tempBranches = []
>>          self.tempBranchLocation = "git-p4-tmp"
>> +        self.largeFiles = []
>> 
>>          if gitConfig("git-p4.syncFromOrigin") == "false":
>>              self.syncWithOrigin = False
>> @@ -2158,6 +2206,59 @@ class P4Sync(Command, P4UserMap):
>> 
>>          return branches
>> 
>> +    def writeToGitStream(self, gitMode, relPath, contents):
>> +        self.gitStream.write('M %s inline %s\n' % (gitMode, relPath))
>> +        self.gitStream.write('data %d\n' % sum(len(d) for d in contents))
>> +        for d in contents:
>> +            self.gitStream.write(d)
>> +        self.gitStream.write('\n')
>> +
>> +    def writeGitAttributesToStream(self):
>> +        self.writeToGitStream(
>> +            '100644',
>> +            '.gitattributes',
>> +            [
>> +                '\n',
>> +                '#\n',
>> +                '# %s\n' % largeFileSystem().description(),
>> +                '#\n',
>> +            ] +
>> +            ['*.' + f.replace(' ', '[:space:]') + ' filter=%s -text\n' % largeFileSystem().attributeFilter()
>> +                for f in gitConfigList("git-p4.largeFileExtensions")
>> +            ] +
>> +            ['/' + f.replace(' ', '[:space:]') + ' filter=%s -text\n' % largeFileSystem().attributeFilter()
>> +                for f in self.largeFiles if not self.hasLargeFileExtension(f)
>> +            ]
>> +        )
>> +
>> +    def hasLargeFileExtension(self, relPath):
>> +        return reduce(
>> +            lambda a, b: a or b,
>> +            [relPath.endswith('.' + e) for e in gitConfigList('git-p4.largeFileExtensions')],
>> +            False
>> +        )
>> +
>> +    def exceedsLargeFileThreshold(self, relPath, contents):
>> +        if gitConfigInt('git-p4.largeFileThreshold'):
>> +            contentsSize = sum(len(d) for d in contents)
>> +            if contentsSize > gitConfigInt('git-p4.largeFileThreshold'):
>> +                return True
>> +        if gitConfigInt('git-p4.largeFileCompressedThreshold'):
>> +            contentFile = tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False)
>> +            for d in contents:
>> +                contentFile.write(d)
>> +            contentFile.flush()
>> +            compressedContentFile = tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False)
>> +            zf = zipfile.ZipFile(compressedContentFile.name, mode='w')
>> +            zf.write(contentFile.name, compress_type=zipfile.ZIP_DEFLATED)
>> +            zf.close()
>> +            compressedContentsSize = zf.infolist()[0].compress_size
>> +            os.remove(contentFile.name)
>> +            os.remove(compressedContentFile.name)
>> +            if compressedContentsSize > gitConfigInt('git-p4.largeFileCompressedThreshold'):
>> +                return True
>> +        return False
>> +
>>      # output one file from the P4 stream
>>      # - helper for streamP4Files
>> 
>> @@ -2226,17 +2327,20 @@ class P4Sync(Command, P4UserMap):
>>              text = regexp.sub(r'$\1$', text)
>>              contents = [ text ]
>> 
>> -        self.gitStream.write("M %s inline %s\n" % (git_mode, relPath))
>> +        if relPath == '.gitattributes':
>> +            die('.gitattributes already exists in P4.')
>> 
>> -        # total length...
>> -        length = 0
>> -        for d in contents:
>> -            length = length + len(d)
>> +        if (gitConfig('git-p4.largeFileSystem') and
>> +            (   self.exceedsLargeFileThreshold(relPath, contents) or
>> +                self.hasLargeFileExtension(relPath)
>> +            )):
>> +            (git_mode, contents) = largeFileSystem().generatePointer(self.cloneDestination, relPath, contents)
>> +            self.largeFiles.append(relPath)
>> +            self.writeGitAttributesToStream()
>> +            if verbose:
>> +                sys.stderr.write("%s added to large file system\n" % relPath)
>> 
>> -        self.gitStream.write("data %d\n" % length)
>> -        for d in contents:
>> -            self.gitStream.write(d)
>> -        self.gitStream.write("\n")
>> +        self.writeToGitStream(git_mode, relPath, contents)
>> 
>>      def streamOneP4Deletion(self, file):
>>          relPath = self.stripRepoPath(file['path'], self.branchPrefixes)
>> @@ -2244,6 +2348,10 @@ class P4Sync(Command, P4UserMap):
>>              sys.stderr.write("delete %s\n" % relPath)
>>          self.gitStream.write("D %s\n" % relPath)
>> 
>> +        if relPath in self.largeFiles:
>> +            self.largeFiles.remove(relPath)
>> +            self.writeGitAttributesToStream()
>> +
>>      # handle another chunk of streaming data
>>      def streamP4FilesCb(self, marshalled):
>> 
>> diff --git a/t/t9823-git-p4-lfs.sh b/t/t9823-git-p4-lfs.sh
>> new file mode 100755
>> index 0000000..a7587bf
>> --- /dev/null
>> +++ b/t/t9823-git-p4-lfs.sh
>> @@ -0,0 +1,263 @@
>> +#!/bin/sh
>> +
>> +test_description='Clone repositories and store files in LFS'
>> +
>> +git lfs help >/dev/null 2>&1 || {
>> +	skip_all='skipping git p4 LFS tests; no git lfs'
>> +	test_done
>> +}
>> +
>> +. ./lib-git-p4.sh
>> +
>> +test_file_in_lfs() {
>> +	FILE=$1
>> +	SIZE=$2
>> +	EXPECTED_CONTENT=$3
>> +	test_path_is_file $FILE &&
>> +	test_line_count = 3 $FILE &&
>> +	cat $FILE | grep "size $SIZE" &&
>> +	HASH=$(cat $FILE | grep "oid sha256:" | sed -e 's/oid sha256://g') &&
>> +	LFS_FILE=".git/lfs/objects/${HASH:0:2}/${HASH:2:2}/$HASH"
>> +	test_path_is_file $LFS_FILE &&
>> +	echo $EXPECTED_CONTENT >expect
>> +	test_cmp expect $LFS_FILE
>> +}
>> +
>> +test_expect_success 'start p4d' '
>> +	start_p4d
>> +'
>> +
>> +test_expect_success 'Create repo with binary files' '
>> +	client_view "//depot/... //client/..." &&
>> +	(
>> +		cd "$cli" &&
>> +
>> +		echo "txt 7b" >file1.txt &&
>> +		p4 add file1.txt &&
>> +		echo "bin 13 bytes" >file2.dat &&
>> +		p4 add file2.dat &&
>> +		p4 submit -d "Add text and binary file" &&
>> +
>> +		mkdir "path with spaces" &&
>> +		echo "bin 13 bytes" >"path with spaces/file3.bin" &&
>> +		p4 add "path with spaces/file3.bin" &&
>> +		p4 submit -d "Add another binary file with same content and spaces in path" &&
>> +
>> +		echo "bin 14 bytesx" >file4.bin &&
>> +		p4 add file4.bin &&
>> +		p4 submit -d "Add another binary file with different content"
>> +	)
>> +'
>> +
>> +test_expect_success 'Store files in LFS based on size (>12 bytes)' '
>> +	client_view "//depot/... //client/..." &&
>> +	test_when_finished cleanup_git &&
>> +	(
>> +		cd "$git" &&
>> +		git init . &&
>> +		git config git-p4.largeFileSystem LFS &&
>> +		git config git-p4.largeFileThreshold 12 &&
>> +		git p4 clone --use-client-spec --destination="$git" //depot@all &&
>> +
>> +		test_file_in_lfs file2.dat 13 "bin 13 bytes" &&
>> +		test_file_in_lfs "path with spaces/file3.bin" 13 "bin 13 bytes" &&
>> +		test_file_in_lfs file4.bin 14 "bin 14 bytesx" &&
>> +
>> +		find ".git/lfs/objects" -type f >actual &&
>> +		test_line_count = 2 actual &&
>> +
>> +		cat >expect <<-\EOF &&
>> +
>> +		#
>> +		# LFS (see https://git-lfs.github.com/)
>> +		#
>> +		/file2.dat filter=lfs -text
>> +		/path[:space:]with[:space:]spaces/file3.bin filter=lfs -text
>> +		/file4.bin filter=lfs -text
>> +		EOF
>> +		test_path_is_file .gitattributes &&
>> +		test_cmp expect .gitattributes
>> +	)
>> +'
>> +
>> +test_expect_success 'Store files in LFS based on size (>13 bytes)' '
>> +	client_view "//depot/... //client/..." &&
>> +	test_when_finished cleanup_git &&
>> +	(
>> +		cd "$git" &&
>> +		git init . &&
>> +		git config git-p4.largeFileSystem LFS &&
>> +		git config git-p4.largeFileThreshold 13 &&
>> +		git p4 clone --use-client-spec --destination="$git" //depot@all &&
>> +
>> +		test_file_in_lfs file4.bin 14 "bin 14 bytesx" &&
>> +
>> +		find ".git/lfs/objects" -type f >actual &&
>> +		test_line_count = 1 actual &&
>> +
>> +		cat >expect <<-\EOF &&
>> +
>> +		#
>> +		# LFS (see https://git-lfs.github.com/)
>> +		#
>> +		/file4.bin filter=lfs -text
>> +		EOF
>> +		test_path_is_file .gitattributes &&
>> +		test_cmp expect .gitattributes
>> +	)
>> +'
>> +
>> +test_expect_success 'Store files in LFS based on extension (dat)' '
>> +	client_view "//depot/... //client/..." &&
>> +	test_when_finished cleanup_git &&
>> +	(
>> +		cd "$git" &&
>> +		git init . &&
>> +		git config git-p4.largeFileSystem LFS &&
>> +		git config git-p4.largeFileExtensions dat &&
>> +		git p4 clone --use-client-spec --destination="$git" //depot@all &&
>> +
>> +		test_file_in_lfs file2.dat 13 "bin 13 bytes" &&
>> +
>> +		find ".git/lfs/objects" -type f >actual &&
>> +		test_line_count = 1 actual &&
>> +
>> +		cat >expect <<-\EOF &&
>> +
>> +		#
>> +		# LFS (see https://git-lfs.github.com/)
>> +		#
>> +		*.dat filter=lfs -text
>> +		EOF
>> +		test_path_is_file .gitattributes &&
>> +		test_cmp expect .gitattributes
>> +	)
>> +'
>> +
>> +test_expect_success 'Store files in LFS based on size (>13 bytes) and extension (dat)' '
>> +	client_view "//depot/... //client/..." &&
>> +	test_when_finished cleanup_git &&
>> +	(
>> +		cd "$git" &&
>> +		git init . &&
>> +		git config git-p4.largeFileSystem LFS &&
>> +		git config git-p4.largeFileExtensions dat &&
>> +		git config git-p4.largeFileThreshold 13 &&
>> +		git p4 clone --use-client-spec --destination="$git" //depot@all &&
>> +
>> +		test_file_in_lfs file2.dat 13 "bin 13 bytes" &&
>> +		test_file_in_lfs file4.bin 14 "bin 14 bytesx" &&
>> +
>> +		find ".git/lfs/objects" -type f >actual &&
>> +		test_line_count = 2 actual &&
>> +
>> +		cat >expect <<-\EOF &&
>> +
>> +		#
>> +		# LFS (see https://git-lfs.github.com/)
>> +		#
>> +		*.dat filter=lfs -text
>> +		/file4.bin filter=lfs -text
>> +		EOF
>> +		test_path_is_file .gitattributes &&
>> +		test_cmp expect .gitattributes
>> +	)
>> +'
>> +
>> +test_expect_success 'Remove file from repo and store files in LFS based on size (>12 bytes)' '
>> +	client_view "//depot/... //client/..." &&
>> +	(
>> +		cd "$cli" &&
>> +		p4 delete file4.bin &&
>> +		p4 submit -d "Remove file"
>> +	) &&
>> +
>> +	client_view "//depot/... //client/..." &&
>> +	test_when_finished cleanup_git &&
>> +	(
>> +		cd "$git" &&
>> +		git init . &&
>> +		git config git-p4.largeFileSystem LFS &&
>> +		git config git-p4.largeFileThreshold 12 &&
>> +		git p4 clone --use-client-spec --destination="$git" //depot@all &&
>> +
>> +		test_file_in_lfs file2.dat 13 "bin 13 bytes" &&
>> +		test_file_in_lfs "path with spaces/file3.bin" 13 "bin 13 bytes" &&
>> +		! test_path_is_file file4.bin &&
>> +
>> +		find ".git/lfs/objects" -type f >actual &&
>> +		# 3 is correct here as some commits in the tree still reference the deleted file
>> +		test_line_count = 3 actual
>> +
>> +		cat >expect <<-\EOF &&
>> +
>> +		#
>> +		# LFS (see https://git-lfs.github.com/)
>> +		#
>> +		/file2.dat filter=lfs -text
>> +		/path[:space:]with[:space:]spaces/file3.bin filter=lfs -text
>> +		EOF
>> +		test_path_is_file .gitattributes &&
>> +		test_cmp expect .gitattributes
>> +	)
>> +'
>> +
>> +test_expect_success 'Add big files to repo and store files in LFS based on compressed size (>19 bytes)' '
>> +	client_view "//depot/... //client/..." &&
>> +	(
>> +		cd "$cli" &&
>> +		echo "bin 24 bytesxxxxxxxxxxx" >file5.bin &&
>> +		p4 add file5.bin &&
>> +		p4 submit -d "Add another binary file with different content" &&
>> +
>> +		echo "bin 34 bytesxxxxxxxxxxxzzzzzzzzzz" >file6.bin &&
>> +		p4 add file6.bin &&
>> +		p4 submit -d "Add another binary file with different content"
>> +	) &&
>> +
>> +	client_view "//depot/... //client/..." &&
>> +	test_when_finished cleanup_git &&
>> +	(
>> +		cd "$git" &&
>> +		git init . &&
>> +		git config git-p4.largeFileSystem LFS &&
>> +		git config git-p4.largeFileCompressedThreshold 19 &&
>> +		# We only import HEAD here ("@all" is missing!)
>> +		git p4 clone --use-client-spec --destination="$git" //depot &&
>> +
>> +		test_file_in_lfs file6.bin 13 "bin 34 bytesxxxxxxxxxxxzzzzzzzzzz" &&
>> +
>> +		find ".git/lfs/objects" -type f >actual &&
>> +		test_line_count = 1 actual
>> +
>> +		cat >expect <<-\EOF &&
>> +
>> +		#
>> +		# LFS (see https://git-lfs.github.com/)
>> +		#
>> +		/file6.bin filter=lfs -text
>> +		EOF
>> +		test_path_is_file .gitattributes &&
>> +		test_cmp expect .gitattributes
>> +	)
>> +'
>> +
>> +test_expect_success 'Clone repo with existing .gitattributes file' '
>> +	client_view "//depot/... //client/..." &&
>> +	(
>> +		cd "$cli" &&
>> +
>> +		echo "*.txt text" >.gitattributes &&
>> +		p4 add .gitattributes &&
>> +		p4 submit -d "Add .gitattributes"
>> +	) &&
>> +
>> +	test_must_fail git p4 clone --use-client-spec --destination="$git" //depot 2>errs &&
>> +	grep ".gitattributes already exists in P4." errs
>> +'
>> +
>> +test_expect_success 'kill p4d' '
>> +	kill_p4d
>> +'
>> +
>> +test_done

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]