Hi Mathias,

On 01/19/2016 09:35 AM, Mathias Mueller wrote:
> Hi Phil,
>
> I forgot to add some information: when I was creating the bytestrings
> from my jpg file, I did not start from 0k but from 100k of the jpg file
> (to skip the jpg header).

Ok. But I'm still not confident of chunk boundaries.

>> Very interesting. You could go one step further and compare the jpeg
>> file contents in the first 1M against the locations found to determine
>> where the chunks actually start and end on each device. The final
>> offset will be a chunk multiple before these boundaries. Or do md5 sums
>> of 4k blocks to reduce the amount to inspect.
>
> How exactly can I do this? Should I create more Bytestrings and do more
> brep with them on my physical devices? I have already results from
> searching bytestrings with an offset of 64k (starting from 100k to 612k
> of my jpeg file, so 9 bytestrings at all). Should I provide a table of
> the results?

Sigh. I couldn't help myself. New utility attached. Curse you Mathias
for an interesting problem! ;-)

Call it with your jpeg and the devices to search, like so:

findHash.py /path/to/picture.jpeg /dev/sd[bcde]

It'll make a map of hashes of each 4k block in the jpeg and then search
the listed devices for those hashes, building a map of the file
fragments. This will clearly show chunk boundaries.

Please show the output.

Phil

#! /usr/bin/python2
#
# Locate 4k fragments of a subject file in one or more other files or
# devices. Only reports two or more consecutive matches.
#
# Usage:
# findHash.py /path/to/subject/file /dev/sdx|/path/to/image/file [/dev/sdy ...]

import hashlib, sys, datetime

# Read the known file 4k at a time, building a dictionary of
# md5 hashes vs. offset. Use a large buffer for speed.
# Drops any partial block at the end of the file.
d = {}
pos = long(0)
f = open(sys.argv[1], 'r', 1<<20)
b = f.read(4096)
while len(b)==4096:
    md5 = hashlib.md5()
    md5.update(b)
    h = md5.digest()
    hlist = d.get(h)
    if not hlist:
        hlist = []
        d[h] = hlist
        # print "New hash %s at %8.8x" % (h.encode('hex'), pos)
    hlist.append(pos)
    pos += 4096
    b = f.read(4096)
f.close()

print "%d Unique hashes in %s" % (len(d), sys.argv[1])

def checkAndPrint(match):
    if match[2]>4096:
        print "%20s @ %12.12x:%12.12x ~= %8.8x:%8.8x" % (fname, match[1], match[1]+match[2]-1, match[0], match[0]+match[2]-1)

# Read the candidate files/devices, looking for possible matches. Match
# entries are vectors of known file offset, candidate file offset, and
# length.
for fname in sys.argv[2:]:
    print "\nSearching for pieces of %s in %s:..." % (sys.argv[1], fname)
    pos = long(0)
    f = open(fname, 'r', 1<<24)
    matches = []
    b = f.read(4096)
    lastts = None
    while len(b)==4096:
        if not (pos & 0x7ffffff):
            ts = datetime.datetime.now()
            if lastts:
                print "@ %12.12x %.1fMB/s \r" % (pos, 128.0/((ts-lastts).total_seconds())),
            else:
                print "@ %12.12x...\r" % pos,
            sys.stdout.flush()
            lastts = ts
        md5 = hashlib.md5()
        md5.update(b)
        h = md5.digest()
        if h in d:
            i = 0
            while i<len(matches):
                match = matches[i]
                target = match[0]+match[2]
                continuations = [x for x in d[h] if x==target]
                if continuations:
                    match[2] += 4096
                    i += 1
                else:
                    del matches[i]
                    checkAndPrint(match)
            if not matches:
                matches = [[x, pos, 4096] for x in d[h]]
        else:
            for match in matches:
                checkAndPrint(match)
            matches = []
        pos += 4096
        b = f.read(4096)
    print "End of %s at %12.12x" % (fname, pos)
    # show matches that continue to the end of the candidate file/device.
    for match in matches:
        checkAndPrint(match)
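
For anyone who wants to try the idea without pointing it at raw devices, here is a minimal Python 3 sketch of the same technique on in-memory byte strings. It is not the attached utility: the names `BLOCK`, `index_blocks`, and `find_runs` are made up for illustration, it follows only one run at a time (the script above tracks several candidate matches), and the "device" is a synthetic buffer that assumes two 8k chunks of the subject file were laid down back to back, roughly as one member of a striped array might hold them.

```python
import hashlib

BLOCK = 4096  # same block size the script above hashes with


def index_blocks(subject: bytes) -> dict:
    """Map the md5 digest of each full 4k block to the offsets where it occurs."""
    index = {}
    for off in range(0, len(subject) - BLOCK + 1, BLOCK):
        digest = hashlib.md5(subject[off:off + BLOCK]).digest()
        index.setdefault(digest, []).append(off)
    return index


def find_runs(index: dict, candidate: bytes) -> list:
    """Return (subject_offset, candidate_offset, length) for runs of 2+ blocks.

    Simplification: only one run is followed at a time, and a new run
    starts from the first subject offset recorded for a digest.
    """
    runs = []
    current = None  # [subject_offset, candidate_offset, length]
    for pos in range(0, len(candidate) - BLOCK + 1, BLOCK):
        digest = hashlib.md5(candidate[pos:pos + BLOCK]).digest()
        offsets = index.get(digest, [])
        if current and current[0] + current[2] in offsets:
            current[2] += BLOCK  # this block continues the current run
            continue
        if current and current[2] > BLOCK:
            runs.append(tuple(current))  # run ended; keep it if 2+ blocks
        current = [offsets[0], pos, BLOCK] if offsets else None
    if current and current[2] > BLOCK:
        runs.append(tuple(current))  # run reached the end of the candidate
    return runs


# Synthetic subject file: eight distinct 4k blocks (32k total).
subject = b"".join(bytes([i]) * BLOCK for i in range(1, 9))

# Synthetic "device": three blocks of filler, then subject bytes 0-8k,
# then subject bytes 16k-24k, then one more block of filler.
filler = bytes(BLOCK)
device = filler * 3 + subject[0:8192] + subject[16384:24576] + filler

runs = find_runs(index_blocks(subject), device)
for subj_off, dev_off, length in runs:
    print("@ %12.12x:%12.12x ~= %8.8x:%8.8x" % (
        dev_off, dev_off + length - 1, subj_off, subj_off + length - 1))
```

In this toy layout the subject offset jumps from 0x1fff to 0x4000 at device offset 0x5000 while the device offsets stay contiguous. That kind of discontinuity in the real output is what marks a chunk boundary on a member device.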