> -----Original Message-----
> From: Andrés Robinet [mailto:agrobinet@xxxxxxxxxxxxx]
> Sent: 12 March 2008 12:33
> To: 'Edward Kay'; 'mathieu leddet'; php-general@xxxxxxxxxxxxx
> Subject: RE: Comparing files
>
> > -----Original Message-----
> > From: Edward Kay [mailto:edward@xxxxxxxxxx]
> > Sent: Wednesday, March 12, 2008 7:13 AM
> > To: mathieu leddet; php-general@xxxxxxxxxxxxx
> > Subject: RE: Comparing files
> >
> > > -----Original Message-----
> > > From: mathieu leddet [mailto:mathieu.leddet@xxxxxxxxxxxxxxx]
> > > Sent: 12 March 2008 11:04
> > > To: php-general@xxxxxxxxxxxxx
> > > Subject: Comparing files
> > >
> > > Hi all,
> > >
> > > I have a simple question: how can I ensure that 2 files are identical?
> > >
> > > How about this?
> > >
> > > --------8<------------------------------------------------------
> > >
> > > function files_identical($path1, $path2) {
> > >     // === (not ==) avoids numeric-string coercion and a false match
> > >     // when both reads fail and return false
> > >     return (file_get_contents($path1) === file_get_contents($path2));
> > > }
> > >
> > > --------8<------------------------------------------------------
> > >
> > > Note that I would like to compare any type of file (text and binary).
> > >
> > > Thanks for any help,
> > >
> >
> > Depending upon the size of the files, I would expect it would be quicker
> > to compare a hash of each file.
> >
> > Edward
> >
>
> I don't understand how comparing hashes can be faster than comparing
> contents, except for big files (for which you will likely hit the memory
> limit first) and for files that only differ at the very end, since a
> byte-by-byte comparison only stops when it reaches that difference. If the
> file sizes vary too much, however, a mixed strategy would be the winner;
> and certainly, you will want to store path names and calculated hashes in
> a database of some kind to save yourself from hogging the server each time
> (yeah, CPU and RAM are cheap, but they are not unlimited resources).
>
> Comparing hashes means that a hash must be calculated for files A and B,
> and the related overhead will increase with the file size (right or
> wrong?). Comparing the file contents has an associated overhead for
> buffering and moving the file contents into memory, and it is also a
> linear operation (strings are compared byte by byte until there is a
> difference). So... why not do the following?
>
> 1 - Compare file sizes (this is just a property stored in the file system
> structures, right?). If the sizes are different, the files are different.
> Otherwise move to step 2.
> 2 - If the files are smaller than a certain size (it is up to you to find
> the optimal threshold), just compare contents through, say,
> file_get_contents. Otherwise move to step 3.
> 3 - Grab some random bytes at the beginning, in the middle and at the end
> of both files and compare them. If they are different, the files are
> different. Otherwise move to step 4.
> 4 - If you reach this point, you are doomed. You have 2 big files that you
> must compare and that appear equal so far. Comparing the full contents
> would be overkill, if at all possible, so you will want to generate hashes
> and compare them. Run md5_file on both files (it would be great if you had,
> say, file A's hash already calculated and stored in a DB or data file) and
> compare the results.
>
> It always depends on what kind of files you are dealing with; if the files
> often differ only at the end of the stream, you may want to skip step 2.
> But this is what I would generally do (see the sketch below).
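For concreteness, here is a minimal PHP sketch of the four-step strategy
above. The function name files_identical_mixed and both thresholds (a
512 KB cutoff for step 2, 64-byte probes for step 3) are invented for
illustration, and error handling is deliberately thin:

--------8<------------------------------------------------------

// Invented thresholds - tune them with real benchmarks
define('SMALL_FILE_LIMIT', 524288); // step 2 cutoff: 512 KB
define('SAMPLE_SIZE', 64);          // bytes per probe in step 3

function files_identical_mixed($path1, $path2) {
    // Step 1: different sizes means different files
    $size = filesize($path1);
    if ($size === false || $size !== filesize($path2)) {
        return false;
    }

    // Step 2: small files are cheap to compare in full
    if ($size <= SMALL_FILE_LIMIT) {
        return file_get_contents($path1) === file_get_contents($path2);
    }

    // Step 3: probe byte ranges at the start, middle and end
    $fp1 = fopen($path1, 'rb');
    $fp2 = fopen($path2, 'rb');
    foreach (array(0, (int)($size / 2), $size - SAMPLE_SIZE) as $offset) {
        fseek($fp1, $offset);
        fseek($fp2, $offset);
        if (fread($fp1, SAMPLE_SIZE) !== fread($fp2, SAMPLE_SIZE)) {
            fclose($fp1);
            fclose($fp2);
            return false;
        }
    }
    fclose($fp1);
    fclose($fp2);

    // Step 4: fall back to hashing (ideally with one side's hash
    // precomputed and stored in a DB, as suggested above)
    return md5_file($path1) === md5_file($path2);
}

--------8<------------------------------------------------------

Because step 1 guarantees equal sizes before step 3 runs, the probes can
never reach past the end of either file; the offsets here are fixed rather
than random, which keeps the sketch deterministic.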
>
> By the way, md5 is a great hashing function, but it is not bullet-proof:
> collisions may happen (though it is much better than crc32, for example).
> So you may also want to think about how critical it is for you to have
> some false positives (files that md5_file considers equal when they are
> not) and perhaps use some diff-like solution instead of md5_file. Anyway,
> having compared sizes and random bytes (steps 1 through 3), it is very
> likely that md5_file will catch it if two files differ in just a few
> bytes.

Agreed. In my first reply, I meant that hashes would likely be quicker/more
memory friendly when handling larger files, but this is just a hunch - I
haven't benchmarked anything. It was really meant to give the OP other
possibilities to look into.

Edward
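As for the "diff-like solution instead of md5_file" mentioned above, one
option is to stream both files in fixed-size chunks and compare the bytes
directly; unlike a hash, this cannot produce false positives and it stops
at the first difference. A minimal sketch, with the helper name
streams_identical and the 8 KB default chunk size invented for
illustration:

--------8<------------------------------------------------------

function streams_identical($path1, $path2, $chunkSize = 8192) {
    $fp1 = fopen($path1, 'rb');
    $fp2 = fopen($path2, 'rb');
    if ($fp1 === false || $fp2 === false) {
        return false; // treat unreadable files as "not identical"
    }

    $same = true;
    while (!feof($fp1)) {
        // Memory use stays bounded by the chunk size
        $chunk1 = fread($fp1, $chunkSize);
        $chunk2 = fread($fp2, $chunkSize);
        if ($chunk1 === false || $chunk2 === false || $chunk1 !== $chunk2) {
            $same = false; // differing bytes, or a failed read
            break;
        }
    }
    // Both streams must reach EOF together, or one file is longer
    $same = $same && feof($fp1) && feof($fp2);

    fclose($fp1);
    fclose($fp2);
    return $same;
}

--------8<------------------------------------------------------

The trade-off is that two genuinely identical big files are read in full on
every call, with nothing cacheable for next time - which is exactly why the
size/sample/hash strategy above is attractive for repeated comparisons.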