> -----Original Message-----
> From: Edward Kay [mailto:edward@xxxxxxxxxx]
> Sent: Wednesday, March 12, 2008 7:13 AM
> To: mathieu leddet; php-general@xxxxxxxxxxxxx
> Subject: RE: Comparing files
>
> > -----Original Message-----
> > From: mathieu leddet [mailto:mathieu.leddet@xxxxxxxxxxxxxxx]
> > Sent: 12 March 2008 11:04
> > To: php-general@xxxxxxxxxxxxx
> > Subject: Comparing files
> >
> > Hi all,
> >
> > I have a simple question : how can I ensure that 2 files are identical ?
> >
> > How about this ?
> >
> > --------8<------------------------------------------------------
> >
> > function files_identical($path1, $path2) {
> >
> >   return (file_get_contents($path1) == file_get_contents($path2));
> >
> > }
> >
> > --------8<------------------------------------------------------
> >
> > Note that I would like to compare any type of files (text and binary).
> >
> > Thanks for any help,
>
> Depending upon the size of the files, I would expect it would be quicker to
> compare a hash of each file.
>
> Edward

I don't see how comparing hashes can be faster than comparing contents, except in two cases: big files, where you will likely hit the memory limit first, and files that differ only at the very end, where the comparison cannot stop until it gets there. If the file sizes vary a lot, however, a mixed strategy would be the winner. And you will certainly want to store path names and calculated hashes in a database of some kind to save yourself from hogging the server each time (yeah, CPU and RAM are cheap, but they are not unlimited resources).

Comparing hashes means that a hash must be calculated for both file A and file B, and that overhead grows with the file size (right or wrong?). Comparing the file contents has its own overhead for buffering and moving the contents into memory, and it is also a linear operation (strings are compared byte by byte until there's a difference). So... why not do the following?
1 - Compare file sizes (this is just a property stored in the file system structures, right?). If the sizes are different, the files are different. Otherwise move to step 2.

2 - If the files are smaller than a certain size (up to you to find the optimal threshold), just compare contents through, say, file_get_contents. Otherwise move to step 3.

3 - Grab some bytes at the beginning, the middle and the end of both files (at the same offsets in each) and compare them. If they are different, the files are different. Otherwise move to step 4.

4 - If you reach this point, you are doomed. You have 2 big files that you must compare and they look equal so far. Comparing contents in memory would be overkill, if it is possible at all, so you will want to generate hashes and compare them. Run md5_file on both files (it would be great if you had, say, file A's hash already calculated and stored in a DB or data file) and compare the results.

It always depends on what kind of files you are dealing with. If the files often differ only at the end of the stream, you may want to skip step 2. But this is what I would generally do.

By the way, md5 is a great hashing function, but it is not bullet-proof: collisions may happen (though it's much better than crc32, for example). So you may also want to think about how critical it is for you to avoid false positives (files that md5_file considers equal when they are not), and probably use some diff-like solution instead of md5_file. Anyway, having compared sizes and sampled bytes (steps 1 through 3), it's very likely that md5_file will catch it if two files differ in just a few bytes.
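Something along these lines is what I have in mind for steps 1 through 4. Take it as a rough sketch, not a tuned implementation: the 1 MB threshold, the 4 KB sample length and the read_sample helper are my own assumptions, and I sample fixed offsets (start, middle, end) rather than truly random ones. It also uses === instead of ==, because == compares two numeric strings as numbers, so files containing "1e1" and "010" would be reported as identical.

```php
<?php
/**
 * Read up to $length bytes from $path starting at $offset.
 * Hypothetical helper, introduced only for this sketch.
 */
function read_sample(string $path, int $offset, int $length): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    fseek($handle, $offset);
    $data = fread($handle, $length);
    fclose($handle);
    return $data === false ? '' : $data;
}

/**
 * Mixed strategy: sizes, then contents (small files),
 * then byte samples, then md5 hashes (big files).
 * $smallLimit and $sampleLength are assumed values; tune them yourself.
 */
function files_identical(
    string $path1,
    string $path2,
    int $smallLimit = 1048576,
    int $sampleLength = 4096
): bool {
    // filesize() results are cached by PHP, so drop any stale entries first.
    clearstatcache();

    // Step 1: different sizes mean different files.
    $size = filesize($path1);
    if ($size === false || $size !== filesize($path2)) {
        return false;
    }

    // Step 2: small files are compared directly in memory.
    if ($size <= $smallLimit) {
        return file_get_contents($path1) === file_get_contents($path2);
    }

    // Step 3: compare samples at the start, middle and end (same offsets).
    $offsets = [0, intdiv($size, 2), max(0, $size - $sampleLength)];
    foreach ($offsets as $offset) {
        $sample1 = read_sample($path1, $offset, $sampleLength);
        $sample2 = read_sample($path2, $offset, $sampleLength);
        if ($sample1 !== $sample2) {
            return false;
        }
    }

    // Step 4: fall back to hashes; a failed md5_file() counts as "different".
    $hash1 = md5_file($path1);
    return $hash1 !== false && $hash1 === md5_file($path2);
}
```

If you already keep file A's hash in a DB, step 4 only needs one md5_file call, on file B, and that is where most of the saving would come from.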
Regards,

Rob

Andrés Robinet | Lead Developer | BESTPLACE CORPORATION