RE: Comparing files

Andrés Robinet <agrobinet@xxxxxxxxxxxxx> · Wed, 12 Mar 2008 08:33:00 -0400

> -----Original Message-----
> From: Edward Kay [mailto:edward@xxxxxxxxxx]
> Sent: Wednesday, March 12, 2008 7:13 AM
> To: mathieu leddet; php-general@xxxxxxxxxxxxx
> Subject: RE:  Comparing files
> 
> 
> 
> > -----Original Message-----
> > From: mathieu leddet [mailto:mathieu.leddet@xxxxxxxxxxxxxxx]
> > Sent: 12 March 2008 11:04
> > To: php-general@xxxxxxxxxxxxx
> > Subject:  Comparing files
> >
> >
> > Hi all,
> >
> > I have a simple question : how can I ensure that 2 files are identical ?
> >
> > How about this ?
> >
> > --------8<------------------------------------------------------
> >
> > function files_identical($path1, $path2) {
> >
> >   return (file_get_contents($path1) == file_get_contents($path2));
> >
> > }
> >
> > --------8<------------------------------------------------------
> >
> > Note that I would like to compare any type of files (text and binary).
> >
> > Thanks for any help,
> >
> 
> Depending upon the size of the files, I would expect it would be quicker to
> compare a hash of each file.
> 
> Edward
> 

I don't see how comparing hashes can be faster than comparing contents, except
for big files, where reading both into memory will likely hit the memory limit
first, and for files that differ only at the very end, where the comparison
cannot stop early. If the file sizes vary a lot, however, a mixed strategy
would be the winner. And you will certainly want to store path names and
calculated hashes in a database of some kind, so you don't hog the server each
time (yeah, CPU and RAM are cheap, but they are not unlimited resources).
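
To make the "store the hashes" idea concrete, here is a minimal sketch. The
function name cached_md5 and the plain array used as the store are my own
assumptions (a real setup would use a database table or data file), and it
assumes PHP 7 or later. A stored hash is reused only while the file's size and
modification time are unchanged.

```php
<?php
// Hypothetical helper: returns the MD5 of $path, recomputing it only when the
// file's size or modification time has changed since the last call.
// $cache is a plain array standing in for a database table or data file.
function cached_md5(string $path, array &$cache): string
{
    clearstatcache(true, $path);   // avoid stale stat() results
    $size  = filesize($path);
    $mtime = filemtime($path);
    if ($size === false || $mtime === false) {
        throw new RuntimeException("Cannot stat $path");
    }

    $entry = $cache[$path] ?? null;
    if ($entry !== null
        && $entry['size'] === $size
        && $entry['mtime'] === $mtime) {
        return $entry['md5'];      // cache hit: the file is not read at all
    }

    $md5 = md5_file($path);
    if ($md5 === false) {
        throw new RuntimeException("Cannot hash $path");
    }
    $cache[$path] = ['size' => $size, 'mtime' => $mtime, 'md5' => $md5];
    return $md5;
}
```

Size plus mtime is only a heuristic for "unchanged": a file rewritten with the
same length within the same second would be missed.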

Comparing hashes means that a hash must be calculated for both file A and file
B, and that overhead grows with the file size (right or wrong?). Comparing the
file contents has its own overhead for buffering and moving the contents into
memory, and it's also a linear operation (strings are compared byte by byte
until there's a difference). So... why not do the following?
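
As an aside on the memory cost of comparing contents: the comparison does not
have to load both files whole. A rough sketch, assuming PHP 7 or later, with a
function name and chunk size that are my own choices, reads both files in
fixed-size chunks and stops at the first chunk that differs. It uses === so
that PHP never compares numeric-looking strings as numbers.

```php
<?php
// Hypothetical helper: byte-for-byte comparison in fixed-size chunks, so
// memory use stays bounded and the loop stops at the first difference.
function files_identical_streamed(
    string $path1,
    string $path2,
    int $chunkSize = 8192
): bool {
    $h1 = fopen($path1, 'rb');
    $h2 = fopen($path2, 'rb');
    if ($h1 === false || $h2 === false) {
        if ($h1 !== false) { fclose($h1); }
        if ($h2 !== false) { fclose($h2); }
        throw new RuntimeException('Cannot open one of the files');
    }

    try {
        while (true) {
            $a = fread($h1, $chunkSize);
            $b = fread($h2, $chunkSize);
            if ($a === false || $b === false) {
                throw new RuntimeException('Read error');
            }
            if ($a !== $b) {
                return false;      // first differing chunk: stop early
            }
            if ($a === '') {
                return true;       // both files reached EOF together
            }
        }
    } finally {
        fclose($h1);
        fclose($h2);
    }
}
```

This is still linear in the file size when the files are equal, but it never
holds more than two chunks in memory.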

1 - Compare file sizes (this is just a property stored in the file system
structures, right?). If the sizes are different, the files are different.
Otherwise move to step 2.
2 - If the files are smaller than a certain size (up to you to find the
optimal threshold), just compare contents through, say, file_get_contents.
Otherwise move to step 3.
3 - Grab some random bytes at the beginning, in the middle and at the end of
both files (same offsets in each) and compare them. If they are different, the
files are different. Otherwise move to step 4.
4 - If you reach this point, you are doomed. You have 2 big files that you must
compare and they are apparently equal so far. Comparing contents in memory
would be overkill, if it is possible at all, so you will want to generate
hashes and compare them. Run md5_file on both files (it would be great if you
had, say, file A's hash already calculated and stored in a DB or data file) and
compare the results.
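
The four steps above could be sketched roughly like this. The function name,
the 1 MiB threshold and the 4 KiB sample length are assumptions of mine, not
tuned values, and the code assumes PHP 7 or later. For simplicity it samples
at fixed offsets (start, middle, end) rather than random ones.

```php
<?php
// Sketch of steps 1-4. The names, the 1 MiB threshold and the 4 KiB sample
// length are illustrative assumptions, not tuned values.
function files_identical_mixed(
    string $path1,
    string $path2,
    int $smallLimit = 1048576,
    int $sampleLen = 4096
): bool {
    // Step 1: different sizes means different files.
    clearstatcache();
    $size1 = filesize($path1);
    $size2 = filesize($path2);
    if ($size1 === false || $size2 === false) {
        throw new RuntimeException('Cannot read the size of one of the files');
    }
    if ($size1 !== $size2) {
        return false;
    }

    // Step 2: small files are cheap to compare in memory.
    if ($size1 <= $smallLimit) {
        $a = file_get_contents($path1);
        $b = file_get_contents($path2);
        if ($a === false || $b === false) {
            throw new RuntimeException('Cannot read one of the files');
        }
        return $a === $b;
    }

    // Step 3: compare samples at the start, middle and end (fixed offsets
    // here, identical for both files, rather than random ones).
    $h1 = fopen($path1, 'rb');
    $h2 = fopen($path2, 'rb');
    if ($h1 === false || $h2 === false) {
        throw new RuntimeException('Cannot open one of the files');
    }
    $same = true;
    foreach ([0, intdiv($size1, 2), max(0, $size1 - $sampleLen)] as $offset) {
        fseek($h1, $offset);
        fseek($h2, $offset);
        if (fread($h1, $sampleLen) !== fread($h2, $sampleLen)) {
            $same = false;
            break;
        }
    }
    fclose($h1);
    fclose($h2);
    if (!$same) {
        return false;
    }

    // Step 4: both files are big and look equal so far, so compare hashes.
    return md5_file($path1) === md5_file($path2);
}
```

Usage would be something like files_identical_mixed('a.bin', 'b.bin'). A true
result for big files means "same size, same samples, same MD5", which is not
quite the same thing as a byte-for-byte match.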

It always depends on what kind of files you are dealing with. If the files
often differ only at the end of the stream, you may want to skip step 2. But
this is what I would generally do.

By the way, md5 is a great hashing function, but it is not bullet-proof:
collisions may happen (though they are far less likely than with crc32, for
example). So you may also want to think about how critical it is for you to
avoid false positives (files that md5_file considers equal when they are not)
and perhaps use some diff-like solution instead of md5_file. Anyway, having
compared sizes and random bytes (steps 1 through 3), it's very likely that
md5_file will catch it if two files differ in just a few bytes.
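
If MD5 collisions are a worry, one option (my suggestion, not something
discussed earlier in the thread) is to swap in a longer digest through PHP's
hash_file function. A small sketch, where the function name and the sha256
default are assumptions:

```php
<?php
// Hypothetical helper: compares two files by digest, with the algorithm as a
// parameter so md5 can be replaced by a longer hash such as sha256.
function files_same_digest(
    string $path1,
    string $path2,
    string $algo = 'sha256'
): bool {
    $d1 = hash_file($algo, $path1);
    $d2 = hash_file($algo, $path2);
    if ($d1 === false || $d2 === false) {
        throw new RuntimeException('Cannot hash one of the files');
    }
    return $d1 === $d2;
}
```

Any hash can still collide in principle, so when a false positive is
unacceptable, only a full byte-for-byte comparison settles it.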

Regards,

Rob

Andrés Robinet | Lead Developer | BESTPLACE CORPORATION 
5100 Bayview Drive 206, Royal Lauderdale Landings, Fort Lauderdale, FL 33308 |
TEL 954-607-4207 | FAX 954-337-2695 | 
Email: info@xxxxxxxxxxxxx  | MSN Chat: best@xxxxxxxxxxxxx  |  SKYPE: bestplace |
 Web: bestplace.biz  | Web: seo-diy.com
