RE: Comparing files

> -----Original Message-----
> From: Andrés Robinet [mailto:agrobinet@xxxxxxxxxxxxx]
> Sent: 12 March 2008 12:33
> To: 'Edward Kay'; 'mathieu leddet'; php-general@xxxxxxxxxxxxx
> Subject: RE:  Comparing files
>
>
> > -----Original Message-----
> > From: Edward Kay [mailto:edward@xxxxxxxxxx]
> > Sent: Wednesday, March 12, 2008 7:13 AM
> > To: mathieu leddet; php-general@xxxxxxxxxxxxx
> > Subject: RE:  Comparing files
> >
> >
> >
> > > -----Original Message-----
> > > From: mathieu leddet [mailto:mathieu.leddet@xxxxxxxxxxxxxxx]
> > > Sent: 12 March 2008 11:04
> > > To: php-general@xxxxxxxxxxxxx
> > > Subject:  Comparing files
> > >
> > >
> > > Hi all,
> > >
> > > I have a simple question: how can I ensure that two files are
> > > identical?
> > >
> > > How about this?
> > >
> > > --------8<------------------------------------------------------
> > >
> > > function files_identical($path1, $path2) {
> > >     $a = file_get_contents($path1);
> > >     $b = file_get_contents($path2);
> > >     // file_get_contents() returns false on failure; use === so
> > >     // two failed reads (false == false) don't count as "identical"
> > >     return $a !== false && $a === $b;
> > > }
> > >
> > > --------8<------------------------------------------------------
> > >
> > > Note that I would like to compare any type of file (text or binary).
> > >
> > > Thanks for any help,
> > >
> >
> > Depending upon the size of the files, I would expect it would be
> > quicker to compare a hash of each file.
> >
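> > For example, something like this (an untested sketch; the function
> > name is made up). Note that md5_file() hashes the file in a streaming
> > fashion rather than loading it all into memory the way
> > file_get_contents() does:
> >
> > --------8<------------------------------------------------------
> >
> > function files_identical_by_hash($path1, $path2) {
> >     // md5_file() returns false on failure; guard so that two
> >     // unreadable files don't compare as "identical"
> >     $h1 = md5_file($path1);
> >     return $h1 !== false && $h1 === md5_file($path2);
> > }
> >
> > --------8<------------------------------------------------------
> >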
> > Edward
> >
>
> I don't understand how comparing hashes can be faster than comparing
> contents, except for big files, where you will likely hit the memory
> limit first, and for files that only differ from each other at the very
> end, where the comparison is only halted at that point anyway. If the
> file sizes vary too much, however, a mixed strategy would be the
> winner; and you will certainly want to store path names and calculated
> hashes in a database of some kind to save yourself from hogging the
> server each time (yeah, CPU and RAM are cheap, but they are not
> unlimited resources).
>
> Comparing hashes means that a hash must be calculated for files A and
> B, and the related overhead will increase with the file size (right or
> wrong?). Comparing the file contents has an associated overhead for
> buffering and moving the file contents into memory, and it's also a
> linear operation (strings are compared byte by byte until there's a
> difference). So... why not do the following?
>
> 1 - Compare the file sizes (this is just a property stored in the file
> system structures, right?). If the sizes are different, the files are
> different. Otherwise move to step 2.
> 2 - If the files are smaller than a certain size (up to you to find the
> optimal threshold), just compare their contents through, say,
> file_get_contents. Otherwise move to step 3.
> 3 - Grab some random bytes at the beginning, the middle and the end of
> both files and compare them. If they are different, the files are
> different. Otherwise move to step 4.
> 4 - If you reach this point, you are doomed. You have two big files
> that you must compare, and they are apparently equal so far. Comparing
> the full contents would be overkill, if at all possible, so you will
> want to generate hashes and compare them. Run md5_file on both files
> (it would be great if you had, say, file A's hash already calculated
> and stored in a DB or data file) and compare the results (a rough
> sketch of the whole thing follows below).
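>
> Something like this, for example (untested; the function name and the
> 1 MB / 64-byte thresholds are made up, and the "random" offsets are
> approximated with fixed ones for brevity):
>
> --------8<------------------------------------------------------
>
> function files_probably_identical($path1, $path2) {
>     // Step 1: compare sizes (a cheap stat() call)
>     $size = filesize($path1);
>     if ($size === false || $size !== filesize($path2)) {
>         return false;
>     }
>
>     // Step 2: small files, just compare the contents directly
>     if ($size < 1048576) { // 1 MB cutoff, tune to taste
>         return file_get_contents($path1) === file_get_contents($path2);
>     }
>
>     // Step 3: spot-check 64 bytes at the start, middle and end
>     $fp1 = fopen($path1, 'rb');
>     $fp2 = fopen($path2, 'rb');
>     if ($fp1 === false || $fp2 === false) {
>         return false;
>     }
>     foreach (array(0, (int)($size / 2), $size - 64) as $offset) {
>         fseek($fp1, $offset);
>         fseek($fp2, $offset);
>         if (fread($fp1, 64) !== fread($fp2, 64)) {
>             fclose($fp1);
>             fclose($fp2);
>             return false;
>         }
>     }
>     fclose($fp1);
>     fclose($fp2);
>
>     // Step 4: fall back to hashing both files in full
>     return md5_file($path1) === md5_file($path2);
> }
>
> --------8<------------------------------------------------------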
>
> It always depends on what kind of files you are dealing with; if the
> files often differ only at the end of the stream, you may want to skip
> step 2. But this is what I would generally do.
>
> By the way, md5 is a great hashing function, but it is not
> bullet-proof: collisions can happen (though it's much better than
> crc32, for example). So you may also want to think about how critical
> false positives are for you (files that md5_file considers equal when
> they in fact are not) and perhaps use some diff-like solution instead
> of md5_file. Anyway, having already compared sizes and random bytes
> (steps 1 through 3), it's very likely that md5_file will catch it if
> two files differ in just a few bytes.
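>
> If false positives are truly unacceptable, a chunked byte-by-byte
> comparison gives an exact answer without the memory cost of
> file_get_contents (again an untested sketch with a made-up name):
>
> --------8<------------------------------------------------------
>
> function files_identical_exact($path1, $path2) {
>     if (filesize($path1) !== filesize($path2)) {
>         return false;
>     }
>     $fp1 = fopen($path1, 'rb');
>     $fp2 = fopen($path2, 'rb');
>     if ($fp1 === false || $fp2 === false) {
>         return false;
>     }
>     $same = true;
>     while (!feof($fp1)) {
>         // Compare in 8 KB chunks so memory use stays constant
>         if (fread($fp1, 8192) !== fread($fp2, 8192)) {
>             $same = false;
>             break;
>         }
>     }
>     fclose($fp1);
>     fclose($fp2);
>     return $same;
> }
>
> --------8<------------------------------------------------------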
>

Agreed. In my first reply, I meant that hashes would likely be quicker
and more memory-friendly when handling larger files, but this is just a
hunch - I haven't benchmarked anything. It was really meant to give the
OP other possibilities to look into.

Edward


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

