Re: PATCH - Optimization to isomd5sum media check

Luciano Miguel Ferreira Rocha <strange@xxxxxxxxxxxxx> · Fri, 25 Mar 2005 22:57:51 +0000

On Fri, Mar 25, 2005 at 02:59:25PM -0500, Dustin kirkland wrote:
> Hi-
> 
> I'm submitting a patch that provides an optimization to the isomd5sum
> media check functionality.  Presently, the algorithm calculates a
> single md5sum after reading the entire disc, and reports a "PASS" or
> "FAIL" at that time.  When there is corrupted data on the disc, a user
> will not know until the entire disc has been read--which can take
> quite a while.  It would be optimal for the algorithm to exit the
> checking and report a failure as soon after reading the corrupted
> section as possible.
> 
> I've accomplished this by adding two extra fields to the application
> data header of the ISO image.  The ISO 9660 spec allows for a
> ~512-character string, of which were 224 characters are used.  I've
> added two fields, "FRAGMENT SUMS" and "FRAGMENT COUNT", after which
> 326 characters are used.
> 
> The "FRAGMENT COUNT" is an integer representing the number of
> fragments in which to break the entire ISO.  The "FRAGMENT SUMS" is an
> ascii string of concatenated leading characters of md5sums of
> subsequent fragments.
> 
> I've #define'd the length of FRAGMENT SUMS to 60 characters, as this
> number is evenly divisible by (2, 3, 4, 5, 6, 10, 12, 15, 20, 30). 
> I've also #define'd FRAGMENT COUNT to be 20, which means that ISO will
> be broken into 20 fragments, each fragment represented by a
> 3-character string (the first 3 characters of the fragment's md5sum)
> in FRAGMENT SUMS.
> 
> Looking at an example, the application data header used to look like:
> ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS =
> 15;RHLISOSTATUS=1;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
> 
> It now looks like:
> ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS =
> 15;RHLISOSTATUS=0;FRAGMENT SUMS =
> 67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4;FRAGMENT
> COUNT= 20;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
> 
> The string [67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4]
> is actually a concatenation of:
> md5(frag(0))[0-2] = 67b
> md5(frag(1))[0-2] = d32
> md5(frag(2))[0-2] = 63c
> md5(frag(3))[0-2] = ccf
> md5(frag(4))[0-2] = da2
> md5(frag(5))[0-2] = d85
> md5(frag(6))[0-2] = 1a4
> md5(frag(7))[0-2] = 93b
> md5(frag(8))[0-2] = 9a2
> md5(frag(9))[0-2] = 38e
> md5(frag(10))[0-2] = 887
> md5(frag(11))[0-2] = 7c4
> md5(frag(12))[0-2] = 5bc
> md5(frag(13))[0-2] = a78
> md5(frag(14))[0-2] = e27
> md5(frag(15))[0-2] = 6e9
> md5(frag(16))[0-2] = 443
> md5(frag(17))[0-2] = 7dd
> md5(frag(18))[0-2] = 9f2
> md5(frag(19))[0-2] = 6e4
> 
> Thus, if any one of these prescribed strings does not match what is
> calculated, the checking exits immediately, as a corruption has been
> detected.  In the cases where the error is near the beginning of the
> disc, the time to identify the problem is greatly reduced.
> 
> Note that there is a chance of collision.  When the disc is broken
> into 20 fragments, each fragment is represented by a 3-character
> string, yielding a collision 1 in 16^3=4096 cases.  Such a collision
> only means that the user will not benefit from the optimization, as
> the entire disc will be checked by the last operation just like it has
> always been done and the error will be detected at that time.  One can
> reduce the chances of collision by decreasing the number of fragments,
> or expanding the length of the FRAGMENT SUMS string.
> 
> This optimization occurs at virtually no performance cost.  Media
> checking is overwhelmingly I/O bound, reading from disc.  The
> additional calculations of smaller sums adds little  additional
> processing time, as most of that time was being spent waiting on new
> data from disc.
> 
> Please consider for inclusion in Anaconda, and I'll be glad to answer
> any questions or address concerns.  I think the patch is pretty
> simple, and isolated to two files.

I like it! Just one question: why aren't you using the intermediate
values of the full MD5?

Regards,
Luciano Rocha