On Fri, Mar 25, 2005 at 02:59:25PM -0500, Dustin kirkland wrote: > Hi- > > I'm submitting a patch that provides an optimization to the isomd5sum > media check functionality. Presently, the algorithm calculates a > single md5sum after reading the entire disc, and reports a "PASS" or > "FAIL" at that time. When there is corrupted data on the disc, a user > will not know until the entire disc has been read--which can take > quite a while. It would be optimal for the algorithm to exit the > checking and report a failure as soon after reading the corrupted > section as possible. > > I've accomplished this by adding two extra fields to the application > data header of the ISO image. The ISO 9660 spec allows for a > ~512-character string, of which were 224 characters are used. I've > added two fields, "FRAGMENT SUMS" and "FRAGMENT COUNT", after which > 326 characters are used. > > The "FRAGMENT COUNT" is an integer representing the number of > fragments in which to break the entire ISO. The "FRAGMENT SUMS" is an > ascii string of concatenated leading characters of md5sums of > subsequent fragments. > > I've #define'd the length of FRAGMENT SUMS to 60 characters, as this > number is evenly divisible by (2, 3, 4, 5, 6, 10, 12, 15, 20, 30). > I've also #define'd FRAGMENT COUNT to be 20, which means that ISO will > be broken into 20 fragments, each fragment represented by a > 3-character string (the first 3 characters of the fragment's md5sum) > in FRAGMENT SUMS. > > Looking at an example, the application data header used to look like: > ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS = > 15;RHLISOSTATUS=1;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!! > > It now looks like: > ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS = > 15;RHLISOSTATUS=0;FRAGMENT SUMS = > 67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4;FRAGMENT > COUNT= 20;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!! > > The string [67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4] > is actually a concatenation of: > md5(frag(0))[0-2] = 67b > md5(frag(1))[0-2] = d32 > md5(frag(2))[0-2] = 63c > md5(frag(3))[0-2] = ccf > md5(frag(4))[0-2] = da2 > md5(frag(5))[0-2] = d85 > md5(frag(6))[0-2] = 1a4 > md5(frag(7))[0-2] = 93b > md5(frag(8))[0-2] = 9a2 > md5(frag(9))[0-2] = 38e > md5(frag(10))[0-2] = 887 > md5(frag(11))[0-2] = 7c4 > md5(frag(12))[0-2] = 5bc > md5(frag(13))[0-2] = a78 > md5(frag(14))[0-2] = e27 > md5(frag(15))[0-2] = 6e9 > md5(frag(16))[0-2] = 443 > md5(frag(17))[0-2] = 7dd > md5(frag(18))[0-2] = 9f2 > md5(frag(19))[0-2] = 6e4 > > Thus, if any one of these prescribed strings does not match what is > calculated, the checking exits immediately, as a corruption has been > detected. In the cases where the error is near the beginning of the > disc, the time to identify the problem is greatly reduced. > > Note that there is a chance of collision. When the disc is broken > into 20 fragments, each fragment is represented by a 3-character > string, yielding a collision 1 in 16^3=4096 cases. Such a collision > only means that the user will not benefit from the optimization, as > the entire disc will be checked by the last operation just like it has > always been done and the error will be detected at that time. One can > reduce the chances of collision by decreasing the number of fragments, > or expanding the length of the FRAGMENT SUMS string. > > This optimization occurs at virtually no performance cost. Media > checking is overwhelmingly I/O bound, reading from disc. The > additional calculations of smaller sums adds little additional > processing time, as most of that time was being spent waiting on new > data from disc. > > Please consider for inclusion in Anaconda, and I'll be glad to answer > any questions or address concerns. I think the patch is pretty > simple, and isolated to two files. I like it! Just one question: why aren't you using the intermediate values of the full MD5? Regards, Luciano Rocha