In xfs_db, the metadump creation procedure optionally allows you to replace all file names in the dump with obfuscated versions of those names. This has to be done in such a way that the hash value associated with the obfuscated name is the same as the hash assigned to the original file. The algorithm that creates these obfuscated file names creates them by generating a random set of characters (using the full 8-bit range, only disallowing '/' and '\0') up to the last five. Along the way, the hash for the name is computed, and it--along with the hash for the original file--is used to determine what the last five characters need to be in order to match. There are some cases in this scheme, however, where files can't get an equivalent name using generate_obfuscated_name() as currently written (one of which Arkadiusz MiÅkiewicz found). Below I'm going to describe a bit more specifically how the hash works, and why some cases run into trouble. I also describe a way to adjust the algorithm to avoid this pitfall. XFS computes a filename hash by accumulating a 32-bit value by shifting (rotating) and XOR'ing each character in the name. - A 32-bit accumulator is initially zero-filled. Then with each character in the file name: - Rotate Left the contents of the 32-bit accumulator by 7 bits - XOR the next byte of the file name into the least-significant byte of the accumulator The algorithm for generating another filename having the same length and same hash starts by generating a string of random bytes (other than '\0' and '/') to replace each byte in the original name, for all but five characters in the name. The incremental hash for the new file name is computed as each byte is replaced. The last five characters are then selected such that they contribute to the hash exactly what is needed to produce a hash that matches the original name's hash. Note that no attempt is made to come up with an alternative name for names shorter than five characters. Arkadiusz MiÅkiewicz found that a file named "R\323\257NE" caused an attempt to create a metadump to hang. That 5-character file name, interpreted in hexadecimal, is: 0x52 d3 af 4e 45, and when run through the hashing algorithm the result is 0x3a4be740. Since there are only five characters in this name, no random characters are generated (so the hash value before selecting the last five characters is zero). To determine the last five characters, the accumulated hash (zero in this case) is XOR'd with the complete hash for the original filename. The resulting bits--after appropriate shifting--define exactly the bytes to use so the new file name has the same hash as the original. If one of those bytes happens to be illegal ('\0' or '/'), the process restarts in hopes a different random string of bytes will produce one that does work. The problem in this case is that there are no random characters in the name. So if any character dictated by the original hash is illegal, it will always be illegal and the process will repeat forever. That is the case here. In the "R\323\257NE" string, one of the characters that comes out as a necessary component of the obfuscated name is '/', which is not allowed. Hence the algorithm loops without terminating. So how do we fix this? We can tweak the algorithm a bit, based on a few observations, and the result will still generate filenames having the desired properties but will also avoid the infinite loop problem described above. Let's look at some details. Here is how the first (of the last five) byte is computed: newname[namelen - 4] = (newhash >> 21) & 0x7f; The hash for the problem name is 0x3a4b3740. The result after the shift is 0x1d2, and after the mask 0x52. That's an OK character. The next byte is computes like this: newname[namelen - 3] = (newhash >> 14) & 0x7f; For the problem name the shift gives 0xe92c, and after the mask it's 0x2f. That is NOT OK, it's the '/' character. And since there was never any randomization of characters involved, there's no chance that retrying the algorithm will come up with any other result. It turns out that this is just one of a whole class of file names that will run into this problem. I worked out an example of each of these. "\120\001\257\116\105", /* Byte 0 (first) must be zero */ "\122\001\257\116\105", /* Byte 1 must be zero */ "\122\323\200\116\105", /* Byte 2 must be zero */ "\122\323\254\001\305", /* Byte 3 must be zero */ "\122\323\247\116\005", /* Byte 4 (last) must be zero */ "\172\057\147\116\105", /* Byte 1 must be '/' */ "\122\320\057\116\105", /* Byte 2 must be '/' */ "\122\323\247\057\105", /* Byte 3 must be '/' */ "\002\323\247\116\057", /* Byte 4 (last) must be '/' */ (I didn't come up with one in which the first byte ends up having to be a '/'.) All of the above are 5-character names. It may be that with longer names the randomization of characters give the opportunity to avoid this. And if so, a very simple solution might be to just extend the length of file names that are not obfuscated from 4 to (say) 8. (But that probably means a LOT more files' names are put in the metadump without obfuscation.) First observation on the algorithm. Why is the high bit of each computed character masked off? The reason is that the high bit of each byte is XOR'd with the low bit of its neighbor, and masking the high bit off allows the low bit of the previous character to be selected without concern about the effect of the XOR. Second observation. Masking off the low bit could be used instead to achieve the same effect just described. Third observation. The two characters that are not allowed in a file name are '\0' and '/'. Their hex values are 0x00 and 0x2f. The current algorithm clears only the high bit to produce a new file name character, and that allows either of these two be the result of the mask. Clearing the low-order bit instead will never produce the '/' character. So if we can change the mask, we can eliminate half of the possible characters that will cause us problems. We are still left with the case that an important hunk of the original file's hash is 0x01, which produces 0x00 when the bottom bit is masked off. To address that we can flip the low bit and flip the corresponding high bit in the neighbor byte that it's XOR'd with, ensuring the result preserves the desired hash value. We can do this for any original hash value, and we can do this without any need to ever re-try generating a random name. I haven't determined yet whether there still might be a case that requires an invalid character to resolve. If we reach that point we can simply warn that the file is not getting obfuscated and leave the original file name as-is. I'll follow up later with a proposed change to implement what I have described here. -Alex PS Two more observations: - There is really no need for the characters to be truly random. Making the generated name unique and different from the original is sufficient. So (with the exception of the last five bytes) we can select the characters however we like. They could be a sequential series of names, for example, rather than computing a random value for each. - Similarly we could select (most of) the characters from a smaller subset than is currently used. I.e., rather than using any of 254 possible values, we could restrict it to just printable characters (or even a subset of those). The last five characters would be computed as above, and they would have to be unrestricted in order to produce the right hash. _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs