On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
>
>> * The tuning parameter number_of_restarts currently trades off space
>> (for the full refnames and the restart_offsets) against the need to
>> read and parse more ref_records to get the full refnames. ISTM that
>> this tradeoff could be made less painful by defining a blockwide
>> prefix that is omitted from the refnames as used in the restarts. So
>> the full refname would change from
>>
>>     this_name = prior_name[0..prefix_length] + suffix
>>
>> to
>>
>>     this_name = block_prefix + prior_name[0..prefix_length] + suffix
>>
>> I would expect this to allow more frequent restarts at lower space
>> cost.
>
> I've been on the fence about the value of this. It makes the search
> with restarts more difficult to implement, but does allow shrinking a
> handful of very popular prefixes like "refs/" and "refs/pulls/" in
> some blocks.
>
> An older format of reftable used only a block_prefix, and could not
> get nearly as good compression as too many blocks contained references
> with different prefixes.

I ran an experiment on my 866k ref data set. Using a block_prefix gets
less compression, and doesn't improve packing in the file. Given the
additional code complexity, it really isn't worth it:

  format            | size     | blocks    | avg ref/blk
  ------------------|----------|-----------|----------------
  original          | 28 M     | 443       | 1955
  block_prefix      | 29 M     | 464       | 1867

:-(
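
For anyone following along, here is a rough sketch of the two naming
rules quoted above. It is an illustration only, not the reftable
on-disk format: the function names, the (prefix_length, suffix) tuple
layout, the restart_interval parameter, and the choice of the longest
common prefix as block_prefix are all assumptions made for the example.
It also assumes prior_name is kept with block_prefix already stripped.

```python
import os


def common_prefix_len(a, b):
    """Length of the shared leading run of characters in a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def encode_block(names, restart_interval, use_block_prefix=False):
    """Encode sorted refnames as (prefix_length, suffix) records.

    Every restart_interval-th record is a restart: prefix_length is 0,
    so its suffix carries the whole name (minus block_prefix, if used).
    """
    # Hypothetical choice: block_prefix is what all names in the block share.
    block_prefix = os.path.commonprefix(names) if use_block_prefix else ""
    records = []
    prior = ""
    for i, name in enumerate(names):
        stripped = name[len(block_prefix):]
        if i % restart_interval == 0:
            plen = 0  # restart record: no dependence on the prior name
        else:
            plen = common_prefix_len(prior, stripped)
        records.append((plen, stripped[plen:]))
        prior = stripped
    return block_prefix, records


def decode_block(block_prefix, records):
    """this_name = block_prefix + prior_name[0..prefix_length] + suffix"""
    names = []
    prior = ""
    for plen, suffix in records:
        this = prior[:plen] + suffix
        names.append(block_prefix + this)
        prior = this
    return names


if __name__ == "__main__":
    refs = [
        "refs/heads/master",
        "refs/heads/next",
        "refs/pulls/1/head",
        "refs/tags/v1.0",
    ]
    for flag in (False, True):
        prefix, recs = encode_block(refs, restart_interval=2,
                                    use_block_prefix=flag)
        print(repr(prefix), recs)
        assert decode_block(prefix, recs) == refs
```

With use_block_prefix=False this is the "original" rule; with True,
restart records drop the shared leading part, which only helps when
every name in the block actually shares one.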