On 24/04/02 07:15PM, Patrick Steinhardt wrote: > On Tue, Apr 02, 2024 at 11:42:29AM -0500, Justin Tobler wrote: > > On 24/03/25 11:10AM, Patrick Steinhardt wrote: > > > When seeking a record in our block reader we perform a binary search > > > over the block's restart points so that we don't have to do a linear > > > scan over the whole block. The logic to do so is quite intricate though, > > > which makes it hard to understand. > > > > > > Improve documentation and rename some of the functions and variables so > > > that the code becomes easier to understand overall. This refactoring > > > should not result in any change in behaviour. > > > > > > Signed-off-by: Patrick Steinhardt <ps@xxxxxx> > > > --- > > ... > > > - i = binsearch(br->restart_count, &restart_key_less, &args); > > > + /* > > > + * Perform a binary search over the block's restart points, which > > > + * avoids doing a linear scan over the whole block. Like this, we > > > + * identify the section of the block that should contain our key. > > > + * > > > + * Note that we explicitly search for the first restart point _greater_ > > > + * than the sought-after record, not _greater or equal_ to it. In case > > > + * the sought-after record is located directly at the restart point we > > > + * would otherwise start doing the linear search at the preceding > > > + * restart point. While that works alright, we would end up scanning > > > + * too many record. > > > + */ > > > + i = binsearch(br->restart_count, &restart_needle_less, &args); > > > + > > > + /* > > > + * Now there are multiple cases: > > > + * > > > + * - `i == 0`: The wanted record must be contained before the first > > > + * restart point. We will thus start searching for the record in > > > + * that section after accounting for the header offset. > > > > To clarify my understanding, does a restart_offset not exist at the > > beginning of a block? I was originally under the impression that the > > first reset point would be at the beginning of the block (or just after > > the header). If this was the case, the first restart point being greater > > would indicate that the wanted record is not contained within the block. > > It wouldn't make much sense to include it as a restart point. A restart > point is a point where the prefix compression will be reset such that > the record at that point can be read without reading preceding records. > This is always implicitly true for the first record in a block as it is > never prefix-compressed. Consequently, writing a restart point for the > first record would be a waste of disk space. Thanks Patrick! Good to know :) >From Documentation/technical/reftable.txt: > As the first ref block shares the first file block with the file > header, all restart_offset in the first block are relative to the > start of the file (position 0), and include the file header. This > forces the first restart_offset to be 28. I'm assumming this is out-of-date. -Justin