On 5/23/23 7:06 PM, Eric Biggers wrote:
> On Tue, May 23, 2023 at 10:25:01PM +0000, Eric Biggers wrote:
>> On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote:
>>> On 5/23/23 6:06 PM, Eric Biggers wrote:
>>>> On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
>>>>> MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was
originally
>>>>> written by Austin Appleby and placed in the public domain. This
version has
>>>>> been modified to produce the same result on both big endian and
little
>>>>> endian processors, making it suitable for use in portable
persistent data.
>>>>>
>>>>> Signed-off-by: J. corwin Coburn <corwin@xxxxxxxxxx>
>>>>> ---
>>>>> drivers/md/dm-vdo/murmurhash3.c | 175
++++++++++++++++++++++++++++++++
>>>>> drivers/md/dm-vdo/murmurhash3.h | 15 +++
>>>>> 2 files changed, 190 insertions(+)
>>>>> create mode 100644 drivers/md/dm-vdo/murmurhash3.c
>>>>> create mode 100644 drivers/md/dm-vdo/murmurhash3.h
>>>>
>>>> Do we really need yet another hash algorithm?
>>>>
>>>> xxHash is another very fast non-cryptographic hash algorithm that
is already
>>>> available in the kernel (lib/xxhash.c).
>>>>
>>>> - Eric
>>>
>>> The main reason why vdo uses Murmur3 and not xxHash is that vdo has
been in
>>> deployment since 2013, and, if I am reading correctly, xxHash did
not have a
>>> 128 bit variant until 2019.
Before I dive into the more technical details of the index, let me
elaborate a bit as well. One of the key ideas in the design of vdo is
that the index is largely independent of the data store. The index is
used only for deduplication hints and is not required to read and
write data.
As such, switching hash algorithms is not out of the question. It
would be inconvenient for existing deployments as their existing
indexes would cease to work after an upgrade, and so they would lose
deduplication on new writes against data written before the
upgrade. However, there would be no loss of data. As such, if we
determine that avoiding this issue for existing deployments is not
worth the cost of supporting another hash algorithm, we are open to
making that change.
The on disk format of the data store portion of vdo is more amenable
to upgrade, so nothing is set in stone.
>>
>> Why do you need a 128-bit non-cryptographic hash algorithm? What
problem are
>> you trying to solve?
>
> To elaborate a bit: the reason this seems strange to me is that when
people say
> they need a 128-bit (or larger) non-cryptographic hash function,
usually they
> are either mistaken and 64-bit would suffice, or they actually need a
> cryptographic hash function.
>
> In this case, the hash function seems to be used for data
deduplication. This
> is unusual, since data deduplication normally requires a
cryptographic hash
> algorithm.
>
> I see that this is touched on by the following paragraph in
vdo-design.rst
> (though it incorrectly calls the hash an "identifier for the block"):
>
> Each block of data is hashed to produce a 16-byte block name
which serves as
> an identifier for the block. An index record consists of this
block name
> paired with the presumed location of that data on the underlying
storage.
> However, it is not possible to guarantee that the index is
accurate. Most
> often, this occurs because it is too costly to update the index
when a block
> is over-written or discarded. Doing so would require either
storing the
> block name along with the blocks, which is difficult to do
efficiently in
> block-based storage, or reading and rehashing each block before
overwriting
> it. Inaccuracy can also result from a hash collision where two
different
> blocks have the same name. In practice, this is extremely
unlikely, but
> because vdo does not use a cryptographic hash, a malicious
workload can be
> constructed. Because of these inaccuracies, vdo treats the
locations in the
> index as hints, and reads each indicated block to verify that it
is indeed a
> duplicate before sharing the existing block with a new one.
>
> So, dm-vdo handles hash collisions by reading back and verifying that
the data
> matches before allowing deduplication.
>
> That solves the security problem. However, it does seem strange, and
it's more
> complex than the usual solution of just using a cryptographic hash.
Note that
> cryptographic hashing is very fast on modern CPUs with modern algorithms.
Merely using a cryptographic hash doesn't solve the overwrite problem
(detailed in the paragraph you quote above). Since vdo has to do a
read-verify anyway, there is no benefit to using a stronger hash.
>
> So, some more details about the rationale for designing the data
deduplication
> in this (IMO unusual) way should be included.
>
> - Eric
>
The design of the index is intended to provide sufficient performance
without using too much memory. The index lookup for old records (those
not so new that they haven't been written out yet) is detailed in
vdo-design.rst:
In order to find records in older chapters, the index also
maintains a higher level structure called the volume index,
which contains entries mapping a block name to the chapter
containing its newest record. This mapping is updated as records
for the block name are copied or updated, ensuring that only the
newer record for a given block name is findable. Older records
for a block name can no longer be found even though they have
not been deleted. Like the chapter index, the volume index uses
only a subset of the block name as its key and can not
definitively say that a record exists for a name. It can only
say which chapter would contain the record if a record
exists. The volume index is stored entirely in memory and is
saved to storage only when the vdo target is shut down.
From the viewpoint of a request for a particular block name,
first it will look up the name in the volume index which will
indicate either that the record is new, or which chapter to
search. If the latter, the request looks up its name in the
chapter index to determine if the record is new, or which record
page to search. Finally, if not new, the request will look for
its record on the indicated record page. This process may
require up to two page reads per request (one for the chapter
index page and one for the request page). However, recently
accessed pages are cached so that these page reads can be
amortized across many block name requests.
The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of
storing the entire key (the block name) for each entry, the
entries are sorted by name and only the difference between
adjacent keys (the delta) is stored. Because we expect the
hashes to be evenly distributed, the size of the deltas follows
an exponential distribution. Because of this distribution, the
deltas are expressed in a Huffman code to take up even less
space. The entire sorted list of keys is called a delta
list. This structure allows the index to use many fewer bytes
per entry than a traditional hash table, but it is slightly more
expensive to look up entries, because a request must read every
entry in a delta list to add up the deltas in order to find the
record it needs. The delta index reduces this lookup cost by
splitting its key space into many sub-lists, each starting at a
fixed key value, so that each individual list is short.
Even with the compression inherent in the delta list structure,
keeping the entire delta for each entry in memory is too costly. The
multi-level structure of the index allows us to reduce the memory size
while still achieving acceptable performance. The volume index is
computed on only a portion of the full hash, greatly reducing the size
of each entry. Similarly, the chapter index is computed on a different
subset of the full hash thereby reducing the amount of data that needs
to be read and cached. These savings come at the cost of false
positives in both the volume index and the chapter index, each of
which result in spurious reads and hence reduced performance.
In order to bound the reading required for each lookup, when the
volume index needs to store multiple entries whose volume index bits
collide (but whose full hashed do not), all but the first entry are
stored not as deltas but as "collision records" which contain the full
128 bit hash. These records ensure that any lookup requires reading at
most one index page and one record page. However, collision records
consume much more memory than normal entries.
The number of bits used by the volume and chapter index are tuned to
balance memory against the false positive rate (note that the false
positive rate in the chapter index would be significantly higher if
the chapter index bits were to overlap with the volume index bits). We
have found that 8 bytes for the volume index and 6 for the chapter
index give good performance with an acceptable memory budget
(approximately 3 bytes per entry).
The index also supports a "sparse" mode which uses ten times as many
chapters as the default, "dense" mode with a volume index of the same
size. This mode leverages the temporal locality which is usually
present in workloads with high dedupe rates. Rather than keeping every
entry in the volume index, only 1/32 of entries in most of the index
are retained in the volume index. On lookup, if a hash is not found in
the volume index, the cached chapter indexes are also searched before
deciding the entry is new. 5 of the remainin bits of the hash are used
to select which entries are retained in the sparse portion of the
volume index.
corwin