Oops, it looks like my attachment disappeared. Anyway, here is a new
version of the patch and the hash benchmark I meant to attach before
(it had an older version of the modified SipHash).

Changes:
- Added SipHash. It is a header that is included twice, once for
  strings only and again for variable names; it is one function body
  whose behavior is modified by the preprocessor. I go into detail
  about the hash in the message. (I forgot to add that byte swapping
  is removed, because who cares in a hash table?)
- Added a check for alignment; we hash byte by byte if unaligned.
- Changed exec.c to use SipHash as well.
- Used getentropy(), falling back to /dev/urandom, then, as a last
  resort, mixing the time, cycle count, and seeded rand() calls (with
  XXH32_round from xxHash).
- Added checks for getentropy and sys/random.h to configure.ac.
- Renamed hash_alias to hash_string.
- Switched to power-of-2 hash tables, because SipHash is good enough
  to warrant it.

How is performance on your side? My computer and my phone are pretty
old (Sandy Bridge and Cortex-A15), and I would like to know how this
performs on other platforms.

On Sun, Mar 3, 2019 at 1:24 AM Devin Hussey <husseydevin@xxxxxxxxx> wrote:
>
> Actually, screw that hash. I think we should use SipHash13 for 64-bit
> and HalfSipHash13 for 32-bit. SipHash is designed for strong hash
> tables (unlike City64/Murmur2/3, which can be cracked with math just
> like this one), is used in the kernel and elsewhere, and is fast and
> secure.
>
> While SipHash13 isn't as secure, it is very fast and still strong
> enough.
>
> Basically, what I did was modify the hash functions to check for
> '\0' or '=' by comparing 4 bytes at a time with the strlen
> bit-twiddling method from Hacker's Delight, which is also used by
> musl and glibc in their strchrnul implementations.
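The word-at-a-time '\0'/'=' check quoted above can be sketched as
follows. This is an editor's illustration, not code from the patch:
the helper names (`has_zero_byte`, `has_byte`, `scan_len`) are
invented, and the `memcpy` word loads may read up to 3 bytes past the
terminator, which is exactly the out-of-bounds concern discussed later
in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Broadcast a byte to all four lanes of a 32-bit word. */
#define REPEAT4(c) ((uint32_t)0x01010101 * (uint8_t)(c))

/* Hacker's Delight "haszero": nonzero iff some byte of v is zero.
 * This form has no false positives or negatives. */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Nonzero iff some byte of v equals c (XOR maps that byte to zero). */
static uint32_t has_byte(uint32_t v, unsigned char c)
{
    return has_zero_byte(v ^ REPEAT4(c));
}

/* Length of s up to the first '\0' or '=', scanning 4 bytes at a
 * time; memcpy keeps the word loads legal on strict-alignment ABIs.
 * NOTE: may read up to 3 bytes past the terminator. */
static size_t scan_len(const char *s)
{
    const char *p = s;
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);
        if (has_zero_byte(w) || has_byte(w, '='))
            break;
        p += 4;
    }
    /* The hit is somewhere in this word; finish byte by byte. */
    while (*p != '\0' && *p != '=')
        p++;
    return (size_t)(p - s);
}
```

The same presence test works regardless of endianness, which is why
the patch can drop byte swapping for hash-table use.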
> (I was originally thinking xxHash, which outperforms it by a lot, but
> xxHash reads 16 bytes at a time, which is difficult to check
> dynamically without interrupting the flow. It is also a much larger
> hash.)
>
> I provide an alternative method that reads one byte at a time instead
> of reading chunks with memcpy, because the chunked version technically
> reads out of bounds; we could give malloc a bit more space to be safe.
> Generally, though, it isn't a huge deal as long as the read doesn't
> cross a page boundary. To be safe, I think I should check alignment
> and use the byte-by-byte path when unaligned, because an unaligned
> out-of-bounds read has a chance of crossing a page boundary. That
> will likely impact performance a bit, however.
>
> The speed difference on x86_64 is actually quite reasonable. On
> average it is only about 40% slower than the add hash. (However, when
> compiled for i386 targeting Pentium 4, HalfSipHash is twice as slow,
> and SipHash is stupidly slow, as expected.) It is about the same as
> the one I just mentioned. Any speed loss from the hash function
> itself should be countered by the better distribution a better hash
> function gives. Both versions save about a second and a half on
> dash's configure script versus the add hash, but that is probably
> mostly due to the significantly larger table size.
>
> My benchmark reads a given file line by line, then adds up the sums
> of the hashes for each line 100 times. I attached my benchmark
> program. Define CHECK_FOR_EQUALS to add checks for '=' (for
> hash_var), and ONE_AT_A_TIME to read only one byte at a time instead
> of 4-8.
>
> I checked a few strchrnul implementations, and the check-as-you-go
> approach is clearly faster, despite somewhat complicating the
> control flow.
>
> I am working on making the implementations macro-controlled, so we
> can keep things tidy.
>
> As for resizing, do you think we need that? I think we should work
> on this first and then consider it.
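The alignment gate described above might look like this minimal
sketch (the helper name is illustrative, not from the patch). The
rationale: pages begin at addresses that are multiples of the word
size, so a word-aligned 4-byte load can never straddle a page
boundary, while an unaligned over-read past the end of a string might
fault on the next page.

```c
#include <stdint.h>

/* Hypothetical helper: true iff p is suitably aligned for
 * 4-bytes-at-a-time loads. */
static int word_aligned(const void *p)
{
    return ((uintptr_t)p & (sizeof(uint32_t) - 1)) == 0;
}

/* Intended use in the hash loop (sketch):
 *
 *     if (word_aligned(s))
 *         ... word-at-a-time fast path ...
 *     else
 *         ... byte-by-byte safe path ...
 */
```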
> Because simply doing this would make dash faster and much more
> secure against DoS attacks.
>
> Side note: I just realized that I missed the third usage of the add
> hash in exec.c.
>
> On Fri, Mar 1, 2019 at 3:47 AM Adam Borowski <kilobyte@xxxxxxxxxx> wrote:
> >
> > On Thu, Feb 28, 2019 at 11:42:41PM -0500, Devin Hussey wrote:
> > > This is a new patch to follow up to my non-cryptographic hash
> > > patch that is a little higher quality and has a PoC DoS script.
> > >
> > > 3. Seeding at startup. I try to read from /dev/urandom and fall
> > > back to using clock(). This serves as the seed for the hash
> > > functions, and makes things much less predictable so it can't
> > > just be shut down with a static script.
> >
> > On any modern platform, please use getentropy() instead. It does
> > the same thing as reading from /dev/urandom, yet is:
> > * faster (dash is usually preferred over bash because of speed, and
> >   a file open is slow enough to be noticeable)
> > * immune to file exhaustion attacks
> >
> > Meow!
> > --
> > ⢀⣴⠾⠻⢶⣦⠀
> > ⣾⠁⢠⠒⠀⣿⡁
> > ⢿⡄⠘⠷⠚⠋⠀ Have you accepted Khorne as your lord and saviour?
> > ⠈⠳⣄⠀⠀⠀⠀
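The seeding cascade from the changelog (getentropy(), then
/dev/urandom, then a weak time-based mix) can be sketched roughly as
below. This is an illustration, not the patch itself: the function
name is invented, HAVE_GETENTROPY stands in for the configure.ac
check, and the real patch mixes the last-resort entropy with
XXH32_round from xxHash rather than the plain rand() fill shown here.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#ifdef HAVE_GETENTROPY          /* assumed result of the configure.ac check */
#include <sys/random.h>
#endif

/* Fill buf with len bytes of seed material, trying the strongest
 * source first.  Returns 0 on a good source, -1 on the weak fallback. */
static int get_hash_seed(void *buf, size_t len)
{
#ifdef HAVE_GETENTROPY
    /* Preferred: no file open (faster), immune to fd exhaustion. */
    if (getentropy(buf, len) == 0)
        return 0;
#endif
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd >= 0) {
        ssize_t n = read(fd, buf, len);
        close(fd);
        if (n == (ssize_t)len)
            return 0;
    }
    /* Last resort: predictable, but better than a fixed seed.  The
     * real patch mixes these inputs with XXH32_round instead. */
    srand((unsigned)time(NULL) ^ (unsigned)clock());
    unsigned char *p = buf;
    for (size_t i = 0; i < len; i++)
        p[i] = (unsigned char)rand();
    return -1;
}
```

Even the weak fallback keeps the seed per-process, so a static attack
script cannot precompute colliding names.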
Attachments:
- 0001-Use-larger-tables-seeding.-and-a-better-hash-functio.patch
- hash_bench.cpp