Re: long object names

Colin McCabe <cmccabe@xxxxxxxxxxxxxx> · Fri, 22 Apr 2011 10:36:07 -0700

On Fri, Apr 22, 2011 at 8:44 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> Few things:
>
> - I think the xattr approach is always going to be faster.  xattrs are
> stored adjacent to the inode in the btree, while creating intervening
> directories means a new inode is allocated, seeked to, and loaded, and
> _then_ the directory content is looked up in another part of the btree
> before the final inode is located.  For each level you add two seeks
> (although in the common case, at least, those inodes will be close by).

Fair enough.

> - You can't make intervening directories both rare (long) and useful for
> prefix search (short) unless you really think people will be searching on
> 100+ character prefixes.

Earlier I suggested making the directory name length configurable, so
that we could tune it to a short value on the cluster backing rgw, but
a long value elsewhere.
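
To make the tradeoff concrete, here is a minimal sketch in Python of
splitting an object name into path components with a tunable length. The
function name split_object_name and the fixed-width chunking rule are
hypothetical, made up for illustration; nothing here is taken from the
actual filestore code.

```python
def split_object_name(name: str, component_len: int) -> list[str]:
    """Split name into path components of at most component_len characters.

    A small component_len yields many short directories (good for prefix
    search); a large one means most names need no intervening directory.
    """
    if component_len <= 0:
        raise ValueError("component_len must be positive")
    return [name[i:i + component_len]
            for i in range(0, len(name), component_len)]


# Short setting: every few characters becomes a directory level.
short_parts = split_object_name("bucket/photos/2011/img.jpg", 8)

# Long setting: a typical name fits in one component, so no extra levels.
long_parts = split_object_name("bucket/photos/2011/img.jpg", 200)
```

With the short setting each extra component costs the additional lookups
Sage describes above, which is why a single value cannot serve both goals.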

> - Hash collisions will be rare for all but our test cases.  If we only
> hash for long filenames (say, 200+ characters) that means someone has to
> find a SHA-256 collision (has anybody??).  And even then they only turn 1
> stat into 2.  Only if someone can generate an arbitrary number of inputs
> that hash to the same value do they get anywhere.  I don't think that's
> something we should worry about.  If someone breaks a crypto hash there
> are much bigger things to worry about.  (Even if we are super paranoid,
> then just use sha(name + sha(name)).)

A good guide to choosing a crypto hash: http://valerieaurora.org/hash.html
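
For reference, the scheme Sage describes could be sketched roughly as
follows in Python. The 200-character threshold comes from his example;
the function names, the UTF-8 encoding, and the hex output are
assumptions for illustration and not what the filestore actually does.

```python
import hashlib

# Assumed threshold, taken from the "say, 200+ characters" example.
LONG_NAME_THRESHOLD = 200


def on_disk_name(name: str) -> str:
    """Return the name unchanged if short, else its SHA-256 hex digest."""
    if len(name) < LONG_NAME_THRESHOLD:
        return name
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def paranoid_digest(name: str) -> str:
    """The 'super paranoid' variant: sha(name + sha(name))."""
    raw = name.encode("utf-8")
    inner = hashlib.sha256(raw).digest()
    return hashlib.sha256(raw + inner).hexdigest()
```

A SHA-256 hex digest is always 64 characters, so the hashed name stays
well under the usual 255-byte filename limit no matter how long the
object name is.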

> - We can easily wrap the non-fast path with a mutex to avoid the races
> (because, again, collisions are vanishingly rare except in our test
> cases).

I believe that all these operations are already done under the PG
lock. So there are no race conditions in normal operation. TV is
talking about a case where there has been a crash and we're resuming
from some intermediate state. Based on our earlier discussion, perhaps
this is not a problem on btrfs because of the snapshotting mechanism?
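
To illustrate the mutex idea from the quoted text, here is a minimal
Python sketch of a slow path that resolves collisions under a single
lock. Everything in it is an assumption for illustration: the dict
stands in for a directory plus the xattr holding the full name, and the
"_<index>" suffix, the function names, and the injectable hash function
are hypothetical. It covers only concurrent callers, not the crash and
resume case.

```python
import hashlib
import threading

# One lock for all long-name creates; contention is fine because this
# path is expected to be rare.
_slow_path_lock = threading.Lock()


def sha256_hex(name: str) -> str:
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def create_long_name(files: dict, name: str, hash_fn=sha256_hex) -> str:
    """Pick an on-disk name for a long object name.

    files maps on-disk name -> full object name (a stand-in for the
    directory entry and its xattr). Returns the on-disk name chosen.
    """
    digest = hash_fn(name)
    with _slow_path_lock:
        index = 0
        while True:
            candidate = f"{digest}_{index}"
            owner = files.get(candidate)
            if owner is None:
                files[candidate] = name  # free slot: claim it
                return candidate
            if owner == name:
                return candidate  # already created earlier
            index += 1  # collision with a different name: probe next slot
```

Without a collision the loop exits on the first probe, so a collision
only turns one lookup into two, as noted above.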

cheers,
Colin