Re: long object names

Colin McCabe <cmccabe@xxxxxxxxxxxxxx> · Thu, 21 Apr 2011 14:44:50 -0700

On Thu, Apr 21, 2011 at 2:23 PM, Yehuda Sadeh Weinraub
<yehudasa@xxxxxxxxx> wrote:
> On Thu, Apr 21, 2011 at 2:09 PM, Colin McCabe <cmccabe@xxxxxxxxxxxxxx> wrote:
>> On Thu, Apr 21, 2011 at 1:03 PM, Gregory Farnum
>> <gregory.farnum@xxxxxxxxxxxxx> wrote:
>>> I really don't see how pushing the naming complexity into the local filesystem,
>>> where it adds lots of otherwise-useless inodes and dentries, is going to help us.
>>
>> Here is a quick summary of how TV's proposal would help us.
>> 1. It avoids collisions entirely.
>> 2. You never have to do an extra xattr lookup, no matter how short
>> or long the object name is.
>
> Yeah, but you read more directories. Note that btrfs stores the xattrs
> on the directories, so reading those xattrs will have a lower IO
> impact than traversing directories recursively.

It does seem like btrfs' extended attribute implementation is fairly
efficient. But Linux's dentry cache (dcache) is also pretty efficient.

TV's approach involves fewer syscalls and no collision-handling loop.
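
A rough in-memory model can make the "fewer syscalls, no loop" point concrete. It is only a sketch under stated assumptions: the nested layout (fixed-size chunks as directories, tagged 'd' and 'f') is one possible reading of TV's idea, the `<hash>_<n>` probing is one hypothetical way for the hash + xattr scheme to handle collisions, and every name in it (chunked_components, hashed_store, hashed_lookup, and the dict standing in for files plus their full-name xattr) is made up for illustration.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple

NAME_MAX = 255        # typical Linux limit on one path component
CHUNK = NAME_MAX - 1  # leaves room for the one-character 'd'/'f' tag


def escape(name: str) -> str:
    # '/' cannot appear inside a path component; escape it (and backslash).
    return name.replace("\\", "\\\\").replace("/", "\\s")


def chunked_components(name: str) -> List[str]:
    """Hypothetical nested-directory layout for an object name.

    Directory components carry exactly CHUNK characters and start with 'd';
    the final component carries the remaining 1..CHUNK characters and starts
    with 'f', so a file never shares a name with a directory. Distinct names
    give distinct paths: no collision loop and no xattr read. Lengths are
    counted in characters, so ASCII names are assumed.
    """
    esc = escape(name)
    n_dirs = max(0, len(esc) - 1) // CHUNK
    dirs = ["d" + esc[i * CHUNK:(i + 1) * CHUNK] for i in range(n_dirs)]
    return dirs + ["f" + esc[n_dirs * CHUNK:]]


def sha1_hex(name: str) -> str:
    return hashlib.sha1(name.encode()).hexdigest()


Digest = Callable[[str], str]


def hashed_store(fs: Dict[str, str], name: str, digest: Digest = sha1_hex) -> str:
    """Hypothetical hash + xattr layout; fs maps filename -> full-name xattr."""
    i = 0
    while True:
        candidate = f"{digest(name)}_{i}"
        if fs.get(candidate, name) == name:  # free slot, or already this name
            fs[candidate] = name             # create file + set xattr
            return candidate
        i += 1                               # collision: probe the next slot


def hashed_lookup(fs: Dict[str, str], name: str,
                  digest: Digest = sha1_hex) -> Tuple[Optional[str], int]:
    """Return (filename or None, number of xattr reads the lookup needed)."""
    reads, i = 0, 0
    while True:
        candidate = f"{digest(name)}_{i}"
        if candidate not in fs:   # stat fails: the name is not stored
            return None, reads
        reads += 1                # getxattr on the candidate file
        if fs[candidate] == name:
            return candidate, reads
        i += 1
```

With a digest that always collides, such as `lambda s: "h"`, a second long name is stored as `h_1` and looking it up takes two xattr reads, whereas `chunked_components` never probes. The model ignores deletion, which would need care in the hash scheme because a hole ends the probe chain.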

I also wonder how xattr performance holds up on ext3/4 these days.
I think benchmarks would be needed to really settle this question. I'm
almost tempted to write one...
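
A starting point for such a benchmark might be the following Python sketch. It is Linux-oriented (the xattr half uses os.setxattr/os.getxattr and is skipped where user xattrs are unsupported), and the helper names, file names and the user.full_name attribute are hypothetical. It compares stat() at the bottom of a few nested directories, roughly a TV-style lookup, with stat() plus getxattr() on a flat file, roughly a hash + xattr lookup. It only measures the warm-cache case; since the concern above is IO impact, cold-cache runs (e.g. after writing 3 to /proc/sys/vm/drop_caches as root) on btrfs and ext3/4 mounts would be needed before drawing conclusions.

```python
import os
import tempfile
import time


def per_call(fn, n):
    """Average wall-clock seconds per call of fn over n calls."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n


def bench(root, depth=4, n=2000):
    """Warm-cache comparison of two hypothetical lookup paths under root.

    nested: stat() a file at the bottom of `depth` directories (dcache walk).
    hashed: stat() a flat file, then getxattr() to confirm its full name.
    Returns seconds per lookup; 'hashed' is None where user xattrs are
    unsupported (non-Linux, or a filesystem that rejects user.* xattrs).
    """
    nested_dir = os.path.join(root, *(f"d{i}" for i in range(depth)))
    os.makedirs(nested_dir)
    nested_file = os.path.join(nested_dir, "obj")
    flat_file = os.path.join(root, "hashed_0")
    for path in (nested_file, flat_file):
        with open(path, "w"):
            pass

    results = {"nested": per_call(lambda: os.stat(nested_file), n)}
    try:
        os.setxattr(flat_file, "user.full_name", b"x" * 512)
        results["hashed"] = per_call(
            lambda: (os.stat(flat_file),
                     os.getxattr(flat_file, "user.full_name")), n)
    except (AttributeError, OSError):
        results["hashed"] = None
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(bench(tmp))
```

Pointing `root` at an empty directory on each filesystem under test, and varying `depth`, would give a first rough comparison.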

sincerely,
Colin

>
>>
>> My add-on proposal helps us:
>> 3. get reasonable prefix search performance (with those supposedly
>> "useless" dentries)
>>
>>> I like what Yehuda has here for its relative simplicity -- though I think we should just up
>>> the hash size enough that we don't need to handle collisions,
>>
>> Personally, I think the xattr proposal is more complex. I guess that
>> is a matter of taste.
>>
>> No matter how big you make the hash, there will still be collisions!
>> That is the nature of hashing. And since the code is open source, it's
>> pretty easy for an attacker to read the source and then create two
>> objects whose names collide.
>
> Sure there will be, and the code should handle them. With a good hashing
> scheme, collisions will be pretty rare.
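
For a sense of scale on how rare "pretty rare" is, the usual birthday approximation can be computed directly. This assumes the hash behaves like a uniformly random function, so it says nothing about a deliberate attacker against a weak or non-cryptographic hash, which is the concern raised above; for an ideal b-bit hash, even a brute-force collision search needs on the order of 2^(b/2) attempts. The function name is hypothetical.

```python
import math


def collision_probability(n_names: int, hash_bits: int) -> float:
    """Birthday approximation: chance that any two of n_names distinct names
    share a hash value, assuming hash_bits-bit outputs that behave like
    uniformly random values.

        p ~= 1 - exp(-n(n - 1) / (2 * 2**b))
    """
    pairs = n_names * (n_names - 1) / 2
    # -expm1(-x) == 1 - exp(-x), but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, collision_probability(10**9, bits))
```

With 10**9 names, a 64-bit hash gives roughly a 2.7% chance of at least one collision and a 128-bit hash about 1.5e-21: collisions never become impossible, but they become very unlikely as the hash grows.
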
>
>>
>> So far, the only disadvantage of TV's scheme that has been pointed out
>> is that it creates extra dentries. But those extra dentries only
>> affect long object names, not the ones that (for example) the Ceph FS
>> creates. Also, when long object names occur in S3, they don't tend to
>> come out of the blue. They come about because the organization has a
>> sort of directory structure like this:
>>
>> foocorp/business_data/business_reports/year_2008/input/foo
>> foocorp/business_data/business_reports/year_2008/input/bar
>>
>> Of course we "know" that there are no such things as directories in
>> S3. But people like to structure their object names as if there were.
>> In cases like that, TV's scheme only incurs the cost of creating the
>> extra dentries once per long prefix.
>>
> As I said above, in most cases reading xattrs should be more efficient.
>
>
> Yehuda
>
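
The "once per long prefix" argument, and the prefix search from the add-on proposal, can be illustrated with a small in-memory tree. This is a sketch under assumptions: the layout (fixed-size chunks as directories, tagged 'd' and 'f'), the class name NestedStore and the 16-character chunk size used with it are all hypothetical, chosen only so that the short foocorp example names above actually get split; it is not necessarily how either proposal would lay things out.

```python
from typing import Dict, Iterator, List, Optional


class NestedStore:
    """Toy in-memory model of a hypothetical nested-directory layout.

    A name is cut into chunk-sized directory components tagged 'd' plus a
    final file component tagged 'f'. A real layout would use a chunk size
    near NAME_MAX and would also escape '/' inside components; both are
    skipped here so the example names stay readable.
    """

    def __init__(self, chunk: int) -> None:
        self.chunk = chunk
        self.root: Dict[str, Optional[dict]] = {}  # dir -> dict, file -> None

    def _split(self, name: str) -> List[str]:
        c = self.chunk
        n_dirs = max(0, len(name) - 1) // c
        dirs = ["d" + name[i * c:(i + 1) * c] for i in range(n_dirs)]
        return dirs + ["f" + name[n_dirs * c:]]

    def put(self, name: str) -> int:
        """Store name; return how many new directories (dentries) it created."""
        *dirs, leaf = self._split(name)
        node, created = self.root, 0
        for d in dirs:
            if d not in node:
                node[d] = {}
                created += 1
            node = node[d]
        node[leaf] = None
        return created

    def _walk(self, node: Dict[str, Optional[dict]], acc: str) -> Iterator[str]:
        for comp, child in node.items():
            if child is None:
                yield acc + comp[1:]  # file: rebuild the full name
            else:
                yield from self._walk(child, acc + comp[1:])

    def prefix_search(self, prefix: str) -> List[str]:
        """Names starting with prefix, scanning only the matching subtree."""
        c = self.chunk
        # Every stored name with this prefix has at least k leading
        # directory components that lie entirely inside the prefix.
        k = max(0, len(prefix) - 1) // c
        node: Optional[dict] = self.root
        for i in range(k):
            node = node.get("d" + prefix[i * c:(i + 1) * c])
            if node is None:
                return []
        acc = prefix[:k * c]
        return sorted(n for n in self._walk(node, acc) if n.startswith(prefix))
```

With `NestedStore(chunk=16)`, storing `.../year_2008/input/foo` creates three directories, storing `.../year_2008/input/bar` afterwards creates none, and a search for the `.../year_2008/` prefix descends three levels before listing anything.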