Re: subdir quotas

Hi Sage,

On Thu, Jun 3, 2010 at 3:48 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 2 Jun 2010, Henry C Chang wrote:
>> Yes, I am interested and that's what I am doing right now.
>> In fact, we have a clone of ceph on github, and already have a "quick"
>> implementation. You can get it from:
>>
>> http://github.com/tcloud/ceph/tree/folder-quota
>> http://github.com/tcloud/ceph-client-standalone/tree/folder-quota
>
> Oh, cool.  I'll take a look at this today.
>
>> To allow switching quota enforcement on/off, we added options on both the
>> client and server sides. To enable folder quota, you need to mount ceph with
>> "-o folder_quota=1" on the client side. On the server side, you need to add
>> "folder quota = 1" to the global section of the ceph config file. We also
>> implemented a tool to set/unset/get/list quota limits on folders.
>>
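
(For concreteness, enabling it looks roughly like this; the monitor address
and mount point below are just placeholders, and I'm leaving out the quota
tool's usage here:)

  # client side: mount the kernel client with the new option
  mount -t ceph 192.168.0.1:/ /mnt/ceph -o folder_quota=1

  # server side, in ceph.conf:
  [global]
          folder quota = 1
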
>> To enforce the quota more precisely, however, our implementation sacrifices
>>    write throughput and introduces more traffic:
>>
>> 1. We modified the max_size request-reply behaviour between client and mds.
>>    Our client requests a new max_size only when endoff > max_size. (i.e., it
>>    will not pre-request a larger max_size as it approaches the current max_size.)
>>
>> 2. Our client requests a constant 4 MB (the object size) every time. This
>>    degrades the throughput significantly. (It used to request more and more.)
>
> Is this just to reduce the amount by which we might overshoot?  I would
> try to make it a tunable, maybe ('max size slop' or something) so that it
> preserves the current doubling logic but caps it at some value, so the
> admin can trade throughput vs quota precision.  And/or we can also
> dynamically reduce that window as the user approaches the limit.

Great! Yes, but if there are multiple clients writing to one subtree
concurrently, it is a bit difficult to tell whether we are approaching the
limit; we would need to know how many clients are writing to the same subtree.
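
Just to check that I understand the 'max size slop' idea, a rough sketch of
what I have in mind is below. This is not real ceph code; quota_limit, rbytes
and max_size_slop are made-up names for the illustration:

  #include <algorithm>
  #include <stdint.h>

  uint64_t next_max_size(uint64_t cur_max_size,  // currently granted max_size
                         uint64_t endoff,        // end offset of the pending write
                         uint64_t rbytes,        // recursive usage under the quota root
                         uint64_t quota_limit,   // configured quota (0 = no quota)
                         uint64_t max_size_slop) // admin cap on one growth step
  {
    // keep the existing doubling behaviour, but never grow by more than
    // 'max size slop' in a single step
    uint64_t step = cur_max_size ? cur_max_size : ((uint64_t)4 << 20);
    step = std::min(step, max_size_slop);

    // shrink the window as the subtree approaches its quota, so we cannot
    // overshoot by more than what is actually left
    if (quota_limit) {
      uint64_t remaining = (quota_limit > rbytes) ? quota_limit - rbytes : 0;
      step = std::min(step, remaining);
    }

    // always grant at least enough to cover the write that triggered the request
    return std::max(cur_max_size + step, endoff);
  }

Setting max_size_slop to the object size would roughly reproduce what our
patch does now, while a larger value would give back most of the throughput.
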

>
>> Anyway, this is just the initial implementation. I will take your comments into
>> consideration and try to revise it. Of course, I will
>> need your help on the rstat propagation issue because I have no clue right now
>> and have to dig into the mds source code more to understand the existing
>> implementation. :)
>
> Sure.
>
>> A few questions about ceph testing:
>> - When will a subtree be fragmented?
>> - Can I force a subtree to be fragmented to facilitate testing?
>
> By default the load balancer goes every 30 seconds.  You can turn on mds
> 'thrashing' that will export random directories to random nodes (to stress
> test the migration), but that is probably overkill.
>
> It would probably be best to add something to MDS.cc's handle_command that
> lets the admin explicitly initiate a subtree migration, via something like
>
>  $ ceph mds tell 0 export_dir /foo/bar 2    # send /foo/bar from mds0 to 2
>
> I just pushed something to do that to unstable... let me know if you run
> into problems with it.
>

The export_dir command works well and gives us a convenient way to test
multi-mds scenarios. Not surprisingly, our current implementation does not
work in a multi-mds environment... :)

My test setup:
Under the mount point, I created /volume, /volume/aaa, and /volume/bbb.
    mds0 is authoritative for /volume and /volume/aaa.
    mds1 is authoritative for /volume/bbb.
A 250M quota is set on /volume.
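
For reference, I set this up roughly as follows (assuming the client is
mounted at /mnt/ceph; the 250M quota on /volume is set with our quota tool,
whose syntax I'm omitting here):

 $ mkdir -p /mnt/ceph/volume/aaa /mnt/ceph/volume/bbb
 $ ceph mds tell 0 export_dir /volume/bbb 1    # push /volume/bbb to mds1
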

Test case 0: pass
cp 100M file to /volume/aaa/a0
cp 100M file to /volume/aaa/a1
cp 100M file to /volume/aaa/a2  ==> quota exceeded error is expected here

Test case 1: pass
cp 100M file to /volume/bbb/b0
cp 100M file to /volume/bbb/b1
cp 100M file to /volume/aaa/a1  ==> quota exceeded error is expected here

Test case 2: failed
cp 100M file to /volume/bbb/b0
cp 100M file to /volume/bbb/b1
cp 100M file to /volume/bbb/b2  ==> quota exceeded error is expected here

It seems that the rstat can be propagated up (from mds1 to mds0) quickly (case 1);
however, the ancestor replica (/volume) on mds1 is not updated (case 2).
I wonder how/when the replicas get updated. I'm still digging through the
source code to find out where. :(

Henry

