On Fri, 4 Jun 2010, Henry C Chang wrote:
> > preserves the current doubling logic but caps it at some value, so the
> > admin can trade throughput vs quota precision. And/or we can also make
> > it dynamically reduce that window as the user approaches the limit.
>
> Yes. But if there are multiple clients writing to one subtree
> concurrently, it is a little bit difficult to say whether we are
> approaching the limit... we need to know how many clients are writing
> to the same subtree...

I would suggest some sort of recursive 'nested_max_size_diff' accounting
on each mds that works similarly to the nested_* values in CInode etc.
Basically, fix up the cache link/unlink methods in CDir to adjust the
recursive counts (like the anchor and auth_pin counters), and define some
specific rules like:

 - max_size_diff for any given node is max_size - size (if max_size > size)
 - nested_max_size_diff is max_size_diff + the sum over children, if the
   child does not have its own recursive quota set

The accounting can initially be done locally on each mds, which means the
extent to which clients can exceed their quota before getting shut down
would increase if the quota subtree spans multiple mds's.  (I don't think
that's a big deal, personally, but it depends on how strict you want to
be.  Later we could devise some additional mechanism that allocates the
remaining quota space among the nodes the region spans, or something.)

Maybe a similar counter could simply count how many open files are
contributing to that sum, so you can tell a bit more about how that
available space should be distributed...?

> The export_dir command is working well, and gives us a convenient way
> to test multi-mds scenarios.  Not surprisingly, our current
> implementation is not working in a multi-mds environment... :)
>
> My test setup:
> Under the mount point, I created /volume, /volume/aaa, /volume/bbb.
> mds0 is authoritative for /volume and /volume/aaa.
> mds1 is authoritative for /volume/bbb.
> Quota is set on /volume: 250M
>
> Test case 0: pass
>   cp 100M file to /volume/aaa/a0
>   cp 100M file to /volume/aaa/a1
>   cp 100M file to /volume/aaa/a2  ==> quota exceeded error is expected here
>
> Test case 1: pass
>   cp 100M file to /volume/bbb/b0
>   cp 100M file to /volume/bbb/b1
>   cp 100M file to /volume/aaa/a1  ==> quota exceeded error is expected here
>
> Test case 2: failed
>   cp 100M file to /volume/bbb/b0
>   cp 100M file to /volume/bbb/b1
>   cp 100M file to /volume/bbb/b2  ==> quota exceeded error is expected here
>
> It seems that rstats can be propagated up (from mds1 to mds0) quickly
> (case 1); however, the ancestor replica (/volume) on mds1 is not
> updated (case 2).  I wonder how/when the replicas get updated.  I'm
> still digging through the source code to find where. :(

There is a Locker::scatter_nudge() function that periodically twiddles
the lock state on the subtree boundary so that the rstat information
propagates between nodes.  There is an interval in g_conf that controls
how often that happens...

sage
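
To make the proposed accounting concrete, here is a minimal standalone
sketch of the two rules above, plus the suggested open-file counter.  To
be clear about what is invented: Node, propagate_up(), and the field
names (other than the nested_max_size_diff idea itself) are hypothetical
stand-ins for CInode/CDir state, and a real implementation would do
incremental +/- adjustments in CDir's link/unlink paths rather than the
full recompute shown here.

    // Standalone sketch of the proposed accounting; NOT actual Ceph code.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Node {
      uint64_t size = 0;           // recursive size of this subtree (rsize)
      uint64_t max_size = 0;       // quota; 0 means no quota set at this node
      int open_files = 0;          // files open for write directly here
      Node *parent = nullptr;
      std::vector<Node*> children;

      // Cached recursive values, maintained on link/unlink and size changes.
      int64_t nested_max_size_diff = 0;
      int nested_open_files = 0;

      // Rule 1: local headroom, nonzero only if a quota with room is set.
      int64_t max_size_diff() const {
        return (max_size > size) ? int64_t(max_size - size) : 0;
      }

      // Rule 2: local headroom plus the children's nested sums; a child
      // with its own quota is excluded, since its quota already bounds it.
      int64_t compute_nested_diff() const {
        int64_t sum = max_size_diff();
        for (const Node *c : children)
          if (c->max_size == 0)
            sum += c->nested_max_size_diff;
        return sum;
      }

      // The similar counter: how many open files feed the sum above.
      int compute_nested_open() const {
        int sum = open_files;
        for (const Node *c : children)
          if (c->max_size == 0)
            sum += c->nested_open_files;
        return sum;
      }
    };

    // Recompute along the path to the root after something changed below.
    void propagate_up(Node *n) {
      for (; n; n = n->parent) {
        n->nested_max_size_diff = n->compute_nested_diff();
        n->nested_open_files = n->compute_nested_open();
      }
    }

    int main() {
      // Mirror the test setup: /volume with a 250M quota, children aaa, bbb.
      Node volume, aaa, bbb;
      volume.max_size = 250ull << 20;
      aaa.parent = &volume;
      bbb.parent = &volume;
      volume.children = {&aaa, &bbb};

      // One client writes 100M under aaa; rstat rolls the size up to /volume.
      aaa.size = 100ull << 20;
      aaa.open_files = 1;
      volume.size += aaa.size;
      propagate_up(&aaa);

      std::cout << "headroom under /volume: "
                << volume.nested_max_size_diff / (1 << 20) << "M across "
                << volume.nested_open_files << " open file(s)\n";  // 150M, 1
    }

Note the quota-boundary rule in compute_nested_diff(): a child with its
own max_size is deliberately left out of the parent's sum, so nested
quotas stay independent of each other.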
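
On the test case 2 question, a tuning note: the g_conf interval mentioned
above should be adjustable in ceph.conf.  The option name below is an
assumption on my part (I believe it is mds_scatter_nudge_interval, in
seconds, but check the config defaults in the source to be sure):

    [mds]
        ; assumed option name; a lower value means faster rstat
        ; propagation across the subtree boundary, at the cost of
        ; more lock churn on that boundary
        mds scatter nudge interval = 1

With a smaller interval, the /volume replica on mds1 should see the
propagated rstats sooner, which would narrow (though not eliminate) the
window in which test case 2 overshoots the quota.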