On Fri, 4 Jun 2010, Henry C Chang wrote:
> > preserves the current doubling logic but caps it at some value, so the
> > admin can trade throughput vs quota precision. And/or we can also make
> > it dynamically reduce that window as the user approaches the limit.
>
> Yes. But if there are multiple clients writing to one subtree
> concurrently, it is a little bit difficult to say whether we are
> approaching the limit... we need to know how many clients are writing
> to the same subtree...

I would suggest some sort of recursive 'nested_max_size_diff' accounting
on each mds that works similarly to the nested_* values in CInode etc.
Basically, fix up the cache link/unlink methods in CDir to adjust the
recursive counts (like the anchor and auth_pin counters), and define some
specific rules like:

 - max_size_diff for any given node is max_size - size (if max_size > size)
 - nested_max_size_diff is max_size_diff + the sum over children, if the
   child does not have its own recursive quota set

The accounting can initially be done locally on each mds, which means the
extent to which clients can exceed their quota before getting shut down
would increase if the quota subtree spans multiple mds's.  (I don't think
that's a big deal, personally, but it depends on how strict you want to
be.  Later we could devise some additional mechanism that allocates the
remaining quota space among the nodes the region spans, or something.)

Maybe a similar counter could simply count how many open files are
contributing to that sum, so you can tell a bit more about how that
available space should be distributed...?

> The export_dir command is working well, and gives us a convenient way
> to test multi-mds scenarios.  Not surprisingly, our current
> implementation is not working in a multi-mds environment... :)
>
> My test setup:
> Under the mount point, I created /volume, /volume/aaa, /volume/bbb.
> mds0 is authoritative for /volume and /volume/aaa.
> mds1 is authoritative for /volume/bbb.
> Quota is set on /volume: 250M
>
> Test case 0: pass
>   cp 100M file to /volume/aaa/a0
>   cp 100M file to /volume/aaa/a1
>   cp 100M file to /volume/aaa/a2  ==> quota exceeded error is expected here
>
> Test case 1: pass
>   cp 100M file to /volume/bbb/b0
>   cp 100M file to /volume/bbb/b1
>   cp 100M file to /volume/aaa/a1  ==> quota exceeded error is expected here
>
> Test case 2: failed
>   cp 100M file to /volume/bbb/b0
>   cp 100M file to /volume/bbb/b1
>   cp 100M file to /volume/bbb/b2  ==> quota exceeded error is expected here
>
> It seems that rstats can be propagated up (from mds1 to mds0) quickly
> (case 1); however, the ancestor replica (/volume) on mds1 is not
> updated (case 2).  I wonder how/when the replicas get updated.  I'm
> still digging through the source code to find where. :(

There is a Locker::scatter_nudge() function that periodically twiddles
the lock state on the subtree boundary so that the rstat information
propagates between nodes.  There is an interval in g_conf that controls
how often that happens...

sage
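
To make the proposed accounting concrete, here is a minimal standalone
sketch of the two rules above, plus the suggested open-file counter.  To
be clear about what is invented: Node, propagate_up(), and the field
names (other than the nested_max_size_diff idea itself) are hypothetical
stand-ins for CInode/CDir state, and a real implementation would do
incremental +/- adjustments in CDir's link/unlink paths rather than the
full recompute shown here.

    // Standalone sketch of the proposed accounting; NOT actual Ceph code.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Node {
      uint64_t size = 0;           // recursive size of this subtree (rsize)
      uint64_t max_size = 0;       // quota; 0 means no quota set at this node
      int open_files = 0;          // files open for write directly here
      Node *parent = nullptr;
      std::vector<Node*> children;

      // Cached recursive values, maintained on link/unlink and size changes.
      int64_t nested_max_size_diff = 0;
      int nested_open_files = 0;

      // Rule 1: local headroom, nonzero only if a quota with room is set.
      int64_t max_size_diff() const {
        return (max_size > size) ? int64_t(max_size - size) : 0;
      }

      // Rule 2: local headroom plus the children's nested sums; a child
      // with its own quota is excluded, since its quota already bounds it.
      int64_t compute_nested_diff() const {
        int64_t sum = max_size_diff();
        for (const Node *c : children)
          if (c->max_size == 0)
            sum += c->nested_max_size_diff;
        return sum;
      }

      // The similar counter: how many open files feed the sum above.
      int compute_nested_open() const {
        int sum = open_files;
        for (const Node *c : children)
          if (c->max_size == 0)
            sum += c->nested_open_files;
        return sum;
      }
    };

    // Recompute along the path to the root after something changed below.
    void propagate_up(Node *n) {
      for (; n; n = n->parent) {
        n->nested_max_size_diff = n->compute_nested_diff();
        n->nested_open_files = n->compute_nested_open();
      }
    }

    int main() {
      // Mirror the test setup: /volume with a 250M quota, children aaa, bbb.
      Node volume, aaa, bbb;
      volume.max_size = 250ull << 20;
      aaa.parent = &volume;
      bbb.parent = &volume;
      volume.children = {&aaa, &bbb};

      // One client writes 100M under aaa; rstat rolls the size up to /volume.
      aaa.size = 100ull << 20;
      aaa.open_files = 1;
      volume.size += aaa.size;
      propagate_up(&aaa);

      std::cout << "headroom under /volume: "
                << volume.nested_max_size_diff / (1 << 20) << "M across "
                << volume.nested_open_files << " open file(s)\n";  // 150M, 1
    }

Note the quota-boundary rule in compute_nested_diff(): a child with its
own max_size is deliberately left out of the parent's sum, so nested
quotas stay independent of each other.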
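
On the test case 2 question, a tuning note: the g_conf interval mentioned
above should be adjustable in ceph.conf.  The option name below is an
assumption on my part (I believe it is mds_scatter_nudge_interval, in
seconds, but check the config defaults in the source to be sure):

    [mds]
        ; assumed option name; a lower value means faster rstat
        ; propagation across the subtree boundary, at the cost of
        ; more lock churn on that boundary
        mds scatter nudge interval = 1

With a smaller interval, the /volume replica on mds1 should see the
propagated rstats sooner, which would narrow (though not eliminate) the
window in which test case 2 overshoots the quota.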