Re: CephFS Space Accounting and Quotas

"Jim Schutt" <jaschut@xxxxxxxxxx> · Wed, 6 Mar 2013 16:14:42 -0700

On 03/06/2013 02:39 PM, Greg Farnum wrote:
> On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote:
>> On 03/06/2013 01:21 PM, Greg Farnum wrote:
>>>>> Also, this issue of stat on files created on other clients seems
>>>>> like it's going to be problematic for many interactions our users
>>>>> will have with the files created by their parallel compute jobs -
>>>>> any suggestion on how to avoid or fix it?
>>>
>>> Brief background: stat is required to provide file size information,
>>> and so when you do a stat Ceph needs to find out the actual file
>>> size. If the file is currently in use by somebody, that requires
>>> gathering up the latest metadata from them. Separately, while Ceph
>>> allows a client and the MDS to proceed with a bunch of operations
>>> (e.g., mknod) without having it go to disk first, it requires anything
>>> which is visible to a third party (another client) be durable on disk
>>> for consistency reasons.
>>>  
>>> These combine to mean that if you do a stat on a file which a client
>>> currently has buffered writes for, that buffer must be flushed out to
>>> disk before the stat can return. This is the usual cause of the slow
>>> stats you're seeing. You should be able to adjust dirty data
>>> thresholds to encourage faster writeouts, do fsyncs once a client is
>>> done with a file, etc in order to minimize the likelihood of running
>>> into this. Also, I'd have to check but I believe opening a file with
>>> LAZY_IO or whatever will weaken those requirements - it's probably
>>> not the solution you'd like here but it's an option, and if this
>>> turns out to be a serious issue then config options to reduce
>>> consistency on certain operations are likely to make their way into
>>> the roadmap. :)
>>
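For reference, a minimal sketch of the "fsync once a client is done with a file" suggestion above, assuming the writers are Python programs. The helper name write_and_sync is hypothetical, and whether this shortens a later stat from another client on CephFS is not something this sketch demonstrates.

```python
import os

def write_and_sync(path, data):
    """Write data to path and force it to stable storage before closing."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to flush the file to disk
    # Report the size a subsequent stat sees from this client.
    return os.stat(path).st_size
```

A compute job would call this (or just the flush/fsync pair) as its last step on each output file, so no dirty buffers are left for a later stat to wait on.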
>> That all makes sense.
>>  
>> But, it turns out the files in question were written yesterday,
>> and I did the stat operations today.
>>  
>> So, shouldn't the dirty buffer issue not be in play here?
> Probably not. :/
> 
> 
>> Is there anything else that might be going on?
> In that case it sounds like either there's a slowdown on disk access
> that is propagating up the chain very bizarrely, there's a serious
> performance issue on the MDS (e.g., swapping for everything), or the
> clients are still holding onto capabilities for the files in question
> and you're running into some issues with the capability revocation
> mechanisms.
> Can you describe your setup a bit more? What versions are you
> running, kernel or userspace clients, etc. What config options are
> you setting on the MDS? Assuming you're on something semi-recent,
> getting a perfcounter dump from the MDS might be illuminating as
> well.

When I'm doing these stat operations the file system is otherwise
idle.

What is happening is that once one of these slow stat operations
on a file completes, the slowness never recurs for that file, from
any client.  At least, that's the case as long as I'm no longer
writing to the file.  I haven't checked whether appending to the
files brings the behavior back.
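
To make the "slow once, then fast" pattern easy to capture, here is a rough sketch of how the first and repeat stat times could be compared per file. The function names time_stat and compare_first_and_repeat are hypothetical, not part of any Ceph tooling, and the sketch only measures wall-clock time on the calling client.

```python
import os
import time

def time_stat(path):
    """Return (elapsed_seconds, st_size) for a single os.stat() call."""
    start = time.perf_counter()
    st = os.stat(path)
    return time.perf_counter() - start, st.st_size

def compare_first_and_repeat(paths):
    """Stat each path twice; return a list of (path, first_s, second_s)."""
    results = []
    for path in paths:
        first, _ = time_stat(path)    # may be slow for a never-stat'ed file
        second, _ = time_stat(path)   # expected to be fast if the report holds
        results.append((path, first, second))
    return results
```

Running it over files written by another client, and then again from a third client, should show whether only the very first stat of each file is slow.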

On the client side I'm running kernel 3.8.2 plus the ceph patch queue
that was merged into 3.9-rc1.

On the server side I'm running a recent next branch (commit 0f42eddef5),
with the tcp receive socket buffer option patches cherry-picked.
I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num
rather than pg_bits to set the initial number of PGs (same for pgp_num),
and a patch that lets me run with just one pool that contains both
data and metadata.  I'm testing data distribution uniformity with 512K PGs.

My MDS tunables are all at default settings.

> 
> We'll probably want to get a high-debug log of the MDS during these slow stats as well.

OK.

Do you want me to try to reproduce with a more standard setup?

Also, I see Sage just pushed a patch to pgid decoding - I expect
I need that as well, since I'm running the latest client code.

Do you want the MDS log at debug level 10 or 20?

-- Jim

> -Greg
