Re: CephFS Space Accounting and Quotas

On 03/11/2013 09:48 AM, Greg Farnum wrote:
> On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote:
>> On 03/08/2013 07:05 PM, Greg Farnum wrote:
>>> On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote:
>>>> On 03/07/2013 08:15 AM, Jim Schutt wrote:
>>>>> On 03/06/2013 05:18 PM, Greg Farnum wrote:
>>>>>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
>>>>
>>>> [snip]
>>>>  
>>>>>>> Do you want the MDS log at 10 or 20?
>>>>>>  
>>>>>> More is better. ;)
>>>>>
>>>>> OK, thanks.
>>>>  
>>>>  
>>>> I've sent some mds logs via private email...
>>>>  
>>>> -- Jim  
>>>  
>>> I'm going to need to probe into this a bit more, but on an initial
>>> examination I see that most of your stats are actually happening very
>>> quickly — it's just that occasionally they take quite a while.
>>
>> Interesting...
>>  
>>> Going
>>> through the MDS log for one of those, the inode in question is
>>> flagged with "needsrecover" from its first appearance in the log —
>>> that really shouldn't happen unless a client had write caps on it and
>>> the client disappeared. Any ideas? The slowness is being caused by
>>> the MDS going out and looking at every object which could be in the
>>> file — there are a lot since the file has a listed size of 8GB.
>>
>> For this run, the MDS logging slowed it down enough to cause the
>> client caps to occasionally go stale. I don't think it's the cause
>> of the issue, because I was having it before I turned MDS debugging
>> up. My client caps never go stale at, e.g., debug mds 5.
> 
> Oh, so this might be behaviorally different than you were seeing before? Drat.
> 
> You had said before that each newfstatat was taking tens of seconds,
> whereas in the strace log you sent along most of the individual calls
> were taking a bit less than 20 milliseconds. Do you have an strace of
> them individually taking much more than that, or were you just
> noticing that they took a long time in aggregate?

When I did the first strace, I didn't turn on timestamps, and I was
watching it scroll by.  I saw several stats in a row take ~30 secs,
at which point I got bored, and took a look at the strace man page to
figure out how to get timestamps ;)
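For the next run I'll capture the timing explicitly. A rough sketch of the
invocation I have in mind (the mount point and output path are just
placeholders):

  # -ttt: absolute wall-clock timestamps with microseconds
  # -T:   time spent inside each syscall
  # -f:   follow any child processes find spawns
  strace -f -ttt -T -o /tmp/find.strace find /mnt/ceph/testdir -type f

That should make it obvious whether any single newfstatat takes tens of
seconds, or whether the time only shows up in aggregate.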

Also, another difference: for that test I was looking at files I had written
the day before, whereas for the strace log I sent, only a few minutes elapsed
between writing the files and running the strace of find.

I thought I had eliminated the page cache issue by using fdatasync
when writing the files.  Perhaps the real issue is affected by that
delay?
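For what it's worth, the write pattern can be approximated with dd; this is
only a sketch (the path and size are illustrative, not exactly what my test
does), but conv=fdatasync gives the same fdatasync-before-exit behavior:

  # conv=fdatasync makes dd call fdatasync() on the output file before it
  # exits, so the file data is flushed out of the page cache to the OSDs;
  # the client can still be holding write caps and dirty metadata afterwards.
  dd if=/dev/zero of=/mnt/ceph/testdir/file.0 bs=1M count=8192 conv=fdatasync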

> I suppose if you were going to run it again then just the message
> logging could also be helpful. That way we could at least check and
> see the message delays and if the MDS is doing other work in the
> course of answering a request.

I can do as many trials as needed to isolate the issue.

What message debugging level is sufficient on the MDS? Is 1 enough?
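I'm assuming "message logging" means the messenger subsystem, i.e. something
like this in ceph.conf on the MDS node (just a sketch):

  [mds]
      debug ms = 1    # log message send/receive
      debug mds = 5   # a level at which my client caps have not gone stale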

If you want I can attempt to duplicate my memory of the first
test I reported, writing the files today and doing the strace
tomorrow (with timestamps, this time).

Also, would it be helpful to write the files with minimal logging, to perturb
the timing as little as possible, and then raise the logging level for the
stat phase?
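If so, I could bump the level at runtime right before the find, without
restarting the MDS. Something along these lines (mds.0 is a placeholder for
the MDS name, and I know the tell/injectargs syntax has varied a bit between
releases):

  # raise messenger and MDS debugging only for the stat phase
  ceph tell mds.0 injectargs '--debug-ms 1 --debug-mds 20'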

> 
>> Otherwise, there were no signs of trouble while writing the files.
>>  
>> Can you suggest which kernel client debugging I might enable that
>> would help understand what is happening? Also, I have the full
>> MDS log from writing the files, if that will help. It's big (~10 GiB).
>>  
>>> (There are several other mysteries here that can probably be traced
>>> to different varieties of non-optimal and buggy code as well — there
>>> is a client which has write caps on the inode in question despite it
>>> needing recovery, but the recovery isn't triggered until the stat
>>> event occurs, etc).
>>
>> OK, thanks for taking a look. Let me know if there is other
>> logging I can enable that will be helpful.
> 
> I'm going to want to spend more time with the log I've got, but I'll think about if there's a different set of data we can gather less disruptively.  

OK, cool.  Just let me know.

Thanks -- Jim

> -Greg

