Re: MDS running at 100% CPU, no clients

Noah Watkins <jayhawk@xxxxxxxxxxx> · Thu, 7 Mar 2013 09:32:42 -0800

On Mar 7, 2013, at 9:24 AM, Greg Farnum <greg@xxxxxxxxxxx> wrote:

> This isn't bringing up anything in my brain, but I don't know what that _sample() function is actually doing — did you get any farther into it?

_sample reads /proc/self/maps in a loop until EOF or until some other condition is hit. I couldn't figure out whether the thread was stuck in _sample or a level up. Anyhow, my gdb-fu isn't stellar and I managed to crash the MDS. I'm going to stick some log points in and try to reproduce it.
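
For anyone following along, here is a rough sketch of the kind of loop I mean: a getline() loop over a maps-style stream that runs until EOF. This is not the actual MemoryModel::_sample code. MapsSummary, summarize_maps and summarize_self are made-up names for illustration, and all it assumes is that each line starts with a hex "start-end" range, as in the strace output quoted below.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical summary type for illustration; not Ceph's MemoryModel::snap.
struct MapsSummary {
  std::size_t lines = 0;         // number of mappings parsed
  unsigned long long bytes = 0;  // total size of all mapped ranges
};

// Read a maps-style stream line by line until EOF.
// Each line is expected to start with "start-end" in hex,
// e.g. "7f0203235000-7f0203236000 ---p ...".
MapsSummary summarize_maps(std::istream& in) {
  MapsSummary s;
  std::string line;
  while (std::getline(in, line)) {
    unsigned long long start = 0, end = 0;
    char dash = 0;
    std::istringstream fields(line);
    // Skip lines that do not begin with a well-formed hex range.
    if (fields >> std::hex >> start >> dash >> end &&
        dash == '-' && end >= start) {
      ++s.lines;
      s.bytes += end - start;
    }
  }
  return s;
}

// Convenience wrapper for the calling process's own map.
MapsSummary summarize_self() {
  std::ifstream f("/proc/self/maps");
  return summarize_maps(f);
}
```

The work in a loop like this grows with the number of lines in the maps file, so logging the line count on each call (summarize_self().lines in this sketch) would be one way to tell whether a single pass is very long or whether the caller keeps re-entering it.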

> -Greg
> 
> On Wednesday, March 6, 2013 at 6:23 PM, Noah Watkins wrote:
> 
>> Which looks to be in a tight loop in the memory model's _sample…
>> 
>> (gdb) bt
>> #0 0x00007f0270d84d2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
>> #1 0x00007f027046dd88 in std::__basic_file<char>::xsgetn(char*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #2 0x00007f027046f4c5 in std::basic_filebuf<char, std::char_traits<char> >::underflow() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #3 0x00007f0270467ceb in std::basic_istream<char, std::char_traits<char> >& std::getline<char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #4 0x000000000072bdd4 in MemoryModel::_sample(MemoryModel::snap*) ()
>> #5 0x00000000005658db in MDCache::check_memory_usage() ()
>> #6 0x00000000004ba929 in MDS::tick() ()
>> #7 0x0000000000794c65 in SafeTimer::timer_thread() ()
>> #8 0x00000000007958ad in SafeTimerThread::entry() ()
>> #9 0x00007f0270d7de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
>> 
>> On Mar 6, 2013, at 6:18 PM, Noah Watkins <jayhawk@xxxxxxxxxxx> wrote:
>> 
>>> 
>>> On Mar 6, 2013, at 5:57 PM, Noah Watkins <jayhawk@xxxxxxxxxxx> wrote:
>>> 
>>>> The MDS process in my cluster is running at 100% CPU. In fact, I thought the cluster had gone down, but it turned out an ls was just taking a minute. There aren't any clients active. I've left the process running in case there is any probing you'd like to do on it:
>>>> 
>>>> VIRT   RES  SHR   S  %CPU  %MEM  TIME+      COMMAND
>>>> 4629m  88m  5260  S  92    1.1   113:32.79  ceph-mds
>>>> 
>>>> Thanks,
>>>> Noah
>>> 
>>> This is a ceph-mds child thread under strace. It's the only thread
>>> that appears to be doing anything.
>>> 
>>> root@issdm-44:/home/hadoop/hadoop-common# strace -p 3372
>>> Process 3372 attached - interrupt to quit
>>> read(1649, "7f0203235000-7f0203236000 ---p 0"..., 8191) = 4050
>>> read(1649, "7f0205053000-7f0205054000 ---p 0"..., 8191) = 4050
>>> read(1649, "7f0206e71000-7f0206e72000 ---p 0"..., 8191) = 4050
>>> read(1649, "7f0214144000-7f0214244000 rw-p 0"..., 8191) = 4020
>>> read(1649, "7f0215f62000-7f0216062000 rw-p 0"..., 8191) = 4020
>>> read(1649, "7f0217d80000-7f0217e80000 rw-p 0"..., 8191) = 4020
>>> read(1649, "7f0219b9e000-7f0219c9e000 rw-p 0"..., 8191) = 4020
>>> ...
>>> 
>>> That file looks to be:
>>> 
>>> ceph-mds 3337 root 1649r REG 0,3 0 266903 /proc/3337/maps
>>> 
>>> (3337 is the parent process).
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> 
> 
> 
