On 08/05/20 7:07 pm, Jaco Kroon wrote:
I'm not sure "stuck" is the right word; looking at the "statistics
heal-count" values it goes into a kind of go-slow mode, and in some
cases ends up adding entries for heal at a rate of about 2 per second,
sometimes 4 at worst (based on sampling at ~10-15 second intervals,
depending on how long it takes to gather the heal-counts).
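(For reference, the heal-count sampling here is just the standard CLI
query, roughly:

gluster volume heal VOLNAME statistics heal-count

with VOLNAME standing in for the actual volume name.)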
My current guess is that, since the log entries scrolling by relate to
metadata and entry heals, those are files being healed; when it hits a
folder it somehow locks something and then iterates through the folder
(very) slowly for some reason, and while this is happening, directory
listings are outright unusable.
Is there lock profiling code available in glusterfs?
Nope, nothing as of today as far as I know. You can have the locks
xlator log everything by setting the locks.trace option to ON though.
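(Assuming that option is exposed through the normal volume-set
interface under that name — worth confirming against `gluster volume
set help` — it would be along the lines of:

gluster volume set VOLNAME locks.trace on
# and back off again once done, since it logs everything:
gluster volume set VOLNAME locks.trace off

with VOLNAME as a placeholder.)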
I saw this "issue": https://github.com/gluster/glusterfs/issues/275
There was also a related PR I can't track down now which aims to wrap
the pthread mutex_* calls with another library to time locks etc.
Having looked at that code it made sense, but it could trivially
introduce extra latency of a few microseconds even in the optimal
case, and at worst could change lock behaviour (every _lock first does
a _trylock and then re-issues _lock), although I haven't fully
analysed the code change. One would also need to modify every place
where mutexes are used. Alternatively, LD_PRELOAD hacks are useful,
but then obtaining the code locations where locks are initialised and
used becomes much, much harder. Either way it would make it quite
trivial to analyse whether locks are contended on for extended periods
(more than a few milliseconds) or held for extended times, and I'm
guessing backtrace() from execinfo could help to at least get an idea
of which locks are involved.
This looks like the approach used by mutrace:
http://0pointer.de/blog/projects/mutrace.html so I might just end up
using that. Ironically enough, if the author of that project added a
little more graph-tracking code he could turn it into a deadlock
detection tool as well.
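(mutrace works by LD_PRELOADing its own shared object into the target,
so the process has to be started under it rather than attached to
after the fact; roughly, with the brick command line abbreviated:

# contention statistics are printed when the process exits
mutrace /usr/sbin/glusterfsd -s <server> --volfile-id <volfile-id> ...

which is awkward for glusterd-managed bricks but fine for a test
instance.)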
I also see from the profile info in your first email that the XATTROP
fop has the highest latency. Can you do an strace of the brick
processes to see if there is any syscall (setxattr in particular, for
the XATTROP fop) that is taking longer than usual?
I'm familiar with strace. What am I looking for? How will I identify
XATTROP fops and their associated system calls in this trace?
XATTROP is a FOP meant for the index xlator. Index serializes this FOP
and sends a setxattr to the posix xlator, and in the callback performs
some link/unlink operations inside .glusterfs/indices/. So I was
wondering if the bottleneck is in the setxattr syscall itself (which
can be seen in the strace, i.e. the time taken for the syscall) or in
the serialization inside index. The xattr keys will be trusted.afr*,
like so:
[pid 517071] fsetxattr(16, "trusted.afr.dirty", "\0\0\0\1\0\0\0\0\0\0\0", 12, 0) = 0 <0.000135>
[pid 517071] unlink("/home/ravi/bricks/brick1/.glusterfs/indices/xattrop/6ea17fe6-9080-4054-ab98-775d37ea143b") = 0 <0.000086>
But in any case you can check if there are any syscalls (not just
setxattr) that take a long time on the drives.
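(Something along these lines should capture per-syscall timings on a
brick; the syscall filter and the PID placeholder are just an example:

# -f: follow threads, -T: time spent in each syscall, -tt: wall-clock timestamps
strace -f -T -tt -e trace=setxattr,fsetxattr,lsetxattr,link,unlink -p <brick-pid> -o /tmp/brick.strace

and then look for large values in the <...> duration field.)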
For the slow ls from the client, an ongoing self-heal should not
affect it, since readdir, stat etc. are all read-only operations that
do not take any locks that can compete with heal, and from what you
say there is no hardware bottleneck at the brick side in processing
requests. Maybe a tcpdump between the client and the server can help
establish whether the slowness is on the server side, by looking at
the request and response times of the FOPs in the dump.
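(For the capture, something like the following on either end should
do; 24007 is the management port, and the 49152+ range used here is
only the usual default for brick ports, so confirm the actual ports
with `gluster volume status`:

tcpdump -i any -s 0 -w /tmp/gluster.pcap host <other-side> and \( port 24007 or portrange 49152-49251 \)

Wireshark's GlusterFS dissector should then be able to decode the FOPs
and their request/response timings.)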
Well, it does. I've sampled a few hundred ls's since my email this
morning. All of them were sub 0.5s.
I guess you need to check, when the heal does happen, *which* FOP is
slow for the `ls -l` operation. Is it readdirp? stat?
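(The per-FOP latencies can be pulled from the profile interface while
the heal is running, e.g.:

gluster volume profile VOLNAME start
# reproduce the slow `ls -l`, then:
gluster volume profile VOLNAME info
gluster volume profile VOLNAME stop

and, if your version supports the io-stats dump xattr, a client-side
view via something like
setfattr -n trusted.io-stats-dump -v /tmp/fuse-iostats /path/to/fuse-mount
on the mount point.)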
The only difference is that all three SHDs were killed. Almost
without a doubt in my mind there has to be some unforeseen
interaction. We've been having a few discussions: a very, very long
time ago we effectively actioned a full heal with "find
/path/to/fuse-mount -exec stat {} \;", and at other times with "find
/path/to/fuse-mount -print0 | xargs -0 -n50 -P5 -- stat", and if we
recall correctly we saw similar behaviour there. We thought at the
time that the FUSE process was causing the trouble, and it's entirely
possible that that is indeed the case, but we are now wondering
whether that, too, might not have been related to the healing that the
stats were triggering (heal on stat). We generally run with
cluster.*-self-heal off nowadays and rely purely on the SHD, for
performance reasons.
We've had complaints of client-side heal (data/metadata/entry) stealing
too much I/O bandwidth, which is why we turned the default values to
off in AFR.
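(That is the usual trio of options, settable per volume:

gluster volume set VOLNAME cluster.data-self-heal off
gluster volume set VOLNAME cluster.metadata-self-heal off
gluster volume set VOLNAME cluster.entry-self-heal off

with VOLNAME as a placeholder.)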
Thanks,
Ravi