On Tue, Nov 12, 2013 at 01:15:00PM -0500, Qing Chang wrote:
Hi Stan,
thank you very much for your comments; I need some time to
work through all of your suggestions.
Yes, this is a dovecot IMAP and sendmail SMTP server, and there
are 1360 mailboxes.
With this email I'd like to give you a bit more information from
strace -T and other commands.
Because "# CONFIG_LOCK_STAT is not set", I cannot run
"perf lock record". Since this is a production server, I am reluctant to
rebuild the kernel on it. But I hope the info below will show
that this is not lock related. Since the slow "ls -l" and "du"
also happen on the two file servers (64TB and 32TB), it does not
feel lock related either.
I ran "(time strace -T ls -l /var/spool/mail > ls-out) >& ls-trace"
and got quite a bit of data.
First "time" output:
=====
real 0m17.720s
user 0m0.574s
sys 0m1.271s
=====
About 14s was spent on 164 "poll" calls on fd=5 that each took more than
10ms; the total number of "poll" calls is 6618!
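For anyone who wants to reproduce this kind of tally, here is a small sketch of how the slow polls can be counted from the ls-trace file. It assumes the strace -T output format shown below, where each syscall line ends with the elapsed time in angle brackets; the 0.010s threshold matches the 10ms cutoff above.

```python
import re

# Match a "poll(...)" line from "strace -T" output and capture the elapsed
# time that strace appends in angle brackets, e.g. "... = 1 <0.012345>".
POLL_LINE = re.compile(r'^poll\(.*<(\d+\.\d+)>\s*$')

def tally_slow_polls(trace_lines, threshold=0.010):
    """Return (total poll calls, slow poll calls, seconds spent in slow ones)."""
    total_calls, slow_calls, slow_time = 0, 0, 0.0
    for line in trace_lines:
        m = POLL_LINE.match(line)
        if not m:
            continue
        total_calls += 1
        elapsed = float(m.group(1))
        if elapsed > threshold:
            slow_calls += 1
            slow_time += elapsed
    return total_calls, slow_calls, slow_time

# Two sample lines in the traced format: one slow poll, one fast one.
sample = [
    'poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) <0.085000>',
    'poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, revents=POLLOUT}]) <0.000012>',
]
print(tally_slow_polls(sample))  # -> (2, 1, 0.085)
```

Running it over the real ls-trace file (open("ls-trace") instead of the sample list) should reproduce the 164-of-6618 figure.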
So, think about it for a minute. What's that fd?
##### fd=5 opened:
socket(PF_FILE, SOCK_STREAM, 0) = 5 <0.000016>
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR) <0.000010>
fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000010>
fcntl(5, F_GETFD) = 0 <0.000010>
fcntl(5, F_SETFD, FD_CLOEXEC) = 0 <0.000010>
connect(5, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"}, 110) = 0 <0.001886>
fstat(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 <0.000010>
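That open sequence is just a non-blocking, close-on-exec UNIX stream socket connected to sssd's NSS pipe. A minimal Python sketch of the same dance (the pipe path is taken from the trace; actually connecting to it will only work on a host running sssd):

```python
import fcntl
import socket

SSS_NSS_PIPE = "/var/lib/sss/pipes/nss"   # path seen in the connect() above

def open_nss_pipe(path=SSS_NSS_PIPE):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.setblocking(False)               # the F_SETFL O_RDWR|O_NONBLOCK step
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    sock.connect(path)                    # raises if sssd is not listening
    return sock
```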
So, in some cases it's taking 10ms to get some information from the
other end of a pipe. And that pipe has something to do with SSSD.
poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, revents=POLLOUT}]) <0.000012>
sendto(5, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = 16 <0.000029>
poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, revents=POLLOUT}]) <0.000012>
sendto(5, "\1\0\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 <0.000027>
poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) <0.000011>
read(5, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000013>
poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) <0.000011>
read(5, "\1\0\0\0", 4) = 4 <0.000025>
poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, revents=POLLOUT}]) <0.000011>
sendto(5, "\24\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, 0) = 16 <0.000026>
poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, revents=POLLOUT}]) <0.000011>
sendto(5, "\225N\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 <0.000025>
poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) <0.000515>
read(5, "L\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000012>
poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) <0.000011>
read(5, "\1\0\0\0\0\0\0\0\225N\0\0\311\0\0\0qchang\0*\0Qing Ch"..., 60) = 60 <0.000026>
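The bytes on the wire are easy enough to pick apart. Assuming the sss_nss framing of four little-endian uint32s (total length, command, status, reserved) followed by the payload, the second request above decodes as a lookup by uid; command 0x12 appears to be a get-passwd-entry-by-uid request (an assumption, but the reply does carry the username "qchang"):

```python
import struct

# The 16-byte header and 4-byte payload from the sendto() calls above,
# with strace's octal escapes rewritten in hex (\24 = 0x14, \22 = 0x12,
# \225N = 0x95 0x4e).
header = b"\x14\x00\x00\x00\x12\x00\x00\x00" + b"\x00" * 8
payload = b"\x95\x4e\x00\x00"

# Assumed framing: total length, command, status, reserved, all LE uint32.
length, command, status, reserved = struct.unpack("<4I", header)
uid = struct.unpack("<I", payload)[0]
print(length, hex(command), uid)   # -> 20 0x12 20117
```

So each of those poll/sendto/read round trips is one uid (or gid) being resolved to a name through sssd.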
That looks like uid -> username and gid -> group name translations.
i.e. you've got a slow LDAP server and that is what is causing your
problems, not anything to do with XFS....
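For reference, this is roughly what "ls -l" does per directory entry: an lstat() for the metadata, then uid->name and gid->name translations that go through NSS, and on this box through sssd and out to LDAP. "ls -ln" prints numeric ids and skips the translation, so timing "ls -l" against "ls -ln" on /var/spool/mail is a quick way to confirm where the 17s is going. A minimal sketch:

```python
import grp
import os
import pwd

def long_listing(path):
    """Roughly what 'ls -l' does: lstat each entry, then translate ids."""
    for name in sorted(os.listdir(path)):
        st = os.lstat(os.path.join(path, name))
        try:
            owner = pwd.getpwuid(st.st_uid).pw_name   # NSS lookup (sssd here)
        except KeyError:
            owner = str(st.st_uid)                    # what 'ls -ln' prints
        try:
            group = grp.getgrgid(st.st_gid).gr_name   # NSS lookup (sssd here)
        except KeyError:
            group = str(st.st_gid)
        print(f"{owner:8} {group:8} {st.st_size:10} {name}")

long_listing("/tmp")
```

With 1360 mailboxes and 6618 polls, each listing entry is evidently costing several round trips to sssd, which is exactly where a slow backing LDAP server would hurt.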