I've hit the limit of my expertise on this one. Can somebody help us get
over the finish line? I think we're pretty close.

On 11/12/2013 12:15 PM, Qing Chang wrote:
> Hi Stan,
>
> Thank you very much for your comments; I need some time to follow
> through on all of your suggestions.
>
> Yes, this is a dovecot IMAP and sendmail SMTP server, and there are
> 1360 mailboxes.
>
> With this email I'd like to give you a bit more information, from
> strace -T and other commands.
>
> Because "# CONFIG_LOCK_STAT is not set", I cannot run "perf lock
> record", and since this is a production server I am reluctant to
> rebuild the kernel on it. But I hope the info below shows that this
> is not "lock" related. Since slow "ls -l" and "du" also happen on the
> two file servers (64TB and 32TB), it does not feel lock related.

Yeah, the locking was just an initial WAG without enough thought on my
part. Upon reflection, the directory inodes shouldn't be locked, and
those are what ls and du work with.

> I ran "(time strace -T ls -l /var/spool/mail > ls-out) >& ls-trace"
> and got quite a bit of data.
>
> First, the "time" output:
> =====
> real 0m17.720s
> user 0m0.574s
> sys 0m1.271s
> =====
>
> Some 14s was spent on the 164 "poll" calls on fd=5 that took more
> than 10ms each; the total number of "poll" calls is 6618!

This strace data definitely tells us something, as it shows us where
the additional time is being spent.

> Below is the trace output. Listed at the top are the calls that took
> more than 100ms; the three sections after that give the context
> around the three longest calls (0.309854s, 0.309411s and 0.290189s).
> Note that the three mailboxes involved are not special compared to
> other mailboxes; a run at another time produces slow calls on
> different mailboxes. Also, cache seems to play a role: after a fresh
> run that takes a long time, subsequent runs give instant results.

On a mailbox server with 32GB RAM and only 1360 mailboxes, this makes
me wonder why the directory metadata for these files is being evicted
from the inode cache. Perhaps vfs_cache_pressure or another tunable
has been modified. But why does it take more than 10s to read it back
in from disk?

I don't know what the poll system call does. I'm at the limit of my
expertise here, Qing. Hopefully one of the XFS experts can jump in and
get you across the finish line. At least we now know which system call
is eating up all the time. And we know there is an inode cache
retention issue, or at least there seems to be.
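One easy thing to check on the cache side while we wait for the
experts: whether the reclaim tunables have been changed from stock. A
minimal sketch; the proc paths are standard on RHEL 6, and I'm assuming
the usual slab names for an XFS box:

# 100 is the kernel default; values well above 100 make the kernel
# reclaim dentry and inode cache much more aggressively.
cat /proc/sys/vm/vfs_cache_pressure

# current dentry/xfs_inode slab usage, for a before/after comparison
grep -E '^(dentry|xfs_inode) ' /proc/slabinfo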
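And for anyone who wants to quantify the trace below without reading
all of it, the per-call times that strace -T appends in angle brackets
are easy to mine. A rough sketch against the "ls-trace" file from
Qing's command above, assuming each line ends with the <seconds> field
and contains no other angle brackets, as in the excerpts below;
thresholds of 0.1 and 0.01 should reproduce his >100ms list and his
count of 164 calls over 10ms:

# count poll() calls and print the ones slower than 100ms
awk -F'[<>]' '/^poll\(/ { total++; if ($(NF-1) > 0.1) { slow++; print } }
  END { printf "%d poll calls, %d over 100ms\n", total, slow }' ls-trace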
> ===== > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.144634> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.104580> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.104812> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.181959> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.102520> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.107539> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.142462> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.215745> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.309854> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.166665> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.101035> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.113119> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.120745> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.110027> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.137906> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.249039> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.104079> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.250003> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.103791> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.102018> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.103551> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.103696> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.102164> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.105843> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.104413> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.101304> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.101869> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.101366> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.132958> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.309411> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.100342> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.101279> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.144725> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.109867> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.109951> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.290189> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.102449> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.101182> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.111699> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.107351> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, 
revents=POLLIN}]) > <0.104558> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.115447> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.163904> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.266427> > > ##### df=5 opened: > socket(PF_FILE, SOCK_STREAM, 0) = 5 <0.000016> > fcntl(5, F_GETFL) = 0x2 (flags O_RDWR) <0.000010> > fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000010> > fcntl(5, F_GETFD) = 0 <0.000010> > fcntl(5, F_SETFD, FD_CLOEXEC) = 0 <0.000010> > connect(5, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"}, 110) = 0 > <0.001886> > fstat(5, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 <0.000010> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000012> > sendto(5, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, > 0) = 16 <0.000029> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000012> > sendto(5, "\1\0\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 <0.000027> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000011> > read(5, "\24\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000013> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000011> > read(5, "\1\0\0\0", 4) = 4 <0.000025> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000011> > sendto(5, "\24\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, > 0) = 16 <0.000026> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000011> > sendto(5, "\225N\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 <0.000025> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000515> > read(5, "L\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000012> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000011> > read(5, "\1\0\0\0\0\0\0\0\225N\0\0\311\0\0\0qchang\0*\0Qing Ch"..., 60) > = 60 <0.000026> > open("/etc/group", O_RDONLY|O_CLOEXEC) = 6 <0.000017> > fstat(6, {st_mode=S_IFREG|0644, st_size=790, ...}) = 0 <0.000010> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) > = 0x7fd5d53f4000 <0.004539> > read(6, "root:x:0:root\nbin:x:1:root,bin,d"..., 4096) = 790 <0.000018> > read(6, "", 4096) = 0 <0.000011> > close(6) = 0 <0.000011> > munmap(0x7fd5d53f4000, 4096) = 0 <0.000016> > open("/var/lib/sss/mc/group", O_RDONLY|O_CLOEXEC) = 6 <0.000021> > fstat(6, {st_mode=S_IFREG|0644, st_size=5206312, ...}) = 0 <0.000010> > mmap(NULL, 5206312, PROT_READ, MAP_SHARED, 6, 0) = 0x7fd5ce3c1000 > <0.000015> > poll([{fd=5, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000012> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000011> > sendto(5, "\24\0\0\0\"\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, > 0) = 16 <0.000031> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000011> > sendto(5, "\311\0\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 <0.000025> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000427> > read(5, "?\0\0\0\"\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000013> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000011> > read(5, "\1\0\0\0\0\0\0\0\311\0\0\0\3\0\0\0\0*\0qiu"..., 47) = 47 > <0.000017> > lstat("/var/spool/mail/root", {st_mode=S_IFREG|0600, st_size=187649403, > ...}) = 0 <0.000019> > lgetxattr("/var/spool/mail/root", "security.selinux", > 
"unconfined_u:object_r:file_t:s0", 255) = 32 <0.000020> > lstat("/var/spool/mail/root", {st_mode=S_IFREG|0600, st_size=187649403, > ...}) = 0 <0.000014> > lgetxattr("/var/spool/mail/root", "system.posix_acl_access", 0x0, 0) = > -1 ENODATA (No data available) <0.000018> > lgetxattr("/var/spool/mail/root", "system.posix_acl_default", 0x0, 0) = > -1 ENODATA (No data available) <0.000016> > lstat("/var/spool/mail/agutstei", {st_mode=S_IFREG|0600, > st_size=3864251, ...}) = 0 <0.000016> > lgetxattr("/var/spool/mail/agutstei", "security.selinux", > "unconfined_u:object_r:file_t:s0", 255) = 32 <0.000018> > lstat("/var/spool/mail/agutstei", {st_mode=S_IFREG|0600, > st_size=3864251, ...}) = 0 <0.000015> > lgetxattr("/var/spool/mail/agutstei", "system.posix_acl_access", 0x0, 0) > = -1 ENODATA (No data available) <0.000016> > lgetxattr("/var/spool/mail/agutstei", "system.posix_acl_default", 0x0, > 0) = -1 ENODATA (No data available) <0.000015> > open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 7 <0.000017> > fstat(7, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0 <0.000011> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) > = 0x7fd5ce3c0000 <0.000014> > read(7, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1835 <0.000020> > read(7, "", 4096) = 0 <0.000011> > close(7) = 0 <0.000012> > > ###### took 0.309854s: > munmap(0x7fd5ce3c0000, 4096) = 0 <0.000015> > poll([{fd=5, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000011> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000010> > sendto(5, "\24\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, > 0) = 16 <0.000034> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000010> > sendto(5, "\253\260\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 <0.000335> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.309854> > read(5, "N\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000012> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000010> > read(5, "\1\0\0\0\0\0\0\0\253\260\0\0\253\1\0\0acqu\0*\0Chen Ge A"..., > 62) = 62 <0.000016> > lstat("/var/spool/mail/araouf", {st_mode=S_IFREG|0600, st_size=3943191, > ...}) = 0 <0.000052> > lgetxattr("/var/spool/mail/araouf", "security.selinux", > "unconfined_u:object_r:file_t:s0", 255) = 32 <0.000021> > lstat("/var/spool/mail/araouf", {st_mode=S_IFREG|0600, st_size=3943191, > ...}) = 0 <0.000014> > lgetxattr("/var/spool/mail/araouf", "system.posix_acl_access", 0x0, 0) = > -1 ENODATA (No data available) <0.000017> > lgetxattr("/var/spool/mail/araouf", "system.posix_acl_default", 0x0, 0) > = -1 ENODATA (No data available) <0.000015> > open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 7 <0.000016> > fstat(7, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0 <0.000010> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) > = 0x7fd5ce3c0000 <0.000012> > read(7, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1835 <0.000018> > read(7, "", 4096) = 0 <0.000010> > close(7) = 0 <0.000012> > > ###### took 0.309411: > munmap(0x7fd5ce3c0000, 4096) = 0 <0.000015> > poll([{fd=5, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000010> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000009> > sendto(5, "\24\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, > 0) = 16 <0.000035> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000010> > sendto(5, "\357\246\0\0", 4, 
MSG_NOSIGNAL, NULL, 0) = 4 <0.000342> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.309411> > read(5, "V\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000012> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000010> > read(5, "\1\0\0\0\0\0\0\0\357\246\0\0\253\1\0\0pthaya\0*\0Parthip"..., > 70) = 70 <0.000018> > lstat("/var/spool/mail/ptsao", {st_mode=S_IFREG|0600, st_size=515358, > ...}) = 0 <0.000025> > lgetxattr("/var/spool/mail/ptsao", "security.selinux", > "unconfined_u:object_r:file_t:s0", 255) = 32 <0.000021> > lstat("/var/spool/mail/ptsao", {st_mode=S_IFREG|0600, st_size=515358, > ...}) = 0 <0.000014> > lgetxattr("/var/spool/mail/ptsao", "system.posix_acl_access", 0x0, 0) = > -1 ENODATA (No data available) <0.000019> > lgetxattr("/var/spool/mail/ptsao", "system.posix_acl_default", 0x0, 0) = > -1 ENODATA (No data available) <0.000014> > open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 7 <0.000017> > fstat(7, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0 <0.000010> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) > = 0x7fd5ce3c0000 <0.000012> > read(7, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1835 <0.000019> > read(7, "", 4096) = 0 <0.000009> > close(7) = 0 <0.000011> > > ###### took 0.290189: > munmap(0x7fd5ce3c0000, 4096) = 0 <0.000015> > poll([{fd=5, events=POLLIN|POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000011> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000010> > sendto(5, "\24\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16, MSG_NOSIGNAL, NULL, > 0) = 16 <0.000034> > poll([{fd=5, events=POLLOUT}], 1, 300000) = 1 ([{fd=5, > revents=POLLOUT}]) <0.000010> > sendto(5, "\10\270\0\0", 4, MSG_NOSIGNAL, NULL, 0) = 4 <0.000391> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.290189> > read(5, "N\0\0\0\22\0\0\0\0\0\0\0\0\0\0\0", 16) = 16 <0.000012> > poll([{fd=5, events=POLLIN}], 1, 300000) = 1 ([{fd=5, revents=POLLIN}]) > <0.000010> > read(5, "\1\0\0\0\0\0\0\0\10\270\0\0\327\1\0\0taberk\0*\0Thomas "..., > 62) = 62 <0.000017> > lstat("/var/spool/mail/taha", {st_mode=S_IFREG|0600, st_size=293648143, > ...}) = 0 <0.000018> > lgetxattr("/var/spool/mail/taha", "security.selinux", > "unconfined_u:object_r:file_t:s0", 255) = 32 <0.000022> > lstat("/var/spool/mail/taha", {st_mode=S_IFREG|0600, st_size=293648143, > ...}) = 0 <0.000013> > lgetxattr("/var/spool/mail/taha", "system.posix_acl_access", 0x0, 0) = > -1 ENODATA (No data available) <0.000021> > lgetxattr("/var/spool/mail/taha", "system.posix_acl_default", 0x0, 0) = > -1 ENODATA (No data available) <0.000014> > open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 7 <0.000021> > fstat(7, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0 <0.000010> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) > = 0x7fd5ce3c0000 <0.000015> > read(7, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1835 <0.000033> > read(7, "", 4096) = 0 <0.000011> > close(7) = 0 <0.000013> > ===== > > ====== iostat around the time ls -l was run ===== > [root@dovecot2 ~]# iostat -xmz 5 > Linux 2.6.32-358.18.1.el6.x86_64 (dovecot2.sri.utoronto.ca) > 11/12/2013 _x86_64_ (16 CPU) > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.56 0.00 0.17 0.06 0.00 99.21 > > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s > avgrq-sz avgqu-sz await svctm %util > sda 0.01 19.88 0.14 26.86 0.00 0.18 > 14.18 0.02 0.57 0.28 0.75 > sdb 0.23 0.08 1.75 1.98 0.04 0.02 > 30.38 0.01 2.81 1.26 0.47 > sdc 0.00 0.03 13.78 0.71 
1.34 0.05 > 196.35 0.02 1.60 0.77 1.11 > dm-0 0.00 0.00 0.09 2.42 0.00 0.01 > 8.49 0.00 1.18 0.52 0.13 > dm-1 0.00 0.00 0.00 0.00 0.00 0.00 > 8.00 0.00 3.09 0.27 0.00 > dm-2 0.00 0.00 0.06 44.33 0.00 0.17 > 8.14 0.04 0.90 0.14 0.63 > > avg-cpu: %user %nice %system %iowait %steal %idle > 1.31 0.00 0.46 0.10 0.00 98.12 > > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s > avgrq-sz avgqu-sz await svctm %util > sda 0.00 27.40 1.00 65.80 0.02 0.36 > 11.78 0.04 0.61 0.43 2.86 > sdb 0.00 0.00 1.80 0.00 0.03 0.00 > 31.11 0.01 6.44 2.00 0.36 > sdc 0.00 1.00 0.00 4.80 0.00 0.03 > 11.79 0.03 6.88 0.46 0.22 > dm-0 0.00 0.00 0.80 1.20 0.02 0.00 > 24.00 0.01 6.10 6.10 1.22 > dm-2 0.00 0.00 0.00 92.00 0.00 0.36 > 8.00 0.06 0.62 0.19 1.76 > > avg-cpu: %user %nice %system %iowait %steal %idle > 5.76 0.00 1.62 0.59 0.00 92.03 > > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s > avgrq-sz avgqu-sz await svctm %util > sda 0.00 1045.60 0.80 959.40 0.05 7.83 > 16.82 0.44 0.46 0.21 19.98 > sdb 0.00 0.00 1.40 1.20 0.01 0.01 > 9.23 0.00 0.62 0.54 0.14 > sdc 0.00 0.00 0.00 0.60 0.00 0.01 > 18.67 0.00 0.33 0.33 0.02 > dm-0 0.00 0.00 0.80 8.00 0.05 0.03 > 20.00 0.02 2.05 1.41 1.24 > dm-2 0.00 0.00 0.00 1997.00 0.00 7.80 > 8.00 1.24 0.62 0.09 18.86 > > avg-cpu: %user %nice %system %iowait %steal %idle > 4.38 0.00 1.29 0.67 0.00 93.66 > > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s > avgrq-sz avgqu-sz await svctm %util > sda 0.00 1204.40 0.00 1074.60 0.00 8.90 > 16.97 0.41 0.38 0.20 21.04 > sdb 0.20 0.00 2.00 0.00 0.01 0.00 > 13.60 0.01 6.80 4.00 0.80 > dm-0 0.00 0.00 0.00 9.00 0.00 0.04 > 8.00 0.01 1.27 0.42 0.38 > dm-2 0.00 0.00 0.00 2270.00 0.00 8.87 > 8.00 1.33 0.59 0.09 20.74 > > avg-cpu: %user %nice %system %iowait %steal %idle > 4.24 0.00 1.16 0.58 0.00 94.02 > > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s > avgrq-sz avgqu-sz await svctm %util > sda 0.00 1159.20 0.00 985.80 0.00 8.38 > 17.41 0.34 0.35 0.18 17.84 > sdb 0.00 0.00 0.20 1.40 0.00 0.01 > 8.62 0.00 0.75 0.75 0.12 > sdc 0.00 0.00 0.00 0.20 0.00 0.00 > 16.00 0.00 1.00 1.00 0.02 > dm-0 0.00 0.00 0.00 5.00 0.00 0.02 > 8.00 0.00 0.24 0.24 0.12 > dm-2 0.00 0.00 0.00 2140.00 0.00 8.36 > 8.00 1.23 0.58 0.08 17.90 > > avg-cpu: %user %nice %system %iowait %steal %idle > 2.77 0.00 1.14 0.54 0.00 95.55 > > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s > avgrq-sz avgqu-sz await svctm %util > sda 0.00 768.20 0.00 847.40 0.00 6.31 > 15.25 0.36 0.43 0.19 16.40 > sdb 0.00 0.00 0.00 2.00 0.00 0.01 > 7.00 0.00 0.10 0.10 0.02 > sdc 0.00 0.00 0.00 2.00 0.00 0.04 > 37.60 0.00 0.70 0.40 0.08 > dm-0 0.00 0.00 0.00 18.80 0.00 0.07 > 8.00 0.09 4.84 0.27 0.50 > dm-2 0.00 0.00 0.00 1596.80 0.00 6.24 > 8.00 0.84 0.52 0.10 16.22 > ===== > > ===== vmstat ===== > [root@dovecot2 ~]# vmstat 5 > procs -----------memory---------- ---swap-- -----io---- --system-- > -----cpu----- > r b swpd free buff cache si so bi bo in cs us sy > id wa st > 2 0 3476 785664 223836 29329692 0 0 88 16 1 1 1 0 > 99 0 0 > 1 0 3476 783512 223844 29330448 0 0 84 3383 2210 3985 3 1 > 96 0 0 > 0 1 3476 779328 223848 29331208 0 0 20 9254 4530 9894 5 2 > 92 1 0 > 2 0 3476 778616 223856 29331956 0 0 1 9104 4160 10909 4 > 1 94 1 0 > 1 0 3476 777584 223864 29332932 0 0 0 10056 4428 10263 4 > 1 94 1 0 > 0 0 3476 778312 223872 29333044 0 0 2 660 921 1871 1 0 > 99 0 0 > ===== > > Really appreciate your willingness to help and your time spent on this! > > Qing > > On 04/11/2013 1:42 AM, Stan Hoeppner wrote: >> On 11/1/2013 2:11 PM, Qing Chang wrote: >>> I have collected some info following the FAQ guide. It may still >>> not be enough. 
Please let me know. >> You say /var/spool/mail contains 1360 "items". That's ambiguous. You >> mean 1360 user mbox files, yes? >> >>> # time to run ls -l (1360 items): >>> time ls -l /var/spool/mail >>> real 0m10.271s >>> user 0m0.097s >>> sys 0m0.162s >> The combined CPU execution time is apparently 0.259s. Something else is >> adding 10s+ of delay. mbox is notorious for locking problems. I wonder >> if stat() is waiting on release of exclusive write locks being held by >> your MDA during delivery. This would also tend to explain the >> variability you demonstrated, with one run being 10s and another being >> 7s--fewer files locked on the second run. >> >> Run 'ls' with "strace -T" to find out which system call is adding the >> delay. Run perf-lock concurrently to see the locks. >> >> More comments below. >> >> >>> # Oct 29, 2013 >>> # >>> # XFS slowness with ls du etc. >>> >>> kernel version (uname -a) >>> Linux dovecot2.sri.utoronto.ca 2.6.32-358.18.1.el6.x86_64 #1 SMP Fri >>> Aug 2 17:04:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux >>> >>> xfsprogs version (xfs_repair -V) >>> xfs_repair version 3.1.1 >>> >>> number of CPUs: 16 virtual CPUs >>> >>> contents of /proc/meminfo >>> MemTotal: 32878632 kB >>> MemFree: 310260 kB >>> Buffers: 220496 kB >>> Cached: 29971428 kB >>> SwapCached: 212 kB >>> Active: 9726912 kB >>> Inactive: 21180124 kB >>> Active(anon): 385040 kB >>> Inactive(anon): 330260 kB >>> Active(file): 9341872 kB >>> Inactive(file): 20849864 kB >>> Unevictable: 0 kB >>> Mlocked: 0 kB >>> SwapTotal: 6160376 kB >>> SwapFree: 6159896 kB >>> Dirty: 772 kB >>> Writeback: 0 kB >>> AnonPages: 714912 kB >>> Mapped: 52696 kB >>> Shmem: 184 kB >>> Slab: 1045988 kB >>> SReclaimable: 776440 kB >>> SUnreclaim: 269548 kB >>> KernelStack: 8736 kB >>> PageTables: 57172 kB >>> NFS_Unstable: 0 kB >>> Bounce: 0 kB >>> WritebackTmp: 0 kB >>> CommitLimit: 22599692 kB >>> Committed_AS: 1148180 kB >>> VmallocTotal: 34359738367 kB >>> VmallocUsed: 335224 kB >>> VmallocChunk: 34333710552 kB >>> HardwareCorrupted: 0 kB >>> AnonHugePages: 180224 kB >>> HugePages_Total: 0 >>> HugePages_Free: 0 >>> HugePages_Rsvd: 0 >>> HugePages_Surp: 0 >>> Hugepagesize: 2048 kB >>> DirectMap4k: 10240 kB >>> DirectMap2M: 33544192 kB >>> >>> contents of /proc/mounts >>> [root@dovecot2 ~]# cat /proc/mounts >>> rootfs / rootfs rw 0 0 >>> proc /proc proc rw,relatime 0 0 >>> sysfs /sys sysfs rw,seclabel,relatime 0 0 >>> devtmpfs /dev devtmpfs >>> rw,seclabel,relatime,size=16429456k,nr_inodes=4107364,mode=755 0 0 >>> devpts /dev/pts devpts >>> rw,seclabel,relatime,gid=5,mode=620,ptmxmode=000 0 0 >>> tmpfs /dev/shm tmpfs rw,seclabel,relatime 0 0 >>> /dev/mapper/vg_ipa1-lv_root / ext4 >>> rw,seclabel,relatime,barrier=1,data=ordered 0 0 >>> none /selinux selinuxfs rw,relatime 0 0 >>> devtmpfs /dev devtmpfs >>> rw,seclabel,relatime,size=16429456k,nr_inodes=4107364,mode=755 0 0 >>> /proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0 >>> /dev/sda1 /boot ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0 >>> /dev/mapper/vg_ipa1-lv_home /var ext4 >>> rw,seclabel,relatime,barrier=1,data=ordered 0 0 >>> /dev/sdb1 /mailhomes xfs >>> rw,seclabel,relatime,attr2,delaylog,nobarrier,inode64,usrquota 0 0 >>> /dev/sdc1 /var/spool/mail xfs >>> rw,seclabel,relatime,attr2,delaylog,nobarrier,inode64,usrquota 0 0 >>> none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0 >>> sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0 >>> >>> contents of /proc/partitions >>> major minor #blocks name >>> >>> 8 0 83886080 sda >>> 8 1 512000 sda1 >>> 8 
2 83373056 sda2
>>> 8 16 5368709120 sdb
>>> 8 17 5368707072 sdb1
>>> 8 32 2147483648 sdc
>>> 8 33 2147481600 sdc1
>>> 253 0 56729600 dm-0
>>> 253 1 6160384 dm-1
>>> 253 2 20480000 dm-2
>>>
>>> RAID layout (hardware and/or software): Hardware RAID6 IBM Storwize
>>> V7000
>> What we're looking for here is the RAID geometry of the local disk
>> device, logical volume, SAN LUN, etc., underlying the XFS filesystem,
>> i.e. the number of disks in the array, the RAID level, and the stripe
>> unit and stripe width.
>>
>>> LVM configuration: XFS not configured
>>>
>>> type of disks you are using: SAS drives
>>>
>>> write cache status of drives: V7000 has 16 GB of cache
>>>
>>> size of BBWC and mode it is running in: N/A
>> The 16GB of BBWC in the V7000 defaults to write-back mode, or it has
>> been manually enabled. If not, you're playing Russian roulette,
>> because you're mounting with nobarrier. You only disable barriers if
>> you have persistent write cache, either battery or flash based.
>>
>>> xfs_info output on the filesystem in question
>>> meta-data=/dev/sdc1  isize=256    agcount=4, agsize=134217600 blks
>>>          =           sectsz=512   attr=2
>>> data     =           bsize=4096   blocks=536870400, imaxpct=5
>>>          =           sunit=0      swidth=0 blks
>>> naming   =version 2  bsize=4096   ascii-ci=0
>>> log      =internal   bsize=4096   blocks=262143, version=2
>>>          =           sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none       extsz=4096   blocks=0, rtextents=0
>> 2TB seems rather large for a 1360-user mail spool directory, given
>> new mail is apparently moved from INBOX to other folders in
>> /mailhomes.
>>
>> Additionally, given the innate parallelism in a mail workload, 4 AGs
>> is far less than optimal. XFS throughput with a highly parallel
>> workload is limited by the number of allocation groups. This
>> filesystem was apparently created with mkfs.xfs defaults. It should
>> have been created with many more AGs, 16 minimum. Same goes for the
>> /mailhomes filesystem.
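For reference: the AG count is fixed when the filesystem is created, so
acting on the advice above means mkfs'ing the device again and
restoring the spool from backup. A minimal sketch of that invocation,
using the /dev/sdc1 device from the xfs_info above:

# WARNING: destroys all data on the device; back up /var/spool/mail
# first. 16 allocation groups instead of the default 4, so concurrent
# writers (one mbox file per user) can allocate in parallel.
mkfs.xfs -f -d agcount=16 /dev/sdc1

# after mounting, verify the new geometry
xfs_info /var/spool/mail

The same would apply to the /mailhomes filesystem on /dev/sdb1.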
>> >>> dmesg output showing all error messages and stack traces: N/A >>> >>> # time to run ls -l (1360 items): >>> time ls -l /var/spool/mail >>> real 0m10.271s >>> user 0m0.097s >>> sys 0m0.162s >>> >>> # iostat when ls -l is running: >>> [root@dovecot2 ~]# date;iostat -xdmz 5 >>> Tue Oct 29 14:44:28 EDT 2013 >>> Linux 2.6.32-358.18.1.el6.x86_64 (dovecot2.sri.utoronto.ca) >>> 10/29/2013 _x86_64_ (16 CPU) >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.02 17.87 0.17 24.49 0.01 0.17 >>> 14.16 0.01 0.56 0.28 0.68 >>> sdb 0.19 0.08 1.62 1.94 0.04 0.02 >>> 33.47 0.01 2.76 1.23 0.44 >>> sdc 0.00 0.03 11.57 0.69 1.10 0.05 >>> 192.23 0.02 1.65 0.78 0.96 >>> dm-0 0.00 0.00 0.11 2.24 0.00 0.01 >>> 8.70 0.00 1.14 0.51 0.12 >>> dm-1 0.00 0.00 0.00 0.00 0.00 0.00 >>> 8.00 0.00 1.26 0.24 0.00 >>> dm-2 0.00 0.00 0.08 40.11 0.00 0.16 >>> 8.18 0.03 0.86 0.14 0.57 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 155.00 0.00 127.20 0.00 1.10 >>> 17.75 0.06 0.45 0.25 3.22 >>> sdb 0.00 0.20 0.00 11.60 0.00 0.07 >>> 11.81 0.05 4.07 0.17 0.20 >>> sdc 0.00 0.00 0.00 0.40 0.00 0.00 >>> 12.00 0.00 0.50 0.50 0.02 >>> dm-0 0.00 0.00 0.00 1.20 0.00 0.00 >>> 8.00 0.01 4.50 4.50 0.54 >>> dm-2 0.00 0.00 0.00 281.40 0.00 1.10 >>> 8.00 0.19 0.68 0.10 2.74 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 1480.80 0.00 627.20 0.00 8.23 >>> 26.89 0.28 0.45 0.20 12.70 >>> sdb 0.00 0.00 0.40 4.00 0.00 0.03 >>> 18.27 0.00 1.05 1.05 0.46 >>> sdc 0.00 0.00 0.00 1.40 0.00 0.00 >>> 6.71 0.00 0.14 0.14 0.02 >>> dm-0 0.00 0.00 0.00 17.20 0.00 0.07 >>> 8.00 0.02 1.09 0.09 0.16 >>> dm-2 0.00 0.00 0.00 2090.20 0.00 8.16 >>> 8.00 1.65 0.79 0.06 12.52 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 1331.00 0.00 629.80 0.00 7.66 >>> 24.91 0.28 0.44 0.22 13.64 >>> sdb 0.00 0.00 0.00 2.80 0.00 0.02 >>> 14.07 0.00 0.79 0.57 0.16 >>> sdc 0.00 0.00 0.00 1.40 0.00 0.02 >>> 30.14 0.00 0.57 0.57 0.08 >>> dm-0 0.00 0.00 0.00 1.20 0.00 0.00 >>> 8.00 0.00 3.83 3.83 0.46 >>> dm-2 0.00 0.00 0.00 1959.60 0.00 7.65 >>> 8.00 1.43 0.73 0.07 13.20 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 6.60 0.00 0.60 0.00 0.03 >>> 96.00 0.00 1.33 1.33 0.08 >>> sdb 0.00 0.20 0.00 0.80 0.00 0.00 >>> 8.25 0.00 0.75 0.75 0.06 >>> sdc 0.00 0.20 0.00 0.20 0.00 0.00 >>> 16.00 0.00 0.00 0.00 0.00 >>> dm-0 0.00 0.00 0.00 6.20 0.00 0.02 >>> 8.00 0.01 0.97 0.03 0.02 >>> dm-2 0.00 0.00 0.00 1.00 0.00 0.00 >>> 8.00 0.00 3.00 0.60 0.06 >>> >>> # a second run of ls -l: >>> [root@dovecot2 ~]# date;time ls -l /var/spool/mail >>> Fri Nov 1 15:04:43 EDT 2013 >>> real 0m7.814s >>> user 0m0.096s >>> sys 0m0.139s >>> >>> [root@dovecot2 ~]# date;iostat -xmdz 5 >>> Fri Nov 1 15:04:41 EDT 2013 >>> Linux 2.6.32-358.18.1.el6.x86_64 (dovecot2.sri.utoronto.ca) >>> 11/01/2013 _x86_64_ (16 CPU) >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.01 21.20 0.16 28.28 0.00 0.19 >>> 14.27 0.02 0.57 0.27 0.78 >>> sdb 0.22 0.09 1.70 2.19 0.04 0.02 >>> 31.03 0.01 2.83 1.20 0.47 >>> sdc 0.00 0.04 12.63 0.75 1.21 0.05 >>> 192.59 0.02 1.60 0.77 1.04 >>> dm-0 0.00 0.00 0.11 2.46 0.00 0.01 >>> 8.57 0.00 1.17 0.51 0.13 >>> dm-1 0.00 0.00 0.00 0.00 0.00 0.00 >>> 8.00 0.00 2.03 0.21 0.00 >>> dm-2 0.00 0.00 0.07 47.02 0.00 0.18 >>> 8.15 0.04 0.93 0.14 0.65 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s 
wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 730.20 0.00 1251.60 0.00 7.76 >>> 12.69 0.52 0.42 0.21 25.86 >>> sdb 0.00 1.00 0.80 25.40 0.01 0.16 >>> 12.62 0.01 0.37 0.18 0.46 >>> sdc 0.00 0.00 0.00 2.60 0.00 0.03 >>> 20.62 0.00 0.31 0.23 0.06 >>> dm-0 0.00 0.00 0.00 0.20 0.00 0.00 >>> 8.00 0.00 3.00 3.00 0.06 >>> dm-2 0.00 0.00 0.00 1981.00 0.00 7.74 >>> 8.00 1.04 0.52 0.13 25.96 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 1544.20 0.00 2844.60 0.00 17.14 >>> 12.34 1.10 0.39 0.21 58.34 >>> sdb 0.00 0.00 0.00 1.20 0.00 0.01 >>> 9.33 0.00 0.00 0.00 0.00 >>> sdc 0.00 0.00 0.00 0.60 0.00 0.00 >>> 8.00 0.00 3.00 1.00 0.06 >>> dm-0 0.00 0.00 0.00 3.00 0.00 0.01 >>> 8.00 0.00 0.07 0.07 0.02 >>> dm-2 0.00 0.00 0.00 4386.00 0.00 17.13 >>> 8.00 2.19 0.50 0.13 58.46 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 104.80 0.00 212.80 0.00 1.24 >>> 11.94 0.09 0.41 0.21 4.44 >>> sdb 0.00 0.00 0.20 3.00 0.00 0.01 >>> 9.81 0.00 0.50 0.50 0.16 >>> sdc 0.00 0.20 0.00 2.80 0.00 0.02 >>> 16.86 0.01 4.43 0.79 0.22 >>> dm-0 0.00 0.00 0.00 0.40 0.00 0.00 >>> 8.00 0.00 5.50 5.50 0.22 >>> dm-2 0.00 0.00 0.00 317.00 0.00 1.24 >>> 8.00 0.15 0.47 0.13 4.24 >>> >>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >>> avgrq-sz avgqu-sz await svctm %util >>> sda 0.00 6.60 0.00 1.20 0.00 0.03 >>> 52.00 0.00 0.33 0.33 0.04 >>> sdb 0.00 0.20 0.00 1.00 0.00 0.00 >>> 9.60 0.00 0.00 0.00 0.00 >>> sdc 0.00 0.00 0.00 0.20 0.00 0.00 >>> 2.00 0.00 1.00 1.00 0.02 >>> dm-0 0.00 0.00 0.00 4.40 0.00 0.02 >>> 8.00 0.00 0.95 0.05 0.02 >>> dm-2 0.00 0.00 0.00 3.40 0.00 0.01 >>> 8.00 0.00 0.06 0.06 0.02 >>> >>> [root@dovecot2 ~]# date;vmstat 5 >>> Fri Nov 1 15:04:39 EDT 2013 >>> procs -----------memory---------- ---swap-- -----io---- --system-- >>> -----cpu----- >>> r b swpd free buff cache si so bi bo in cs us >>> sy id wa st >>> 0 0 1304 542644 229484 29885592 0 0 80 17 1 1 >>> 1 0 99 0 0 >>> 0 1 1304 542752 229484 29885696 0 0 0 2211 1518 1281 >>> 1 0 98 0 0 >>> 0 1 1304 544736 229484 29885696 0 0 6 17216 7535 7631 >>> 3 2 94 2 0 >>> 0 0 1304 543988 229484 29885696 0 0 0 7403 3782 3889 >>> 1 1 97 1 0 >>> 0 0 1304 543604 229484 29885712 0 0 1 215 399 277 >>> 0 0 100 0 0 -- Stan _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs