Hi,

I am running OpenStack Swift on a single server with 8 disks. All 8 disks are formatted with the default XFS parameters. Each disk has a capacity of 3TB. The machine has 64GB of RAM.

Here is what OpenStack Swift does (a minimal Python sketch of this sequence is included at the end of this mail):

1. The filesystem is mounted at /srv/node/r0.
2. Creates a temp file: /srv/node/r0/tmp/tmp_sdfsdf
3. Writes to this file: 4 writes of 64K each, then an fsync and a close. The final size of the file is 256K.
4. Creates the path /srv/node/r0/objects/1004/eef/deadbeef. The directory /srv/node/r0/objects/1004 already existed, so it only needs to create "eef" and "deadbeef". Before creating each directory, it verifies that the directory does not exist.
5. Renames the file /srv/node/r0/tmp/tmp_sdfsdf to /srv/node/r0/objects/1004/eef/deadbeef/foo.data.
6. fsyncs /srv/node/r0/objects/1004/eef/deadbeef/foo.data.
7. Does a directory listing of /srv/node/r0/objects/1004/eef.
8. Opens the file /srv/node/r0/objects/1004/hashes.pkl.
9. Writes to the file /srv/node/r0/objects/1004/hashes.pkl.
10. Closes the file /srv/node/r0/objects/1004/hashes.pkl.

Writes get sharded across ~1024 directories: essentially, there are directories 0000 through 1024 under /srv/node/r0/objects/, and 1004 in the example above is one of them.

This works great when the filesystem is newly formatted and mounted. However, as more and more data gets written to the system, the above sequence of events progressively gets slower.

* We observe that the time for fsync remains pretty much constant throughout.
* What seems to be causing the performance to nosedive is that inode and dentry caching does not seem to be working.
* As an experiment, we set vfs_cache_pressure to 0 so that inode and dentry cache entries would never be reclaimed (see the note at the end of this mail). That does not seem to help.
* We see openat() calls taking close to 1 second.

Any ideas what might be causing this behavior? Are there other parameters, specifically XFS parameters, that can be tuned for this workload? The sequence of events above is the typical workload, at high concurrency.

Here are the answers to the other questions requested by the XFS wiki page:

* kernel version (uname -a)
  3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
* xfsprogs version
  xfs_repair version 3.1.7
* number of CPUs
  16
* contents of /proc/meminfo
  See attached file mem_info.
* contents of /proc/mounts

/dev/mapper/troll_data_vg_23578621012a_1-troll_data_lv_1 /srv/node/r0 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
/dev/mapper/troll_data_vg_23578621012a_2-troll_data_lv_2 /srv/node/r1 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
/dev/mapper/troll_data_vg_23578621012a_3-troll_data_lv_3 /srv/node/r2 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
/dev/mapper/troll_data_vg_23578621012a_4-troll_data_lv_4 /srv/node/r3 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
/dev/mapper/troll_data_vg_23578621012a_5-troll_data_lv_5 /srv/node/r4 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
/dev/mapper/troll_data_vg_23578621012a_6-troll_data_lv_6 /srv/node/r5 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
/dev/mapper/troll_data_vg_23578621012a_7-troll_data_lv_7 /srv/node/r6 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0
/dev/mapper/troll_data_vg_23578621012a_8-troll_data_lv_8 /srv/node/r7 xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota 0 0

* contents of /proc/partitions
  See attached file partitions_info.
* RAID layout (hardware and/or software)
  No RAID.
* LVM configuration
  See attached file lvm_info (obtained with lvdisplay).
* type of disks you are using
  sdm    disk  2.7T  ST3000NXCLAR3000
  sdm1   part    1M
  sdm2   part  2.7T
  dm-1   lvm   2.7T
* write cache status of drives
  Drives have no write cache.
* size of BBWC and mode it is running in
  No BBWC.
* xfs_info output on the filesystem in question

meta-data=/dev/mapper/troll_data_vg_23578621012a_8-troll_data_lv_8 isize=256    agcount=4, agsize=183141376 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=732565504, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=357698, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

* dmesg output showing all error messages and stack traces
  No errors.
* iostat and vmstat output
  See the attached files iostat_log and vmstat_log.

-Shri
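P.S. For clarity, here is a minimal sketch of the per-object write sequence above, written in Python since Swift itself is Python. It is not the actual Swift code: the paths, the temp file name, the partition/hash values, and the suffix-directory rule (last three hex characters of the object hash, inferred from the objects/1004/eef/deadbeef example) are all illustrative.

import os

DEVICE_ROOT = "/srv/node/r0"
CHUNK = 64 * 1024  # 4 writes of 64K -> 256K object

def suffix_dir(name_hash):
    # Inferred from the example path objects/1004/eef/deadbeef: the suffix
    # directory is the last three hex characters of the object hash.
    return name_hash[-3:]

def write_object(partition, name_hash, data_chunks, filename="foo.data"):
    obj_dir = os.path.join(DEVICE_ROOT, "objects", str(partition),
                           suffix_dir(name_hash), name_hash)

    # Steps 2-3: write the object into a temp file, fsync, close.
    tmp_path = os.path.join(DEVICE_ROOT, "tmp", "tmp_sdfsdf")
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for chunk in data_chunks:
            os.write(fd, chunk)
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 4: create objects/<partition>/<suffix>/<hash>, checking for
    # existence before each mkdir (objects/<partition> already exists).
    path = os.path.join(DEVICE_ROOT, "objects", str(partition))
    for component in (suffix_dir(name_hash), name_hash):
        path = os.path.join(path, component)
        if not os.path.exists(path):
            os.mkdir(path)

    # Steps 5-6: rename the temp file into place and fsync the final file.
    final_path = os.path.join(obj_dir, filename)
    os.rename(tmp_path, final_path)
    fd = os.open(final_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 7: directory listing of the suffix directory.
    os.listdir(os.path.dirname(obj_dir))

    # Steps 8-10: open, write and close the per-partition hashes.pkl.
    hashes_path = os.path.join(DEVICE_ROOT, "objects", str(partition),
                               "hashes.pkl")
    with open(hashes_path, "wb") as f:
        f.write(b"...")  # placeholder for the pickled per-suffix hashes

if __name__ == "__main__":
    write_object(1004, "deadbeef", [b"x" * CHUNK] * 4)

At high concurrency this sequence runs from many workers in parallel, with writes spread across the ~1024 partition directories as described above.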
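P.P.S. The vfs_cache_pressure experiment mentioned above amounted to setting vm.vfs_cache_pressure to 0 for the duration of the test; shown here as a write to the /proc file (equivalent to "sysctl -w vm.vfs_cache_pressure=0"):

# Stop the kernel from reclaiming dentry/inode cache entries
# (done only as an experiment).
with open("/proc/sys/vm/vfs_cache_pressure", "w") as f:
    f.write("0\n")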
Attachments: mem_info, partitions_info, iostat_log, vmstat_log (binary data)