Hello! I'm trying to track down odd write performance from a test of our application's I/O workload. Admittedly I am not very experienced with this domain (file systems, storage, tuning, etc.). I've done a ton of research and I think I've gotten as far as I possibly can without reaching out for help from domain experts.

Here is the setup:

OS:     CentOS 7.3
Kernel: 3.10.0-693.2.2.el7.x86_64
RAID:   LSI RAID controller - RAID 6 with 10 disks - strip size 128 KB (and thus, if I understand correctly, a full stripe size of 1 MB across the 8 data disks; my arithmetic is sketched just before the LVM output below)
LVM:    One PV, VG, and LV built on top of the RAID6. The output is below.

RAID output:

-----------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC Size      Name
-----------------------------------------------------------------
0/0   RAID6 Optl  RW     No      RAWBD -   ON  72.761 TB DATA

----------------------------------------------------------------------------
EID:Slt DID State DG Size     Intf Med SED PI SeSz Model         Sp Type
----------------------------------------------------------------------------
18:0    35  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:1    39  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:2    38  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:3    41  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:4    36  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:5    37  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:6    42  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:7    45  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:8    46  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:9    44  Onln  0  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:10   43  Onln  1  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -
18:11   40  Onln  1  9.094 TB SAS  HDD N   N  512B ST10000NM0096 U  -

Layered on top of this we have LVM, with everything aligned to the stripe size.
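For reference, this is the quick shell arithmetic I have been using to sanity-check the geometry (the 128 KB strip size and 10-disk count come from the controller output above; the assumption that RAID6 leaves 8 data disks per stripe is mine, so please correct me if that part is wrong):

# RAID6 with 10 disks: two disks' worth of capacity goes to parity per stripe.
STRIP_KB=128                          # per-disk strip size from the controller
DISKS=10
DATA_DISKS=$((DISKS - 2))             # RAID6 parity overhead
STRIPE_KB=$((STRIP_KB * DATA_DISKS))  # full stripe width
echo "data disks      : ${DATA_DISKS}"        # 8
echo "full stripe     : ${STRIPE_KB} KiB"     # 1024 KiB = 1 MiB
echo "mkfs.xfs options: -d su=${STRIP_KB}k,sw=${DATA_DISKS}"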
--- Physical volume ---
PV Name               /dev/sda
VG Name               vgdata
PV Size               72.76 TiB / not usable 4.00 MiB
Allocatable           yes (but full)
PE Size               4.00 MiB
Total PE              19074047
Free PE               0
Allocated PE          19074047
PV UUID               GsHPeD-5uRM-SOUz-8eEO-kznf-zTaT-oEIx58

--- Volume group ---
VG Name               vgdata
System ID
Format                lvm2
Metadata Areas        1
Metadata Sequence No  4
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                1
Open LV               1
Max PV                0
Cur PV                1
Act PV                1
VG Size               72.76 TiB
PE Size               4.00 MiB
Total PE              19074047
Alloc PE / Size       19074047 / 72.76 TiB
Free  PE / Size       0 / 0
VG UUID               PKFh9X-3gTb-0GZO-vLAc-vdPI-lV6W-aoNDDU

--- Logical volume ---
LV Path                /dev/vgdata/lvdata
LV Name                lvdata
VG Name                vgdata
LV UUID                esOGWf-jV89-7euY-WV3h-MZ2p-uqmt-qXkC5F
LV Write Access        read/write
LV Creation host, time XXXXXXX, 2017-10-25 14:34:06 +0000
LV Status              available
# open                 1
LV Size                72.76 TiB
Current LE             19074047
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:0

Here is the output when the XFS filesystem is created:

mkfs.xfs -d su=128k,sw=8 -L DATA -f /dev/mapper/vgdata-lvdata
mkfs.xfs: Specified data stripe width 2048 is not the same as the volume stripe width 512
(KEA: from Googling around, I'm not sure whether this warning is an actual problem or not.)
meta-data=/dev/mapper/vgdata-lvdata isize=512    agcount=73, agsize=268435424 blks
         =                          sectsz=4096  attr=2, projid32bit=1
         =                          crc=1        finobt=0, sparse=0
data     =                          bsize=4096   blocks=19531824128, imaxpct=1
         =                          sunit=32     swidth=256 blks
naming   =version 2                 bsize=4096   ascii-ci=0 ftype=1
log      =internal log              bsize=4096   blocks=521728, version=2
         =                          sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                      extsz=4096   blocks=0, rtextents=0

And finally the relevant options from the mount output:

(rw,relatime,attr2,inode64,sunit=256,swidth=2048,noquota)

Our application writes to a nested directory hierarchy as follows:

<THREAD>/<DATE>/<HOUR>/<MINUTE>/<DATE>-<HOUR><MINUTE><SECOND>.data

The writes to a file are all approximately 1 MiB in size. Depending on our data rate, there are potentially 100-200 writes to a given file, and a new file is created each second. There are 4 such threads scheduling these writes, and the writes to a given file are scheduled sequentially. The application can do AIO or blocking I/O; both see the same problem.

To determine whether our application is fully to blame, we are running a performance test that simulates the writes with "dd if=/dev/zero of=<path/like/above>.data bs=1024K". We let each dd process write as much as it possibly can for a second before stopping it and moving on to the next file. (A simplified sketch of this loop is included just before my other notes below.)

What we're seeing is that write performance starts off around 1400-1500 MB/s and decreases approximately linearly down to ~600 MB/s after ~18 minutes, before suddenly shooting back up to 1400-1500 MB/s. This cycle continues, with the crests and troughs slowly decreasing as the disk fills up (which I believe is expected).

We tried running it with 2 threads. We saw the same degradation-and-recovery profile, except it took ~36 minutes to bottom out and recover. Likewise, with only 1 thread it took ~72 minutes. In all cases the pattern continued until the disk was full.

We thought perhaps the directory structure was problematic, so we also tried a flatter layout, again with one file per second:

<THREAD>/<DATE>/<DATE>-<HOUR><MINUTE><SECOND>.data

This time it took about 12.2 hours for the performance to bottom out before instantly shooting back up again.
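In case it helps, here is a simplified sketch of what each simulated writer loop does. This is an approximation of our test harness rather than the exact script; the /data base path, the THREAD value, and the date formats are just placeholders for illustration.

#!/bin/bash
# One simulated writer: every second, start a dd writing 1 MiB blocks of zeros
# into a new per-second file, let it run for about a second, then stop it and
# move on to the next file.
THREAD=0
BASE=/data/${THREAD}
while true; do
    d=$(date +%Y%m%d); h=$(date +%H); m=$(date +%M); s=$(date +%S)
    dir="${BASE}/${d}/${h}/${m}"
    mkdir -p "${dir}"
    dd if=/dev/zero of="${dir}/${d}-${h}${m}${s}.data" bs=1024k &
    pid=$!
    sleep 1
    kill "${pid}" 2>/dev/null
    wait "${pid}" 2>/dev/null
done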
Some other notes:

- I ran the same test with an Adaptec RAID controller as well, and it gave the same performance profile.
- I ran the same test with an ext4 filesystem just to see whether it showed the same profile. It did not: performance degraded slowly over time, with a quick dropoff as the disk reached maximum capacity. I expected ext4 to behave differently, but wanted to confirm it.

I've been trying to read about the internals to correlate this performance profile with something, but since I'm so new to this it's tough to filter out the noise and key in on something meaningful. If there is any information that I haven't provided, please let me know and I'll happily provide it.

Thanks!!!

-Kyle Ames