Awesome!! Thanks Dave!

On Tue, Apr 28, 2015 at 6:30 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Apr 28, 2015 at 05:17:14PM -0700, Shrinand Javadekar wrote:
>> I will look at the hardware. But, I think, there's also a possible
>> software problem here.
>>
>> If you look at the sequence of events, first a tmp file is created in
>> <mount-point>/tmp/tmp_blah. After a few writes, this file is renamed
>> to a different path in the filesystem.
>>
>> rename(<mount-point>/tmp/tmp_blah,
>>        <mount-point>/objects/1004/eef/deadbeef/foo.data).
>>
>> The "tmp" directory above is created only once. Temp files get created
>> inside it and then get renamed. We wondered if this causes disk layout
>> issues resulting in slower performance. And then, we stumbled upon
>> this[1]. Someone complaining about the exact same problem.
>
> That's pretty braindead behaviour. That will screw performance and
> locality on any filesystem you do that on, not to mention age it
> extremely quickly.
>
> In the case of XFS, it forces allocation of all the inodes in one
> AG, rather than allowing XFS to distribute and balance inode
> allocation around the filesystem and keeping good
> directory/inode/data locality for all your data.
>
> Best way to do this is to create your tmp files using O_TMPFILE,
> with the source directory being the destination directory and then
> use linkat() rather than rename to make them visible in the
> directory.
>
>> One quick way to validate this was to delete the "tmp" directory
>> periodically and see whether the numbers improve. And they do. With
>> 15 runs of writing 80K objects in each run, our performance was
>> dropping from ~100MB/s to 30MB/s. With deleting the tmp directory
>> after each run, we saw the performance only drop from ~100MB/s to
>> 80MB/s.
>>
>> The explanation in the link below says that when xfs does not find
>> free extents in an existing allocation group, it frees up the extents
>> by copying data from existing extents to their target allocation group
>> (which happens because of renames). Is that explanation still valid?
>
> No, it wasn't correct even back then. XFS does not move data around
> once it has been allocated and is on disk. Indeed, rename() does not
> move data, it only modifies directory entries.
>
> The problem is that the locality of a new inode is determined by the
> parent inode, and so if all new inodes are created in the same
> directory, then they are all created in the same AG. If you have
> millions of inodes, then you have a btree with millions of inodes in
> it in one AG, and pretty much none in any other AG. Hence inode
> allocation, which has to search for free inodes in a btree
> containing millions of records, can be extremely IO and CPU
> intensive and therefore slow. And the larger the number of inodes,
> the slower it will go....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
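
For anyone reading this in the archive, here is a rough sketch of what
the O_TMPFILE + linkat() suggestion above might look like in C. It is
not code from this thread. The helper names (publish_object and
friends) are made up for illustration, and the fallback path (a
mkstemp() file created in the destination directory, then link() and
unlink()) is my own assumption for kernels or filesystems without
O_TMPFILE support. The /proc/self/fd form of linkat() is the one
documented in open(2) and needs /proc to be mounted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes O_TMPFILE in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const void *data, size_t len)
{
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Unnamed file created in the destination directory, so the inode is
 * allocated with that directory as its parent; linkat() then gives it
 * a name. Hypothetical helper, not from the thread. */
static int publish_via_tmpfile(const char *dir, const char *dst,
                               const void *data, size_t len)
{
    char proc_path[64];
    int saved, rc = -1;
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    snprintf(proc_path, sizeof proc_path, "/proc/self/fd/%d", fd);
    if (write_all(fd, data, len) == 0 && fsync(fd) == 0 &&
        linkat(AT_FDCWD, proc_path, AT_FDCWD, dst, AT_SYMLINK_FOLLOW) == 0)
        rc = 0;
    saved = errno;             /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return rc;
}

/* Assumed fallback: a named temp file, still created in the
 * destination directory rather than in one shared tmp directory. */
static int publish_via_link(const char *dir, const char *dst,
                            const void *data, size_t len)
{
    char tmp[PATH_MAX];
    int saved, fd, rc = -1;

    if (snprintf(tmp, sizeof tmp, "%s/.tmp_XXXXXX", dir) >= (int)sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }
    fd = mkstemp(tmp);
    if (fd < 0)
        return -1;
    if (write_all(fd, data, len) == 0 && fsync(fd) == 0 &&
        link(tmp, dst) == 0)
        rc = 0;
    saved = errno;
    close(fd);
    unlink(tmp);               /* drop the temporary name either way */
    errno = saved;
    return rc;
}

/* Returns 0 on success, -1 with errno set on failure. */
static int publish_object(const char *dir, const char *name,
                          const void *data, size_t len)
{
    char dst[PATH_MAX];

    if (snprintf(dst, sizeof dst, "%s/%s", dir, name) >= (int)sizeof dst) {
        errno = ENAMETOOLONG;
        return -1;
    }
    if (publish_via_tmpfile(dir, dst, data, len) == 0)
        return 0;
    if (errno == EEXIST)       /* destination name is already taken */
        return -1;
    return publish_via_link(dir, dst, data, len);
}
```

A caller would use something like publish_object("<mount-point>/objects/1004/eef/deadbeef",
"foo.data", buf, len). One difference from the rename() workflow:
linkat() and link() fail with EEXIST if the destination name already
exists, whereas rename() silently replaces it, so an overwrite would
need an explicit unlink() first.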