Re: Inode and dentry cache behavior

Shrinand Javadekar <shrinand@xxxxxxxxxxxxxx> · Wed, 29 Apr 2015 10:46:43 -0700

Awesome!! Thanks Dave!

On Tue, Apr 28, 2015 at 6:30 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Apr 28, 2015 at 05:17:14PM -0700, Shrinand Javadekar wrote:
>> I will look at the hardware. But, I think, there's also a possible
>> software problem here.
>>
>> If you look at the sequence of events, first a tmp file is created in
>> <mount-point>/tmp/tmp_blah. After a few writes, this file is renamed
>> to a different path in the filesystem.
>>
>> rename(<mount-point>/tmp/tmp_blah,
>> <mount-point>/objects/1004/eef/deadbeef/foo.data).
>>
>> The "tmp" directory above is created only once. Temp files get created
>> inside it and then get renamed. We wondered if this causes disk layout
>> issues resulting in slower performance. And then, we stumbled upon
>> this[1]: someone complaining about the exact same problem.
>
> That's pretty braindead behaviour. That will screw performance and
> locality on any filesystem you do that on, not to mention age it
> extremely quickly.
>
> In the case of XFS, it forces allocation of all the inodes in one
> AG, rather than allowing XFS to distribute and balance inode
> allocation around the filesystem and keeping good
> directory/inode/data locality for all your data.
>
> The best way to do this is to create your tmp files using O_TMPFILE,
> with the source directory being the destination directory, and then
> use linkat() rather than rename() to make them visible in the
> directory.
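
A minimal sketch of the O_TMPFILE + linkat() approach described above, in
Python. The names write_object, _tmpfile_and_link and _write_all are
hypothetical, as is the fallback: where O_TMPFILE or /proc is unavailable, the
sketch creates a named temp file inside the destination directory and renames
it there, which still avoids the shared "tmp" directory. This illustrates the
idea under those assumptions and is not a drop-in implementation.

```python
import errno
import os
import tempfile

# errno values open(2) documents for "O_TMPFILE not available here"
# (filesystem without support, or a kernel that predates the flag).
_NO_TMPFILE = (errno.EOPNOTSUPP, errno.EISDIR)


def _write_all(fd, data):
    """os.write() may write fewer bytes than asked, so loop until done."""
    view = memoryview(data)
    while len(view) > 0:
        view = view[os.write(fd, view):]


def _tmpfile_and_link(dest_dir, dir_fd, name, data):
    """Return True on success, False if O_TMPFILE or /proc is unavailable."""
    o_tmpfile = getattr(os, "O_TMPFILE", None)  # Linux-only constant
    if o_tmpfile is None:
        return False
    try:
        # Unnamed file created against the destination directory itself,
        # rather than against a shared tmp directory.
        fd = os.open(dest_dir, o_tmpfile | os.O_WRONLY, 0o644)
    except OSError as exc:
        if exc.errno in _NO_TMPFILE:
            return False
        raise
    try:
        _write_all(fd, data)
        os.fsync(fd)
        try:
            # With dst_dir_fd set, CPython issues linkat() with
            # AT_SYMLINK_FOLLOW, which is what the /proc path needs.
            os.link(f"/proc/self/fd/{fd}", name, dst_dir_fd=dir_fd)
        except FileNotFoundError:
            return False  # most likely /proc is not mounted
    finally:
        os.close(fd)
    return True


def write_object(dest_dir, name, data):
    """Write data to dest_dir/name; return which method was used."""
    os.makedirs(dest_dir, exist_ok=True)
    dir_fd = os.open(dest_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        if _tmpfile_and_link(dest_dir, dir_fd, name, data):
            return "O_TMPFILE+linkat"
        # Assumed fallback: named temp file in the destination directory.
        fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp_")
        try:
            _write_all(fd, data)
            os.fsync(fd)
        finally:
            os.close(fd)
        os.rename(tmp_path, os.path.join(dest_dir, name))
        return "mkstemp+rename"
    finally:
        os.close(dir_fd)
```

One behavioural difference to keep in mind: linkat() fails with EEXIST if the
target name already exists, whereas rename() silently replaces it, so callers
that rely on overwrite semantics would need to handle that case themselves.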
>
>> One quick way to validate this was to delete the "tmp" directory
>> periodically and see whether the numbers improve. And they do. Over 15
>> runs of writing 80K objects each, our performance was dropping from
>> ~100MB/s to 30MB/s. When we deleted the tmp directory after each run,
>> performance only dropped from ~100MB/s to 80MB/s.
>>
>> The explanation in the link below says that when xfs does not find
>> free extents in an existing allocation group, it frees up the extents
>> by copying data from existing extents to their target allocation group
>> (which happens because of renames). Is that explanation still valid?
>
> No, it wasn't correct even back then.  XFS does not move data around
> once it has been allocated and is on disk. Indeed, rename() does not
> move data, it only modifies directory entries.
>
> The problem is that the locality of a new inode is determined by the
> parent inode, and so if all new inodes are created in the same
> directory, then they are all created in the same AG. If you have
> millions of inodes, then you have a btree with millions of inodes in
> it in one AG, and pretty much none in any other AG. Hence inode
> allocation, which has to search for free inodes in a btree
> containing millions of records, can be extremely IO and CPU
> intensive and therefore slow. And the larger the number of inodes,
> the slower it will go....
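
One way to see the skew described above is to count how many inodes under a
directory tree fall into each AG. The sketch below assumes the standard XFS
inode number layout, where the AG number is the inode number shifted right by
(agblklog + inopblog) bits. Both values come from the superblock and should be
readable with something like `xfs_db -r -c "sb 0" -c "p agblklog inopblog"
<device>`; treat that invocation, and the function names ag_number and
inodes_per_ag, as assumptions to verify on your own system.

```python
import os
from collections import Counter


def ag_number(ino, agblklog, inopblog):
    """AG number encoded in the high bits of an XFS inode number."""
    return ino >> (agblklog + inopblog)


def inodes_per_ag(root, agblklog, inopblog):
    """Count the files and directories below root, grouped by AG number."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for entry in dirnames + filenames:
            # lstat() so symlinks are counted as their own inodes.
            st = os.lstat(os.path.join(dirpath, entry))
            counts[ag_number(st.st_ino, agblklog, inopblog)] += 1
    return counts
```

If nearly every inode lands in a single AG, the workload is hitting the
single-directory allocation pattern described above.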
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs