Hi Dave,
One final concept about XFS is still not clear to me; please see below.
On 09/09/2016 04:44 AM, Dave Chinner wrote:
On Thu, Sep 08, 2016 at 06:07:45PM +0800, Lin Feng wrote:
Hi Dave,
Thank you for your fast reply; please see below.
On 09/08/2016 05:22 AM, Dave Chinner wrote:
On Wed, Sep 07, 2016 at 06:36:19PM +0800, Lin Feng wrote:
Hi all nice xfs folks,
I'm new to XFS, and I have run into the same issue described in the
following thread:
http://oss.sgi.com/archives/xfs/2014-04/msg00058.html
On my box (running a cephfs OSD on XFS, kernel 2.6.32-358) I summed
every memory counter I could find, but nearly 26GB of memory seemed
to have gone missing. It came back after I ran echo 2 >
/proc/sys/vm/drop_caches, so it looks like this memory is reclaimed
via the slab path.
It isn't "reclaimed by slab". The XFS metadata buffer cache is
reclaimed by memory shrinkers, which are for reclaiming objects
from caches that aren't the page cache. "echo 2 >
/proc/sys/vm/drop_caches" runs the memory shrinkers rather than page
cache reclaim. Many slab caches are backed by memory shrinkers,
which is why it is thought that "2" is "slab reclaim"....
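For anyone else reading this in the archive, here is how I now understand
the three drop_caches values (my own summary of Documentation/sysctl/vm.txt,
so please correct me if I got it wrong):

echo 1 > /proc/sys/vm/drop_caches   # drop clean page cache only
echo 2 > /proc/sys/vm/drop_caches   # run the shrinkers: dentries, inodes and
                                    # other shrinker-backed caches such as the
                                    # XFS metadata buffer cache
echo 3 > /proc/sys/vm/drop_caches   # both of the above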
And according to what David said in a reply on the list:
..
That's where your memory is - in metadata buffers. The xfs_buf slab
entries are just the handles - the metadata pages in the buffers
usually take much more space and it's not accounted to the slab
cache nor the page cache.
That's exactly the case.
Minimum / Average / Maximum Object : 0.02K / 0.33K / 4096.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
4383036 4383014 99% 1.00K 1095759 4 4383036K xfs_inode
5394610 5394544 99% 0.38K 539461 10 2157844K xfs_buf
So, you have *5.4 million* active metadata buffers. Each buffer will
hold 1 or 2 4k pages on your kernel, so simple math says 4M * 4k +
1.4M * 8k = 26G. There's no missing counter here....
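(Working through those numbers myself as a sanity check: if roughly 4M of
the 5.4M buffers hold a single 4k page and the remaining ~1.4M hold two
pages each, that is about 16G + 11G, i.e. roughly 27G, which is indeed in
the same ballpark as the ~26GB that looked "missing" on my box.)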
Do xattrs contribute to such metadata buffers, or is there something else?
xattrs are metadata, so if they don't fit in line in the inode
(typical for ceph because it uses xattrs larger than 256 bytes) then
they are held in external blocks which are cached in the buffer
cache.
So the 'buffer cache' you mean here is the set of pages attached to the
xfs_buf structs, used to hold the xattrs when the inode's inline space
overflows, and not the 'buff/cache' shown by the free command; those pages
won't be reflected in free's cache field, right?
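(For the record, the rough check I have been doing on my box is to compare
the accounted counters against the XFS slab caches; nothing here is
authoritative, it is just how I eyeball the gap:

grep -E 'MemTotal|MemFree|^Buffers|^Cached|^Slab' /proc/meminfo
slabtop -o | grep -E 'xfs_buf|xfs_inode'

The pages attached to the xfs_buf handles show up in none of Buffers, Cached
or Slab, which is why my sum always came up ~26GB short.)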
After consulting my teammate, I learned that in our case the small files
(and there are a lot of them, see below) always carry xattrs.
Which means that if you have 4.4M cached inodes, you probably have
~4.4M xattr metadata buffers in cache for those inodes, too.
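(In case it is useful context: the way I check how large the xattrs on an
individual object are is something like the following, where the object file
name is just a placeholder for any file under the OSD's current/ directory:

getfattr -d -m - -e hex /data/osd/osd.67/current/<some-object-file>

That dumps every attribute name and value, which should show whether they
exceed the 256-byte inline limit you mentioned above.)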
Another thing: do we need to export such a counter somewhere, or do we have
to do this computation by hand every time to figure out whether we are
leaking memory?
More importantly, it seems this memory has a low priority for the memory
reclaim mechanism; is that because most of the slab objects are active?
"active" slab objects simply mean they are allocated. It does not
mean they are cached or imply anything else about the object's life
cycle.
Sorry, I had misunderstood what 'active' means for slab objects; thanks for
the explanation.
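As a side note for my own reference, the same counts can also be read
directly from /proc/slabinfo (as root):

head -2 /proc/slabinfo
grep -E 'xfs_buf|xfs_inode' /proc/slabinfo

The first two columns there are <active_objs> and <num_objs>, and 'active'
just means currently allocated, exactly as you describe.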
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
4383036 4383014 99% 1.00K 1095759 4 4383036K xfs_inode
5394610 5394544 99% 0.38K 539461 10 2157844K xfs_buf
In fact XFS eats a lot of my RAM, and I would never know where it goes
without diving into the XFS source; at least I'm only the second extreme
user ;-)
Obviously your workload is doing something extremely metadata
intensive to have a cache footprint like this - you have more cached
buffers than inodes, dentries, etc. That in itself is very unusual -
can you describe what is stored on that filesystem and how large the
attributes being stored in each inode are?
The filesystem usage pattern is that the ceph-osd daemon intensively
pulls/synchronizes/updates files from the other OSDs when the server comes
up. In our case the cephfs OSD stores a lot of small pictures in the
filesystem, and from some simple analysis there are nearly 3,000,000 files
on each disk, and there are 10 such disks.
[root@wzdx49 osd.670]# find current -type f -size -512k | wc -l
2668769
[root@wzdx49 ~]# find /data/osd/osd.67 -type f | wc -l
2682891
[root@wzdx49 ~]# find /data/osd/osd.67 -type d | wc -l
109760
Yup, that's a pretty good indication that you have a high metadata
to data ratio in each filesystem, and that ceph is accessing the
metadata more intensively than the data. The fact that the metadata
buffer count roughly matches the cached inode count tells me that
the memory reclaim code is being fairly balanced about what it
reclaims under memory pressure - I think the problem here is more
that you didn't know where the memory was being used than anything
else....
Yes, that's exactly why I sent this mail.
Again, thanks for your detailed explanation.
Best regards,
linfeng
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs