continued warnings: Large omap object found

I do not have a large Ceph cluster: only 4 nodes plus a mon/mgr host, with 48 OSDs. I have one data pool and one metadata pool, with a total of about 140TB of usable storage, and maybe 30 or so clients. The rest of my systems connect through a host that is a Ceph client and reshares the filesystem via Samba and NFS-Ganesha. I'm not using RGW anywhere. I'm running the latest stable release of Nautilus (14.2.7) and have had it in production since August 2019. All Ceph nodes and the SMB/NFS host are running CentOS 7 with the latest patches; other clients are a mix of Debian and Ubuntu.

For the last several weeks, I have been getting the warning "Large omap object found" off and on. I've been resolving it by gradually increasing the value of osd_deep_scrub_large_omap_object_key_threshold and then running a deep scrub on the affected pg. I have now increased this threshold to 1000000 and am wondering if I should keep doing this or if there is another problem that needs to be addressed.
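For reference, here is roughly what I have been running each time the warning appears (the pg id comes from whichever pg is named in the warning; 2.26 is just the most recent one):

# ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 1000000
# ceph pg deep-scrub 2.26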

The affected pg has been different most times, but the warnings are all on the same OSD and involve the same MDS object. Here's an excerpt from my current logs to show what I'm seeing:

# zgrep -i "large omap object found" /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log:2020-02-27 06:02:01.761641 osd.40 (osd.40) 1578 : cluster [WRN] Large omap object found. Object: 2:654134d2:::mds0_openfiles.0:head PG: 2.4b2c82a6 (2.26) Key count: 1048576 Size (bytes): 46403355
/var/log/ceph/ceph.log:2020-02-27 16:18:00.328869 osd.40 (osd.40) 1585 : cluster [WRN] Large omap object found. Object: 2:654134d2:::mds0_openfiles.0:head PG: 2.4b2c82a6 (2.26) Key count: 1048559 Size (bytes): 46407183
/var/log/ceph/ceph.log-20200227.gz:2020-02-26 19:56:24.972431 osd.40 (osd.40) 1450 : cluster [WRN] Large omap object found. Object: 2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count: 939236 Size (bytes): 40179994
/var/log/ceph/ceph.log-20200227.gz:2020-02-26 21:14:16.497161 osd.40 (osd.40) 1460 : cluster [WRN] Large omap object found. Object: 2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count: 939232 Size (bytes): 40179796
/var/log/ceph/ceph.log-20200227.gz:2020-02-26 21:15:06.399267 osd.40 (osd.40) 1464 : cluster [WRN] Large omap object found. Object: 2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count: 939231 Size (bytes): 40179756

Unfortunately, older logs have already rotated out, but if memory serves, they contained similar messages. As you can see, the key count continues to increase. Last week I bumped the threshold to 750000 to clear the warning; before that, I had bumped it to 500000. It looks to me like something isn't getting cleaned up the way it's supposed to, but I haven't been using Ceph long enough to figure out what that might be.
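If it helps, I assume I could also watch the raw key count on that object between scrubs with something like the following (the pool name is a placeholder for my actual metadata pool):

# rados -p <metadata-pool> listomapkeys mds0_openfiles.0 | wc -l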

Do I continue to bump the key threshold and not worry about the warnings, or is there something going on that needs to be corrected? At what point is the threshold too high? If the problem is due to a specific client not closing files, is it possible to identify that client and attempt to reset it?
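My guess is that something like the following would list the client sessions on the MDS, including how many caps each one holds (the mds name is a placeholder for my active MDS):

# ceph daemon mds.<mds-name> session ls

and that, if one client turned out to be holding files open, it could be kicked with something like:

# ceph tell mds.<mds-name> client evict id=<client-id>

But I don't know if that's the right approach here, or what I should be looking for in the session output.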

Any advice is welcome. I'm happy to provide additional data if needed.

Thanks.
Seth

--
Seth Galitzer
Systems Coordinator
Computer Science Department
Kansas State University
http://www.cs.ksu.edu/~sgsax
sgsax@xxxxxxx
785-532-7790


