Hi,
can you share some more details like 'ceph df' and 'ceph osd df'? I
don't have much advice yet, but to see all entries in your meta
pool you need to add the --all flag, because those objects are stored
in namespaces:
rados -p default.rgw.meta ls --all
That pool contains user and bucket information (example):
# rados -p default.rgw.meta ls --all
users.uid admin.buckets
users.keys c0fba3ea7d9c4321b5205752c85baa85
users.uid admin
users.keys JBWPRAPP1AQG471AMGC4
users.uid e434b82737cf4138b899c0785b49112d.buckets
users.uid e434b82737cf4138b899c0785b49112d
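You can also check the flagged object directly by counting its omap
keys. Assuming the namespace and object name from your warning
(users.uid and callrecordings$callrecordings_rw.buckets), something
like this should roughly match the reported key count:

rados -p default.rgw.meta -N users.uid listomapkeys \
  'callrecordings$callrecordings_rw.buckets' | wc -l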
Quoting Mark Johnson <markj@xxxxxxxxx>:
I've been going round and round in circles trying to work this one
out but I'm getting nowhere. We're running a 4-node Quincy cluster
(17.2.6) which recently reported the following:
ceph.log-20230729.gz:2023-07-28T08:31:42.390003+0000 osd.26 (osd.26)
13834 : cluster [WRN] Large omap object found. Object:
5:6c65dd84:users.uid::callrecordings$callrecordings_rw.buckets:head
PG: 5.21bba636 (5.16) Key count: 378454 Size (bytes): 75565579
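(The leading "5" in the PG ID is the pool ID; which pool that maps
to can be confirmed with something like:

ceph osd pool ls detail | grep '^pool 5 ')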
This happened a week or so ago (though the key count was only just
over the 200,000 threshold on that occasion) and after much searching
around, I found an article that suggested a deep scrub on the pg
would likely resolve the issue, so I forced a deep scrub and shortly
after, the warning cleared. I came into the office today to discover
the above. It's on the same PG as before, which is in the
default.rgw.meta pool. This time, after forcing a deep-scrub on
that PG, nothing changed. I did it a second time just to be sure
but got the same result.
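(For reference, forcing a deep scrub on a single PG is just:

ceph pg deep-scrub 5.16)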
I keep finding a SUSE article that simply suggests increasing the
threshold to the previous default of 2,000,000, but other articles I
read say it was lowered for a reason and that by the time it hits
that figure, it's too late so I don't want to just mask it. Problem
is that I don't really understand it. I found a thread here from a
bit over two years ago but their issue was in the
default.rgw.buckets.index pool. A step in the solution was to list
out the problematic object ID and check the objects per shard;
however, if I issue the command "rados -p default.rgw.meta ls" it
returns nothing. I get a big list from "rados -p
default.rgw.buckets.index ls" but nothing from the first pool. I
think it may be because the meta pool isn't indexed, based on
something I read, but I really don't know what I'm talking about tbh.
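(In case anyone wants the knob the SUSE article seems to be referring
to, I believe it's osd_deep_scrub_large_omap_object_key_threshold,
e.g.:

ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 2000000

but as above, I'd rather understand the problem than just raise it.)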
I don't know if this is helpful, but if I list out all the PGs for
that pool, there are 32 PGs and 5.16 shows 80186950 bytes and 401505
keys, while PG 5.c has 75298 bytes and 384 keys. The remaining 30
PGs show zero bytes and zero keys. I'm really not sure how to
troubleshoot and resolve from here. For the record, dynamic
resharding is enabled, in the sense that no resharding options have
been set in the config and enabled is the default.
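(Per-PG figures like these can be seen in the omap columns of
something like:

ceph pg ls-by-pool default.rgw.meta

which lists OMAP_BYTES* and OMAP_KEYS* for each PG.)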
Based on the SUSE article I mentioned, which also references the
default.rgw.meta pool, I'm gathering our issue is because we have so
many buckets that are all owned by the one user and the solution is
either:
* delete unused buckets
* create multiple users and spread buckets evenly across all users
(not something we can do)
* increase the threshold to stop the warning
The problem is that I'm having trouble verifying that this is the
issue. I've tried dumping bucket stats to a file (radosgw-admin
bucket stats > bucket_stats.txt) but after three hours it is still
running with no output.
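(A lighter-weight check than full bucket stats might be to list just
the bucket names for the suspected owner; assuming the user from the
warning is callrecordings$callrecordings_rw, something like:

radosgw-admin bucket list --uid='callrecordings$callrecordings_rw' | wc -l

should give a rough bucket count without computing per-bucket stats.)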
Thanks for your time,
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx