Re: Resolving Large omap objects in RGW index pool

Tomasz Płaza <tomasz.plaza@xxxxxxxxxx> · Wed, 17 Oct 2018 13:11:32 +0200

    Hi,

    I have a similar issue, and created a simple bash file to delete old
    indexes (it is PoC and have not been tested on production):

    for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' |
    sort`

    do

      actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r
    '.id'`

      for instance in `radosgw-admin metadata list bucket.instance | jq
    -r '.[]' | grep ${bucket}: | cut -d ':' -f 2`

      do

        if [ "$actual_id" != "$instance" ]

        then

          radosgw-admin bi purge --bucket=${bucket}
    --bucket-id=${instance}

          radosgw-admin metadata rm
    bucket.instance:${bucket}:${instance}

        fi

      done

    done

    I find it more readable than mentioned one liner. Any sugestions on
    this topic are greatly appreciated.

    Tom

      Hi,

        Having spent some time on the below issue, here are the
          steps I took to resolve the "Large omap objects" warning. 
          Hopefully this will help others who find themselves in this
          situation.

        I got the object ID and OSD ID implicated from the ceph
          cluster logfile on the mon.  I then proceeded to the
          implicated host containing the OSD, and extracted the
          implicated PG by running the following, and looking at which
          PG had started and completed a deep-scrub around the warning
          being logged:

        grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep
          '(Large omap|deep-scrub)'

        If the bucket had not been sharded sufficiently (IE the
          cluster log showed a "Key Count" or "Size" over the
          thresholds), I ran through the manual sharding procedure
          (shown here: https://tracker.ceph.com/issues/24457#note-5)

        Once this was successfully sharded, or if the bucket was
          previously sufficiently sharded by Ceph prior to disabling the
          functionality I was able to use the following command
          (seemingly undocumented for Luminous http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands):

        radosgw-admin bi purge --bucket ${bucketname} --bucket-id
          ${old_bucket_id}

        I then issued a ceph pg deep-scrub against the PG that had
          contained the Large omap object.

        Once I had completed this procedure, my Large omap object
          warnings went away and the cluster returned to HEALTH_OK.  

        However our radosgw bucket indexes pool now seems to be
          using substantially more space than previously.  Having looked
          initially at this bug, and in particular the first comment:

        http://tracker.ceph.com/issues/34307#note-1

        I was able to extract a number of bucket indexes that had
          apparently been resharded, and removed the legacy index using
          the radosgw-admin bi purge --bucket ${bucket} ${marker}.  I am
          still able  to perform a radosgw-admin metadata get
          bucket.instance:${bucket}:${marker} successfully, however now
          when I run rados -p .rgw.buckets.index ls | grep ${marker}
          nothing is returned.  Even after this, we were still seeing
          extremely high disk usage of our OSDs containing the bucket
          indexes (we have a dedicated pool for this).  I then modified
          the one liner referenced in the previous link as follows:

         grep -E '"bucket"|"id"|"marker"' bucket-stats.out | awk -F
          ":" '{print $2}' | tr -d '",' | while read -r bucket; do read
          -r id; read -r marker; [ "$id" == "$marker" ] && true
          || NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get
          bucket.instance:${bucket}:${marker} | python -c 'import sys,
          json; print
          json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`;
          while [ ${NEWID} ]; do if [ "${NEWID}" != "${marker}" ]
          && [ ${NEWID} != ${bucket} ] ; then echo "$bucket
          $NEWID"; fi; NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata
          get bucket.instance:${bucket}:${NEWID} | python -c 'import
          sys, json; print
          json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`;
          done; done > buckets_with_multiple_reindexes2.txt

        This loops through the buckets that have a different
          marker/bucket_id, and looks to see if a new_bucket_instance_id
          is there, and if so will loop through until there is no longer
          a "new_bucket_instance_id".  After letting this complete, this
          suggests that I have over 5000 indexes for 74 buckets, some of
          these buckets have > 100 indexes apparently.

          :~# awk '{print $1}' buckets_with_multiple_reindexes2.txt
            | uniq | wc -l
          74

          ~# wc -l buckets_with_multiple_reindexes2.txt
          5813 buckets_with_multiple_reindexes2.txt

        This is running a single realm, multiple zone
          configuration, and no multi site sync, but the closest I can
          find to this issue is this bug https://tracker.ceph.com/issues/24603

        Should I be OK to loop through these indexes and remove any
          with a reshard_status of 2, a new_bucket_instance_id that does
          not match the bucket_instance_id returned by the command:

        radosgw-admin bucket stats --bucket ${bucket}

        I'd ideally like to get to a point where I can turn dynamic
          sharding back on safely for this cluster.

        Thanks for any assistance, let me know if there's any more
          information I should provide
        Chris

            On Thu, 4 Oct 2018 at 18:22 Chris Sarginson
              <csargiso@xxxxxxxxx> wrote:

              Hi,

                Thanks for the response - I am still unsure as to
                  what will happen to the "marker" reference in the
                  bucket metadata, as this is the object that is being
                  detected as Large.  Will the bucket generate a new
                  "marker" reference in the bucket metadata?  

                I've been reading this page to try and get a better
                  understanding of this 
                http://docs.ceph.com/docs/luminous/radosgw/layout/

                However I'm no clearer on this (and what the
                  "marker" is used for), or why there are multiple
                  separate "bucket_id" values (with different mtime
                  stamps) that all show as having the same number of
                  shards.  

                If I were to remove the old bucket would I just be
                  looking to execute 

                rados - p .rgw.buckets.index rm
                  .dir.default.5689810.107

                Is the differing marker/bucket_id in the other
                  buckets that was found also an indicator?  As I say,
                  there's a good number of these, here's some additional
                  examples, though these aren't necessarily reporting as
                  large omap objects:

                  "BUCKET1", "default.281853840.479",
                    "default.105206134.5",
                  "BUCKET2", "default.364663174.1",
                    "default.349712129.3674",

                Checking these other buckets, they are exhibiting
                  the same sort of symptoms as the first (multiple
                  instances of radosgw-admin metadata get showing what
                  seem to be multiple resharding processes being run,
                  with different mtimes recorded).

                Thanks 

                Chris

                  On Thu, 4 Oct 2018 at 16:21 Konstantin
                    Shalygin <k0ste@xxxxxxxx>
                    wrote:

                        Hi,

Ceph version: Luminous 12.2.7

Following upgrading to Luminous from Jewel we have been stuck with a
cluster in HEALTH_WARN state that is complaining about large omap objects.
These all seem to be located in our .rgw.buckets.index pool.  We've
disabled auto resharding on bucket indexes due to seeming looping issues
after our upgrade.  We've reduced the number reported of reported large
omap objects by initially increasing the following value:

~# ceph daemon mon.ceph-mon-1 config get
osd_deep_scrub_large_omap_object_value_sum_threshold
{
    "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648"
}

However we're still getting a warning about a single large OMAP object,
however I don't believe this is related to an unsharded index - here's the
log entry:

2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 :
cluster [WRN] Large omap object found. Object:
15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
(bytes): 4458647149

The object in the logs is the "marker" object, rather than the bucket_id -
I've put some details regarding the bucket here:

https://pastebin.com/hW53kTxL

The bucket limit check shows that the index is sharded, so I think this
might be related to versioning, although I was unable to get confirmation
that the bucket in question has versioning enabled through the aws
cli(snipped debug output below)

2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
'x-amz-request-id': 'tx0000000000000020e3b15-005bb37c85-15870fe0-default',
'content-type': 'application/xml'}
2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
body:
<?xml version="1.0" encoding="UTF-8"?><VersioningConfiguration xmlns="
http://s3.amazonaws.com/doc/2006-03-01/"></VersioningConfiguration>

After dumping the contents of large omap object mentioned above into a file
it does seem to be a simple listing of the bucket contents, potentially an
old index:

~# wc -l omap_keys
17467251 omap_keys

This is approximately 5 million below the currently reported number of
objects in the bucket.

When running the commands listed here:
http://tracker.ceph.com/issues/34307#note-1

The problematic bucket is listed in the output (along with 72 other
buckets):
"CLIENTBUCKET", "default.294495648.690", "default.5689810.107"

As this tests for bucket_id and marker fields not matching to print out the
information, is the implication here that both of these should match in
order to fully migrate to the new sharded index?

I was able to do a "metadata get" using what appears to be the old index
object ID, which seems to support this (there's a "new_bucket_instance_id"
field, containing a newer "bucket_id" and reshard_status is 2, which seems
to suggest it has completed).

I am able to take the "new_bucket_instance_id" and get additional metadata
about the bucket, each time I do this I get a slightly newer
"new_bucket_instance_id", until it stops suggesting updated indexes.

It's probably worth pointing out that when going through this process the
final "bucket_id" doesn't match the one that I currently get when running
'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also
suggests that no further resharding has been done as "reshard_status" = 0
and "new_bucket_instance_id" is blank.  The output is available to view
here:

https://pastebin.com/g1TJfKLU

It would be useful if anyone can offer some clarification on how to proceed
from this situation, identifying and removing any old/stale indexes from
the index pool (if that is the case), as I've not been able to spot
anything in the archives.

If there's any further information that is needed for additional context
please let me know.

                      Usually, when you bucket
                          is automatically resharded in some case old
                          big index is not deleted - this is your large
                          omap object.
                      This index is safe to
                          delete. Also look at [1].

                      [1] https://tracker.ceph.com/issues/24457

                      k

      _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com