Re: Deleting buckets and objects fails to reduce reported cluster usage

Ben <b@benjackson.email> · Tue, 02 Dec 2014 10:20:58 +1100

On 2014-12-02 09:25, Yehuda Sadeh wrote:
On Mon, Dec 1, 2014 at 2:10 PM, Ben <b@benjackson.email> wrote:
On 2014-12-02 08:39, Yehuda Sadeh wrote:

On Sat, Nov 29, 2014 at 2:26 PM, Ben <b@benjackson.email> wrote:

On 29/11/14 11:40, Yehuda Sadeh wrote:

On Fri, Nov 28, 2014 at 1:38 PM, Ben <b@benjackson.email> wrote:

On 29/11/14 01:50, Yehuda Sadeh wrote:

On Thu, Nov 27, 2014 at 9:22 PM, Ben <b@benjackson.email> wrote:

On 2014-11-28 15:42, Yehuda Sadeh wrote:

On Thu, Nov 27, 2014 at 2:15 PM, b <b@benjackson.email> wrote:

On 2014-11-27 11:36, Yehuda Sadeh wrote:

On Wed, Nov 26, 2014 at 3:49 PM, b <b@benjackson.email> 
wrote:

On 2014-11-27 10:21, Yehuda Sadeh wrote:

On Wed, Nov 26, 2014 at 3:09 PM, b <b@benjackson.email> 
wrote:

On 2014-11-27 09:38, Yehuda Sadeh wrote:

On Wed, Nov 26, 2014 at 2:32 PM, b <b@benjackson.email> 
wrote:

I've been deleting a bucket which originally had 60TB of 
data
in
it,
with
our cluster doing only 1 replication, the total usage 
was
120TB.

I've been deleting the objects slowly using S3 browser, 
and I
can
see
the
bucket usage is now down to around 2.5TB or 5TB with
duplication,
but
the
usage in the cluster has not changed.

I've looked at garbage collection (radosgw-admin gc list
--include
all)
and
it just reports square brackets "[]"

I've run radosgw-admin temp remove --date=2014-11-20, 
and it
doesn't
appear
to have any effect.

Is there a way to check where this space is being 
consumed?

Running 'ceph df' the USED space in the buckets pool is 
not
showing
any
of
the 57TB that should have been freed up from the 
deletion so
far.

Running 'radosgw-admin bucket stats | jshon | grep
size_kb_actual'
and
adding up all the buckets usage, this shows that the 
space
has
been
freed
from the bucket, but the cluster is all sorts of messed 
up.

ANY IDEAS? What can I look at?

Can you run 'radosgw-admin gc list --include-all'?

Yehuda

I've done it before, and it just returns square brackets 
[]
(see
below)

radosgw-admin gc list --include-all
[]

Do you know which of the rados pools have all that extra 
data?
Try
to
list that pool's objects, verify that there are no 
surprises
there
(e.g., use 'rados -p <pool> ls').

Yehuda

I'm just running that command now, and its taking some time.
There
is
a
large number of objects.

Once it has finished, what should I be looking for?

I assume the pool in question is the one that holds your 
objects
data?
You should be looking for objects that are not expected to 
exist
anymore, and objects of buckets that don't exist anymore. The
problem
here is to identify these.
I suggest starting by looking at all the existing buckets, 
compose
a
list of all the bucket prefixes for the existing buckets, and 
try
to
look whether there are objects that have different prefixes.

Yehuda

Any ideas? I've found the prefix, the number of objects in the 
pool
that
match that prefix numbers in the 21 millions
The actual 'radosgw-admin bucket stats' command reports it as 
only
having
1.2 million.

Well, the objects you're seeing are raw objects, and since rgw
stripes
the data, it is expected to have more raw objects than objects 
in
the
bucket. Still, it seems that you have much too many of these. 
You
can
try to check whether there are pending multipart uploads that 
were
never completed using the S3 api.
At the moment there's no easy way to figure out which raw 
objects
are
not supposed to exist. The process would be like this:
1. rados ls -p <data pool>
keep the list sorted
2. list objects in the bucket
3. for each object in (2), do: radosgw-admin object stat
--bucket=<bucket> --object=<object> --rgw-cache-enabled=false
(disabling the cache so that it goes quicker)
4. look at the result of (3), and generate a list of all the 
parts.
5. sort result of (4), compare it to (1)

Note that if you're running firefly or later, the raw objects 
are
not
specified explicitly in the command you run at (3), so you 
might
need
a different procedure, e.g., find out the raw objects random 
string
that is being used, remove it from the list generated in 1, 
etc.)

That's basically it.
I'll be interested to figure out what happened, why the garbage
collection didn't work correctly. You could try verifying that 
it's
working by:
   - create an object (let's say ~10MB in size).
   - radosgw-admin object stat --bucket=<bucket> 
--object=<object>
     (keep this info, see
   - remove the object
   - run radosgw-admin gc list --include-all and verify that 
the raw
parts are listed there
   - wait a few hours, repeat last step, see that the parts 
don't
appear
there anymore
   - run rados -p <pool> ls, check to see if the raw objects 
still
exist

Yehuda

Not sure where to go from here, and our cluster is slowly 
filling
up
while
not clearing any space.

I did the last section:

I'll be interested to figure out what happened, why the garbage
collection didn't work correctly. You could try verifying that 
it's
working by:
   - create an object (let's say ~10MB in size).
   - radosgw-admin object stat --bucket=<bucket> 
--object=<object>
     (keep this info, see
   - remove the object
   - run radosgw-admin gc list --include-all and verify that 
the raw
parts are listed there
   - wait a few hours, repeat last step, see that the parts 
don't
appear
there anymore
   - run rados -p <pool> ls, check to see if the raw objects 
still
exist

I added the file, did a stat and it displayed the json output
I removed the object and then tried to stat the object, this 
time it
failed
to stat the object
After this, I ran the gc list include all command and it 
displayed
nothing
but the square brackets []

Was the object larger than 512k? Also, did you do it within the 
300
seconds after removing the object?

There should exist a garbage collection pool (by default .rgw.gc, 
but
it can be something different if you configured your zone
differently), can you verify that you have it, and if so, what 
does it
contain?

Yehuda

Yes, the object was 10M. As soon as I had deleted it from the 
bucket, I
ran
the command to check garbage collection.
There is a .rgw.gc pool, we haven't changed it from default. It
contains
a
number of objects ~7800, but the size of the files is 0kb

They're expected to be 0kb, the data only resides in their omap, 
and
that's not reflected in the objects size. You could run 'rados
listomapkeys' on these.

Maybe garbage collection isn't working properly..

our gc settings are the following, we have 2 object gateways in 
our
cluster
too client.radosgw.obj01 and client.radosgw.obj02 (from 
ceph.conf)
[client.radosgw.obj01]
    rgw dns name = ceph.###.###
    host = obj01
    keyring = /etc/ceph/keyring.radosgw.obj01
    rgw socket path = /tmp/radosgw.sock
    log file = /var/log/ceph/radosgw.log
    rgw data = /var/lib/ceph/radosgw/obj01
    rgw thread pool size = 128
    rgw print continue = True
    debug rgw = 0
    rgw enable ops log = False
    log to stderr = False
    rgw enable usage log = False
    rgw gc max objs = 7877

You should put this line (rgw gx max objs) in the global section of
your ceph.conf. Either that, or run your radosgw-admin command with
'-n client.radosgw.obj02'. That might change some of the results
you're seeing (radosgw-admin gc list --include-all, etc.).

Yehuda

    rgw gc obj min wait = 300
    rgw gc processor period = 600
    rgw init timeout = 180
    rgw gc processor max time = 600
[client.radosgw.obj02]
    rgw dns name = ceph.###.###
    host = obj02
    keyring = /etc/ceph/keyring.radosgw.obj02
    rgw socket path = /tmp/radosgw.sock
    log file = /var/log/ceph/radosgw.log
    rgw data = /var/lib/ceph/radosgw/obj02
    rgw thread pool size = 128
    rgw print continue = True
    debug rgw = 0
    rgw enable ops log = False
    log to stderr = False
    rgw enable usage log = False
    rgw gc max objs = 7877
    rgw gc obj min wait = 300
    rgw gc processor period = 600
    rgw init timeout = 180
    rgw gc processor max time = 600

I've finally deleted the entire bucket. All 60TB cleared from the 
bucket,
the bucket no longer exists.

Yet running rados ls -p .rgw.buckets | grep '4804.14' still lists 
all the
_shadow_ files that have that buckets prefix.

Any ideas why these aren't being deleted/cleared up by garbage
collection?

Are there any errors in the log? Can you provide a log (debug rgw =
20, debug ms = 1) of the radosgw through the garbage collection 
stage?

Yehuda

Hi Yehuda,

Here is a snapshot of the log.

I created a new bucket, uploaded a 25mb file and then deleted it
i did a gc list --include all and could see the shadow files

I then forced a gc process, and the log has gone crazy with this stuff 
in
it, and it is still going.

2014-12-02 09:07:59.634544 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.13:6816/17803 -- ping v1 -- ?+0 0x7f8558000d20 con 0x1e9c470
2014-12-02 09:07:59.634578 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.12:6830/3830 -- ping v1 -- ?+0 0x7f8558001120 con 0x1e93090
2014-12-02 09:07:59.634592 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.14:6818/3517 -- ping v1 -- ?+0 0x7f8558001700 con 0x1e9d230
2014-12-02 09:07:59.634607 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.10:6806/3397 -- ping v1 -- ?+0 0x7f8558006050 con 0x1e9e0e0
2014-12-02 09:07:59.634621 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.19:6830/4000 -- ping v1 -- ?+0 0x7f8558000a10 con 0x1e9fdc0
2014-12-02 09:07:59.634634 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.11:6809/3451 -- ping v1 -- ?+0 0x7f8558001d50 con 0x1e9ef10
2014-12-02 09:07:59.634647 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.19:6803/3449 -- ping v1 -- ?+0 0x7f8558001fe0 con 0x1ea1aa0
2014-12-02 09:07:59.634660 7f86256f3700  1 -- 10.150.2.2:0/1009990 -->
10.150.2.17:6827/14001 -- ping v1 -- ?+0 0x7f85580028c0 con 0x1ea0bf0
2014-12-02 09:07:59.637762 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.108 10.150.2.18:6827/13867 158 ==== osd_op_reply(14761 gc.1283 
[call]
v0'0 uv98856 ondisk = 0) v6 ==== 174+0+11 (1634423764 0 1993775135)
0x7f8570006050 con 0x1f1e4d0
2014-12-02 09:07:59.637937 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.18:6827/13867 -- osd_op(client.6359480.0:14762 gc.1283 [call
lock.unlock] 6.10bb9719 ondisk+write e19074) v4 -- ?+0 0x1f30a80 con
0x1f1e4d0
2014-12-02 09:07:59.652166 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.108 10.150.2.18:6827/13867 159 ==== osd_op_reply(14762 gc.1283 
[call]
v19074'98857 uv98857 ondisk = 0) v6 ==== 174+0+0 (1358419719 0 0)
0x7f8570006050 con 0x1f1e4d0
2014-12-02 09:07:59.652293 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.15:6805/1032 -- osd_op(client.6359480.0:14763 gc.1284 [call
lock.lock] 6.4a5f5aa9 ondisk+write e19074) v4 -- ?+0 0x1f30a80 con 
0x1f290a0
2014-12-02 09:07:59.670108 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.53 10.150.2.15:6805/1032 130 ==== osd_op_reply(14763 gc.1284 
[call]
v19074'100398 uv100398 ondisk = 0) v6 ==== 174+0+0 (3807569510 0 0)
0x7f85b0000b60 con 0x1f290a0
2014-12-02 09:07:59.670239 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.15:6805/1032 -- osd_op(client.6359480.0:14764 gc.1284 [call
rgw.gc_list] 6.4a5f5aa9 ack+read e19074) v4 -- ?+0 0x1f30a80 con 
0x1f290a0
2014-12-02 09:07:59.673871 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.53 10.150.2.15:6805/1032 131 ==== osd_op_reply(14764 gc.1284 
[call] v0'0
uv100398 ondisk = 0) v6 ==== 174+0+11 (4101371043 0 1993775135)
0x7f85b0000b60 con 0x1f290a0
2014-12-02 09:07:59.674042 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.15:6805/1032 -- osd_op(client.6359480.0:14765 gc.1284 [call
lock.unlock] 6.4a5f5aa9 ondisk+write e19074) v4 -- ?+0 0x1f30a80 con
0x1f290a0
2014-12-02 09:07:59.686255 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.53 10.150.2.15:6805/1032 132 ==== osd_op_reply(14765 gc.1284 
[call]
v19074'100399 uv100399 ondisk = 0) v6 ==== 174+0+0 (2220917153 0 0)
0x7f85b0000b60 con 0x1f290a0
2014-12-02 09:07:59.686376 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.19:6812/3767 -- osd_op(client.6359480.0:14766 gc.1285 [call
lock.lock] 6.9f912e07 ondisk+write e19074) v4 -- ?+0 0x1f1f560 con 
0x1f0ea90
2014-12-02 09:07:59.721432 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.83 10.150.2.19:6812/3767 163 ==== osd_op_reply(14766 gc.1285 
[call]
v19074'302537 uv302537 ondisk = 0) v6 ==== 174+0+0 (2463619648 0 0)
0x7f85b8000900 con 0x1f0ea90
2014-12-02 09:07:59.721591 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.19:6812/3767 -- osd_op(client.6359480.0:14767 gc.1285 [call
rgw.gc_list] 6.9f912e07 ack+read e19074) v4 -- ?+0 0x1f30a80 con 
0x1f0ea90
2014-12-02 09:07:59.725699 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.83 10.150.2.19:6812/3767 164 ==== osd_op_reply(14767 gc.1285 
[call] v0'0
uv302537 ondisk = 0) v6 ==== 174+0+11 (3924946656 0 1993775135)
0x7f85b8000900 con 0x1f0ea90
2014-12-02 09:07:59.725868 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.19:6812/3767 -- osd_op(client.6359480.0:14768 gc.1285 [call
lock.unlock] 6.9f912e07 ondisk+write e19074) v4 -- ?+0 0x1f1f560 con
0x1f0ea90
2014-12-02 09:07:59.738287 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.83 10.150.2.19:6812/3767 165 ==== osd_op_reply(14768 gc.1285 
[call]
v19074'302538 uv302538 ondisk = 0) v6 ==== 174+0+0 (314638285 0 0)
0x7f85b8000900 con 0x1f0ea90
2014-12-02 09:07:59.738415 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.10:6803/3331 -- osd_op(client.6359480.0:14769 gc.1286 [call
lock.lock] 6.61beb660 ondisk+write e19074) v4 -- ?+0 0x1f30a80 con 
0x1f1ce50
2014-12-02 09:07:59.764523 7f86277f8700  1 -- 10.150.2.2:0/1009990 <== 
osd.1
10.150.2.10:6803/3331 145 ==== osd_op_reply(14769 gc.1286 [call]
v19074'136677 uv136677 ondisk = 0) v6 ==== 174+0+0 (3559159328 0 0)
0x7f8540002dd0 con 0x1f1ce50
2014-12-02 09:07:59.764670 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.10:6803/3331 -- osd_op(client.6359480.0:14770 gc.1286 [call
rgw.gc_list] 6.61beb660 ack+read e19074) v4 -- ?+0 0x1f1f560 con 
0x1f1ce50
2014-12-02 09:07:59.767836 7f86277f8700  1 -- 10.150.2.2:0/1009990 <== 
osd.1
10.150.2.10:6803/3331 146 ==== osd_op_reply(14770 gc.1286 [call] v0'0
uv136677 ondisk = 0) v6 ==== 174+0+11 (1532209036 0 1993775135)
0x7f8540002dd0 con 0x1f1ce50
2014-12-02 09:07:59.768009 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.10:6803/3331 -- osd_op(client.6359480.0:14771 gc.1286 [call
lock.unlock] 6.61beb660 ondisk+write e19074) v4 -- ?+0 0x1f30a80 con
0x1f1ce50
2014-12-02 09:07:59.780792 7f86277f8700  1 -- 10.150.2.2:0/1009990 <== 
osd.1
10.150.2.10:6803/3331 147 ==== osd_op_reply(14771 gc.1286 [call]
v19074'136678 uv136678 ondisk = 0) v6 ==== 174+0+0 (1412649901 0 0)
0x7f8540002dd0 con 0x1f1ce50
2014-12-02 09:07:59.780968 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.12:6833/4016 -- osd_op(client.6359480.0:14772 gc.1287 [call
lock.lock] 6.9bfdc387 ondisk+write e19074) v4 -- ?+0 0x1f1f560 con 
0x1f3b2f0
2014-12-02 09:07:59.804154 7f86277f8700  1 -- 10.150.2.2:0/1009990 <==
osd.65 10.150.2.12:6833/4016 118 ==== osd_op_reply(14772 gc.1287 
[call]
v19074'279063 uv279063 ondisk = 0) v6 ==== 174+0+0 (4117862336 0 0)
0x7f85d80024b0 con 0x1f3b2f0
2014-12-02 09:07:59.804279 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 -->
10.150.2.12:6833/4016 -- osd_op(client.6359480.0:14773 gc.1287 [call
rgw.gc_list] 6.9bfdc387 ack+read e19074) v4 -- ?+0 0x1f30a80 con 
0x1f3b2f0

Can you find the part of the log where it's handling a shard that
actually has an object inside? These are all just iterating over empty
shards.

Yehuda

How can I tell if the shard has an object in it from the logs?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com