Re: Deleting buckets and objects fails to reduce reported cluster usage

Ben <b@benjackson.email> · Tue, 02 Dec 2014 11:23:45 +1100

On 2014-12-02 11:21, Yehuda Sadeh wrote:
On Mon, Dec 1, 2014 at 3:47 PM, Ben <b@benjackson.email> wrote:
On 2014-12-02 10:40, Yehuda Sadeh wrote:

On Mon, Dec 1, 2014 at 3:20 PM, Ben <b@benjackson.email> wrote:

On 2014-12-02 09:25, Yehuda Sadeh wrote:

On Mon, Dec 1, 2014 at 2:10 PM, Ben <b@benjackson.email> wrote:

On 2014-12-02 08:39, Yehuda Sadeh wrote:

On Sat, Nov 29, 2014 at 2:26 PM, Ben <b@benjackson.email> wrote:

On 29/11/14 11:40, Yehuda Sadeh wrote:

On Fri, Nov 28, 2014 at 1:38 PM, Ben <b@benjackson.email> 
wrote:

On 29/11/14 01:50, Yehuda Sadeh wrote:

On Thu, Nov 27, 2014 at 9:22 PM, Ben <b@benjackson.email> 
wrote:

On 2014-11-28 15:42, Yehuda Sadeh wrote:

On Thu, Nov 27, 2014 at 2:15 PM, b <b@benjackson.email> 
wrote:

On 2014-11-27 11:36, Yehuda Sadeh wrote:

On Wed, Nov 26, 2014 at 3:49 PM, b <b@benjackson.email> 
wrote:

On 2014-11-27 10:21, Yehuda Sadeh wrote:

On Wed, Nov 26, 2014 at 3:09 PM, b <b@benjackson.email>
wrote:

On 2014-11-27 09:38, Yehuda Sadeh wrote:

On Wed, Nov 26, 2014 at 2:32 PM, b 
<b@benjackson.email>
wrote:

I've been deleting a bucket which originally had 
60TB of
data
in
it,
with
our cluster doing only 1 replication, the total 
usage was
120TB.

I've been deleting the objects slowly using S3 
browser,
and
I
can
see
the
bucket usage is now down to around 2.5TB or 5TB with
duplication,
but
the
usage in the cluster has not changed.

I've looked at garbage collection (radosgw-admin gc 
list
--include
all)
and
it just reports square brackets "[]"

I've run radosgw-admin temp remove 
--date=2014-11-20, and
it
doesn't
appear
to have any effect.

Is there a way to check where this space is being
consumed?

Running 'ceph df' the USED space in the buckets pool 
is
not
showing
any
of
the 57TB that should have been freed up from the 
deletion
so
far.

Running 'radosgw-admin bucket stats | jshon | grep
size_kb_actual'
and
adding up all the buckets usage, this shows that the
space
has
been
freed
from the bucket, but the cluster is all sorts of 
messed
up.

ANY IDEAS? What can I look at?

Can you run 'radosgw-admin gc list --include-all'?

Yehuda

I've done it before, and it just returns square 
brackets []
(see
below)

radosgw-admin gc list --include-all
[]

Do you know which of the rados pools have all that 
extra
data?
Try
to
list that pool's objects, verify that there are no 
surprises
there
(e.g., use 'rados -p <pool> ls').

Yehuda

I'm just running that command now, and its taking some 
time.
There
is
a
large number of objects.

Once it has finished, what should I be looking for?

I assume the pool in question is the one that holds your
objects
data?
You should be looking for objects that are not expected 
to
exist
anymore, and objects of buckets that don't exist anymore. 
The
problem
here is to identify these.
I suggest starting by looking at all the existing 
buckets,
compose
a
list of all the bucket prefixes for the existing buckets, 
and
try
to
look whether there are objects that have different 
prefixes.

Yehuda

Any ideas? I've found the prefix, the number of objects in 
the
pool
that
match that prefix numbers in the 21 millions
The actual 'radosgw-admin bucket stats' command reports it 
as
only
having
1.2 million.

Well, the objects you're seeing are raw objects, and since 
rgw
stripes
the data, it is expected to have more raw objects than 
objects
in
the
bucket. Still, it seems that you have much too many of 
these.
You
can
try to check whether there are pending multipart uploads 
that
were
never completed using the S3 api.
At the moment there's no easy way to figure out which raw
objects
are
not supposed to exist. The process would be like this:
1. rados ls -p <data pool>
keep the list sorted
2. list objects in the bucket
3. for each object in (2), do: radosgw-admin object stat
--bucket=<bucket> --object=<object> 
--rgw-cache-enabled=false
(disabling the cache so that it goes quicker)
4. look at the result of (3), and generate a list of all 
the
parts.
5. sort result of (4), compare it to (1)

Note that if you're running firefly or later, the raw 
objects
are
not
specified explicitly in the command you run at (3), so you 
might
need
a different procedure, e.g., find out the raw objects 
random
string
that is being used, remove it from the list generated in 1,
etc.)

That's basically it.
I'll be interested to figure out what happened, why the 
garbage
collection didn't work correctly. You could try verifying 
that
it's
working by:
   - create an object (let's say ~10MB in size).
   - radosgw-admin object stat --bucket=<bucket>
--object=<object>
     (keep this info, see
   - remove the object
   - run radosgw-admin gc list --include-all and verify 
that the
raw
parts are listed there
   - wait a few hours, repeat last step, see that the parts
don't
appear
there anymore
   - run rados -p <pool> ls, check to see if the raw 
objects
still
exist

Yehuda

Not sure where to go from here, and our cluster is slowly
filling
up
while
not clearing any space.

I did the last section:

I'll be interested to figure out what happened, why the 
garbage
collection didn't work correctly. You could try verifying 
that
it's
working by:
   - create an object (let's say ~10MB in size).
   - radosgw-admin object stat --bucket=<bucket>
--object=<object>
     (keep this info, see
   - remove the object
   - run radosgw-admin gc list --include-all and verify 
that the
raw
parts are listed there
   - wait a few hours, repeat last step, see that the parts
don't
appear
there anymore
   - run rados -p <pool> ls, check to see if the raw 
objects
still
exist

I added the file, did a stat and it displayed the json 
output
I removed the object and then tried to stat the object, this 
time
it
failed
to stat the object
After this, I ran the gc list include all command and it
displayed
nothing
but the square brackets []

Was the object larger than 512k? Also, did you do it within 
the
300
seconds after removing the object?

There should exist a garbage collection pool (by default 
.rgw.gc,
but
it can be something different if you configured your zone
differently), can you verify that you have it, and if so, 
what
does
it
contain?

Yehuda

Yes, the object was 10M. As soon as I had deleted it from the
bucket,
I
ran
the command to check garbage collection.
There is a .rgw.gc pool, we haven't changed it from default. 
It
contains
a
number of objects ~7800, but the size of the files is 0kb

They're expected to be 0kb, the data only resides in their 
omap, and
that's not reflected in the objects size. You could run 'rados
listomapkeys' on these.

Maybe garbage collection isn't working properly..

our gc settings are the following, we have 2 object gateways 
in
our
cluster
too client.radosgw.obj01 and client.radosgw.obj02 (from
ceph.conf)
[client.radosgw.obj01]
    rgw dns name = ceph.###.###
    host = obj01
    keyring = /etc/ceph/keyring.radosgw.obj01
    rgw socket path = /tmp/radosgw.sock
    log file = /var/log/ceph/radosgw.log
    rgw data = /var/lib/ceph/radosgw/obj01
    rgw thread pool size = 128
    rgw print continue = True
    debug rgw = 0
    rgw enable ops log = False
    log to stderr = False
    rgw enable usage log = False
    rgw gc max objs = 7877

You should put this line (rgw gx max objs) in the global 
section of
your ceph.conf. Either that, or run your radosgw-admin command 
with
'-n client.radosgw.obj02'. That might change some of the 
results
you're seeing (radosgw-admin gc list --include-all, etc.).

Yehuda

    rgw gc obj min wait = 300
    rgw gc processor period = 600
    rgw init timeout = 180
    rgw gc processor max time = 600
[client.radosgw.obj02]
    rgw dns name = ceph.###.###
    host = obj02
    keyring = /etc/ceph/keyring.radosgw.obj02
    rgw socket path = /tmp/radosgw.sock
    log file = /var/log/ceph/radosgw.log
    rgw data = /var/lib/ceph/radosgw/obj02
    rgw thread pool size = 128
    rgw print continue = True
    debug rgw = 0
    rgw enable ops log = False
    log to stderr = False
    rgw enable usage log = False
    rgw gc max objs = 7877
    rgw gc obj min wait = 300
    rgw gc processor period = 600
    rgw init timeout = 180
    rgw gc processor max time = 600

I've finally deleted the entire bucket. All 60TB cleared from 
the
bucket,
the bucket no longer exists.

Yet running rados ls -p .rgw.buckets | grep '4804.14' still 
lists all
the
_shadow_ files that have that buckets prefix.

Any ideas why these aren't being deleted/cleared up by garbage
collection?

Are there any errors in the log? Can you provide a log (debug rgw 
=
20, debug ms = 1) of the radosgw through the garbage collection 
stage?

Yehuda

Hi Yehuda,

Here is a snapshot of the log.

I created a new bucket, uploaded a 25mb file and then deleted it
i did a gc list --include all and could see the shadow files

I then forced a gc process, and the log has gone crazy with this 
stuff
in
it, and it is still going.

2014-12-02 09:07:59.634544 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.13:6816/17803 -- ping v1 -- ?+0 0x7f8558000d20 con 
0x1e9c470
2014-12-02 09:07:59.634578 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.12:6830/3830 -- ping v1 -- ?+0 0x7f8558001120 con 
0x1e93090
2014-12-02 09:07:59.634592 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.14:6818/3517 -- ping v1 -- ?+0 0x7f8558001700 con 
0x1e9d230
2014-12-02 09:07:59.634607 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.10:6806/3397 -- ping v1 -- ?+0 0x7f8558006050 con 
0x1e9e0e0
2014-12-02 09:07:59.634621 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.19:6830/4000 -- ping v1 -- ?+0 0x7f8558000a10 con 
0x1e9fdc0
2014-12-02 09:07:59.634634 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.11:6809/3451 -- ping v1 -- ?+0 0x7f8558001d50 con 
0x1e9ef10
2014-12-02 09:07:59.634647 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.19:6803/3449 -- ping v1 -- ?+0 0x7f8558001fe0 con 
0x1ea1aa0
2014-12-02 09:07:59.634660 7f86256f3700  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.17:6827/14001 -- ping v1 -- ?+0 0x7f85580028c0 con 
0x1ea0bf0
2014-12-02 09:07:59.637762 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.108 10.150.2.18:6827/13867 158 ==== osd_op_reply(14761 gc.1283
[call]
v0'0 uv98856 ondisk = 0) v6 ==== 174+0+11 (1634423764 0 
1993775135)
0x7f8570006050 con 0x1f1e4d0
2014-12-02 09:07:59.637937 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.18:6827/13867 -- osd_op(client.6359480.0:14762 gc.1283 
[call
lock.unlock] 6.10bb9719 ondisk+write e19074) v4 -- ?+0 0x1f30a80 
con
0x1f1e4d0
2014-12-02 09:07:59.652166 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.108 10.150.2.18:6827/13867 159 ==== osd_op_reply(14762 gc.1283
[call]
v19074'98857 uv98857 ondisk = 0) v6 ==== 174+0+0 (1358419719 0 0)
0x7f8570006050 con 0x1f1e4d0
2014-12-02 09:07:59.652293 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.15:6805/1032 -- osd_op(client.6359480.0:14763 gc.1284 
[call
lock.lock] 6.4a5f5aa9 ondisk+write e19074) v4 -- ?+0 0x1f30a80 con
0x1f290a0
2014-12-02 09:07:59.670108 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.53 10.150.2.15:6805/1032 130 ==== osd_op_reply(14763 gc.1284 
[call]
v19074'100398 uv100398 ondisk = 0) v6 ==== 174+0+0 (3807569510 0 
0)
0x7f85b0000b60 con 0x1f290a0
2014-12-02 09:07:59.670239 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.15:6805/1032 -- osd_op(client.6359480.0:14764 gc.1284 
[call
rgw.gc_list] 6.4a5f5aa9 ack+read e19074) v4 -- ?+0 0x1f30a80 con
0x1f290a0
2014-12-02 09:07:59.673871 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.53 10.150.2.15:6805/1032 131 ==== osd_op_reply(14764 gc.1284 
[call]
v0'0
uv100398 ondisk = 0) v6 ==== 174+0+11 (4101371043 0 1993775135)
0x7f85b0000b60 con 0x1f290a0
2014-12-02 09:07:59.674042 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.15:6805/1032 -- osd_op(client.6359480.0:14765 gc.1284 
[call
lock.unlock] 6.4a5f5aa9 ondisk+write e19074) v4 -- ?+0 0x1f30a80 
con
0x1f290a0
2014-12-02 09:07:59.686255 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.53 10.150.2.15:6805/1032 132 ==== osd_op_reply(14765 gc.1284 
[call]
v19074'100399 uv100399 ondisk = 0) v6 ==== 174+0+0 (2220917153 0 
0)
0x7f85b0000b60 con 0x1f290a0
2014-12-02 09:07:59.686376 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.19:6812/3767 -- osd_op(client.6359480.0:14766 gc.1285 
[call
lock.lock] 6.9f912e07 ondisk+write e19074) v4 -- ?+0 0x1f1f560 con
0x1f0ea90
2014-12-02 09:07:59.721432 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.83 10.150.2.19:6812/3767 163 ==== osd_op_reply(14766 gc.1285 
[call]
v19074'302537 uv302537 ondisk = 0) v6 ==== 174+0+0 (2463619648 0 
0)
0x7f85b8000900 con 0x1f0ea90
2014-12-02 09:07:59.721591 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.19:6812/3767 -- osd_op(client.6359480.0:14767 gc.1285 
[call
rgw.gc_list] 6.9f912e07 ack+read e19074) v4 -- ?+0 0x1f30a80 con
0x1f0ea90
2014-12-02 09:07:59.725699 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.83 10.150.2.19:6812/3767 164 ==== osd_op_reply(14767 gc.1285 
[call]
v0'0
uv302537 ondisk = 0) v6 ==== 174+0+11 (3924946656 0 1993775135)
0x7f85b8000900 con 0x1f0ea90
2014-12-02 09:07:59.725868 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.19:6812/3767 -- osd_op(client.6359480.0:14768 gc.1285 
[call
lock.unlock] 6.9f912e07 ondisk+write e19074) v4 -- ?+0 0x1f1f560 
con
0x1f0ea90
2014-12-02 09:07:59.738287 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.83 10.150.2.19:6812/3767 165 ==== osd_op_reply(14768 gc.1285 
[call]
v19074'302538 uv302538 ondisk = 0) v6 ==== 174+0+0 (314638285 0 0)
0x7f85b8000900 con 0x1f0ea90
2014-12-02 09:07:59.738415 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.10:6803/3331 -- osd_op(client.6359480.0:14769 gc.1286 
[call
lock.lock] 6.61beb660 ondisk+write e19074) v4 -- ?+0 0x1f30a80 con
0x1f1ce50
2014-12-02 09:07:59.764523 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.1
10.150.2.10:6803/3331 145 ==== osd_op_reply(14769 gc.1286 [call]
v19074'136677 uv136677 ondisk = 0) v6 ==== 174+0+0 (3559159328 0 
0)
0x7f8540002dd0 con 0x1f1ce50
2014-12-02 09:07:59.764670 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.10:6803/3331 -- osd_op(client.6359480.0:14770 gc.1286 
[call
rgw.gc_list] 6.61beb660 ack+read e19074) v4 -- ?+0 0x1f1f560 con
0x1f1ce50
2014-12-02 09:07:59.767836 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.1
10.150.2.10:6803/3331 146 ==== osd_op_reply(14770 gc.1286 [call] 
v0'0
uv136677 ondisk = 0) v6 ==== 174+0+11 (1532209036 0 1993775135)
0x7f8540002dd0 con 0x1f1ce50
2014-12-02 09:07:59.768009 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.10:6803/3331 -- osd_op(client.6359480.0:14771 gc.1286 
[call
lock.unlock] 6.61beb660 ondisk+write e19074) v4 -- ?+0 0x1f30a80 
con
0x1f1ce50
2014-12-02 09:07:59.780792 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.1
10.150.2.10:6803/3331 147 ==== osd_op_reply(14771 gc.1286 [call]
v19074'136678 uv136678 ondisk = 0) v6 ==== 174+0+0 (1412649901 0 
0)
0x7f8540002dd0 con 0x1f1ce50
2014-12-02 09:07:59.780968 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.12:6833/4016 -- osd_op(client.6359480.0:14772 gc.1287 
[call
lock.lock] 6.9bfdc387 ondisk+write e19074) v4 -- ?+0 0x1f1f560 con
0x1f3b2f0
2014-12-02 09:07:59.804154 7f86277f8700  1 -- 10.150.2.2:0/1009990 
<==
osd.65 10.150.2.12:6833/4016 118 ==== osd_op_reply(14772 gc.1287 
[call]
v19074'279063 uv279063 ondisk = 0) v6 ==== 174+0+0 (4117862336 0 
0)
0x7f85d80024b0 con 0x1f3b2f0
2014-12-02 09:07:59.804279 7f8630ba57c0  1 -- 10.150.2.2:0/1009990 
-->
10.150.2.12:6833/4016 -- osd_op(client.6359480.0:14773 gc.1287 
[call
rgw.gc_list] 6.9bfdc387 ack+read e19074) v4 -- ?+0 0x1f30a80 con
0x1f3b2f0

Can you find the part of the log where it's handling a shard that
actually has an object inside? These are all just iterating over 
empty
shards.

Yehuda

How can I tell if the shard has an object in it from the logs?

Search for a different sequence (e.g., search for rgw.gc_remove).

Yehuda

0 Results in the logs for rgw.gc_remove

Well, something is modifying the gc log. Do you happen to have more
than one radosgw running on the same cluster?

Yehuda

We have 2 radosgw servers
obj01 and obj02
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com