Re: rgw leaking data, orphan search loop


 



On Sat, Dec 24, 2016 at 2:47 PM, Wido den Hollander <wido@xxxxxxxx> wrote:

> On 23 December 2016 at 16:05, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>
>
> > > On 22 December 2016 at 19:00, Orit Wasserman <owasserm@xxxxxxxxxx> wrote:
> >
> >
> > Hi Marius,
> >
> > On Thu, Dec 22, 2016 at 12:00 PM, Marius Vaitiekunas
> > <mariusvaitiekunas@xxxxxxxxx> wrote:
> > > On Thu, Dec 22, 2016 at 11:58 AM, Marius Vaitiekunas
> > > <mariusvaitiekunas@xxxxxxxxx> wrote:
> > >>
> > >> Hi,
> > >>
> > >> 1) I've written to this mailing list before, but let me try one more time. We have
> > >> been having big issues with rgw on jewel recently because of leaked data - the rate is
> > >> about 50GB/hour.
> > >>
> > >> We've hit this bug:
> > >> rgw: fix put_acls for objects starting and ending with underscore
> > >> (issue#17625, pr#11669, Orit Wasserman)
> > >>
> > >> We upgraded to jewel 10.2.5 - no luck.
> > >>
> > >> We've also hit this one:
> > >> rgw: RGW loses realm/period/zonegroup/zone data: period overwritten if
> > >> somewhere in the cluster is still running Hammer (issue#17371, pr#11519,
> > >> Orit Wasserman)
> > >>
> > >> We fixed the zonemaps - also no luck.
> > >>
> > >> We do not use multisite - only the default realm, zonegroup and zone.
> > >>
> > >> We have no more ideas about how this data leak could happen. GC is working -
> > >> we can see it in the rgw logs.
> > >>
> > >> Could someone give us a hint about this? Where should we look?
> > >>
> > >>
> > >> 2) The other issue is removing all the leaked/orphan objects.
> > >> 'radosgw-admin orphans find' gets stuck in a loop at the stage where it starts
> > >> linking objects.
> > >>
> > >> We've tried changing the number of shards to 16, 64 (the default) and 512. At
> > >> the moment it's running with 1 shard.
> > >>
> > >> Again, any ideas on how to make the orphan search complete?
> > >>
> > >>
> > >> I can provide any logs, configs, etc. if someone is willing to help with
> > >> this case.
> > >>
> > >>
> >
> > How many buckets do you have? How many objects are in each?
> > Can you provide the output of 'rados ls -p .rgw.buckets'?
>
> Marius asked me to look into this for him, so I did.
>
> What I found is that at *least* three buckets have way more RADOS objects than they should.
>
> The .rgw.buckets pool has 35,651,590 objects totaling 76,880 GB.
>
> I listed all objects in the .rgw.buckets pool and summed them per bucket, the top 5:
>
>  783844 default.25918901.102486
>  876013 default.25918901.3
> 3325825 default.24201682.7
> 6324217 default.84795862.29891
> 7805208 default.25933378.233873
>
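
(For reference, the per-bucket counts above can be reproduced with something along these lines; this is a minimal sketch that assumes the usual '<bucket_marker>_<key>' RADOS object naming, not the exact commands I used:)

# usage: rados ls -p .rgw.buckets | python count_per_bucket.py
import re
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    # bucket markers look like "default.<id>.<n>"; everything after the
    # first "_" is the object key or a __shadow_/__multipart_ suffix
    m = re.match(r'^(default\.\d+\.\d+)_', line)
    if m:
        counts[m.group(1)] += 1

for marker, n in counts.most_common(10):
    print("%8d %s" % (n, marker))
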
> So I started to rados_stat() (using Python) all the objects in those last three buckets. These stat() calls are still running; I have statted about 30% of the objects so far and their total size is already 17511GB (~17TB).
>
> The size_kb_actual of buckets default.24201682.7, default.84795862.29891 and default.25933378.233873 sums up to 12TB.
>
> So I'm currently at about 30% of the stat calls and I'm already 5TB over the total size of these buckets.
>

The stat calls have finished. The grand total is 65TB.

So while the buckets should consume only 12TB, they seem to occupy 65TB of storage.
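
(The stat pass itself was just python-rados; a simplified sketch of it, assuming the object names for the three bucket markers were dumped to a file first - not the exact script:)

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('.rgw.buckets')

total = 0
with open('objects.txt') as f:            # one RADOS object name per line
    for name in f:
        name = name.strip()
        if not name:
            continue
        try:
            size, mtime = ioctx.stat(name)    # (size in bytes, mtime)
            total += size
        except rados.ObjectNotFound:
            pass                              # may have been GC'ed in the meantime

print("total bytes: %d (%.1f TB)" % (total, total / 1024.0 ** 4))

ioctx.close()
cluster.shutdown()
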

All these leaking buckets have one thing in common - the Hadoop S3A client (https://wiki.apache.org/hadoop/AmazonS3) is used. And some of the objects have long names with many underscores. For example:
dt=20160814-060014-911/_temporary/0/_temporary/attempt_201608140600_0001_m_000003_339/part-00003.gz
dt=20160814-083014-948/_temporary/0/_temporary/attempt_201608140830_0001_m_000006_294/part-00006.gz
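
(Given the put_acls/underscore bug mentioned above (issue #17625), it might be worth counting how many of these keys actually have a path component starting with an underscore. A rough filter over the rados listing; the marker stripping is an assumption, not something I have run:)

import sys

hits = 0
for line in sys.stdin:
    name = line.strip()
    if '__shadow_' in name or '__multipart_' in name:
        continue                          # only look at head objects here
    key = name.split('_', 1)[-1]          # drop the bucket marker prefix
    if any(part.startswith('_') for part in key.split('/')):
        hits += 1

print("head objects with a leading-underscore path component: %d" % hits)
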
 

> What I noticed is that it's mainly *shadow* objects, which are all 4MB in size.
>
> I know that 'radosgw-admin orphans find --pool=.rgw.buckets --job-id=xyz' should also do this for me, but as mentioned, this keeps looping and hangs.
>

I started this tool about 20 hours ago:

# radosgw-admin orphans find --pool=.rgw.buckets --job-id=wido1 --debug-rados=10 2>&1|gzip > orphans.find.wido1.log.gz

It now shows me this in the logs while it is still running:

2016-12-24 13:41:00.989876 7ff6844d29c0 10 librados: omap-set-vals oid=orphan.scan.wido1.linked.27 nspace=
2016-12-24 13:41:00.993271 7ff6844d29c0 10 librados: Objecter returned from omap-set-vals r=0
storing 2 entries at orphan.scan.wido1.linked.28
2016-12-24 13:41:00.993311 7ff6844d29c0 10 librados: omap-set-vals oid=orphan.scan.wido1.linked.28 nspace=
storing 1 entries at orphan.scan.wido1.linked.31
2016-12-24 13:41:00.995698 7ff6844d29c0 10 librados: Objecter returned from omap-set-vals r=0
2016-12-24 13:41:00.995787 7ff6844d29c0 10 librados: omap-set-vals oid=orphan.scan.wido1.linked.31 nspace=
storing 1 entries at orphan.scan.wido1.linked.33
2016-12-24 13:41:00.997730 7ff6844d29c0 10 librados: Objecter returned from omap-set-vals r=0
2016-12-24 13:41:00.997776 7ff6844d29c0 10 librados: omap-set-vals oid=orphan.scan.wido1.linked.33 nspace=
2016-12-24 13:41:01.000161 7ff6844d29c0 10 librados: Objecter returned from omap-set-vals r=0
storing 1 entries at orphan.scan.wido1.linked.35
2016-12-24 13:41:01.000225 7ff6844d29c0 10 librados: omap-set-vals oid=orphan.scan.wido1.linked.35 nspace=
2016-12-24 13:41:01.002102 7ff6844d29c0 10 librados: Objecter returned from omap-set-vals r=0
storing 1 entries at orphan.scan.wido1.linked.36
2016-12-24 13:41:01.002167 7ff6844d29c0 10 librados: omap-set-vals oid=orphan.scan.wido1.linked.36 nspace=
storing 1 entries at orphan.scan.wido1.linked.39
2016-12-24 13:41:01.004397 7ff6844d29c0 10 librados: Objecter returned from omap-set-vals r=0

It seems to still be doing something, is that correct?
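
(One crude way I can think of to check whether it is making progress: the scan keeps its intermediate results as omap entries on the orphan.scan.<job-id>.* objects, so counting those entries every few minutes should show growth. A sketch - the '.log' pool name and the 64 shards are assumptions for a default jewel setup:)

import subprocess

pool = '.log'
job = 'wido1'
total = 0
for shard in range(64):                   # adjust to the shard count the job was started with
    obj = 'orphan.scan.%s.linked.%d' % (job, shard)
    try:
        out = subprocess.check_output(['rados', '-p', pool, 'listomapkeys', obj])
    except subprocess.CalledProcessError:
        continue                          # shard object may not exist (yet)
    total += len(out.splitlines())

print('linked omap entries so far: %d' % total)
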

Wido

> So for now I'll probably resort to figuring out which RADOS objects are obsolete by matching against the bucket's index, but that's a lot of manual work.
>
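
(If I do go that route, the starting point would probably be something like the sketch below: build the set of expected head-object names from the bucket listing and flag RADOS objects that are not referenced. This only covers head objects - tail/__shadow_ objects can only be verified via the manifest from 'radosgw-admin object stat', which is exactly what 'orphans find' is supposed to automate. File names and marker handling here are assumptions:)

# expected.txt : keys extracted from 'radosgw-admin bucket list --bucket=<name>', one per line
# rados.txt    : 'rados ls -p .rgw.buckets' output filtered on the bucket marker
marker = 'default.24201682.7'

with open('expected.txt') as f:
    expected = set(marker + '_' + line.rstrip('\n') for line in f if line.strip())

with open('rados.txt') as f:
    for line in f:
        name = line.rstrip('\n')
        if '__shadow_' in name or '__multipart_' in name:
            continue                      # tail objects: need the manifest, skip here
        if name not in expected:
            print(name)                   # candidate orphan head object
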
> I'd rather fix 'orphans find', so I will probably run it with high logging enabled so we can gather some useful information.
>
> In the meantime, any hints or suggestions?
>
> The cluster is running v10.2.5 btw.
>
> >
> > Orit
> >
> > >
> > > Sorry. I forgot to mention, that we've registered two issues on tracker:
> > > http://tracker.ceph.com/issues/18331
> > > http://tracker.ceph.com/issues/18258
> > >
> > > --
> > > Marius Vaitiekūnas
> > >



--
Marius Vaitiekūnas
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
