When I created the cluster I made a mistake in the configuration and set the split parameter to 32 and merge to 40, so 32 * 40 * 16 = 20480 files per directory.
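In ceph.conf terms that was roughly the following (the formula is my understanding of how FileStore derives the split point from these two options):

[osd]
filestore split multiple  = 32
filestore merge threshold = 40
# a directory is split once it holds more than about
#   filestore_split_multiple * filestore_merge_threshold * 16
#   = 32 * 40 * 16 = 20480 files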
After that I changed split to 8 and increased pg_num and pgp_num from 2048 to 4096 for the pool where the problem occurs. While it was backfilling I observed that placement groups were backfilling from one set of 3 OSDs to another set of 3 OSDs (replicated size = 3), so I concluded that PGs are completely recreated when pg_num and pgp_num are increased, and that after this process the number of files per directory would be OK. But when the backfilling finished I found many directories in this pool with ~20 000 files. Why did increasing pg_num not help? Or will some of these files be deleted with some delay after this process?
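For completeness, the increase was done with the usual pool commands, along these lines:

$ ceph osd pool set default.rgw.buckets.data pg_num 4096
$ ceph osd pool set default.rgw.buckets.data pgp_num 4096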
I couldn't find any information about the directory split process in the logs, even with osd and filestore debug set to 20. What pattern, and in which log, do I need to grep to find it?
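In the meantime I check the fan-out directly on disk with a rough one-liner like this (FileStore on-disk layout assumed, osd.33 only as an example; it is slow on a full OSD):

$ find /var/lib/ceph/osd/ceph-33/current -type d \
    | while read d; do echo "$(find "$d" -maxdepth 1 -type f | wc -l) $d"; done \
    | sort -rn | head

It prints the directories that directly contain the most files.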
On 10.05.2017 10:36, Piotr Nowosielski wrote:
You can:
- change these parameters and apply them with ceph-objectstore-tool
- add an OSD host - the resulting data rebalance will reduce the number of files in the
directories
- wait until the "split" operations are over ;-)
In our case we could afford to wait until the "split" operations were over (we
have 2 clusters in slightly different configurations storing the same data).
hint:
When creating a new pool, use the parameter "expected_num_objects"
https://www.suse.com/documentation/ses-4/book_storage_admin/data/ceph_pools_operate.html
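A rough sketch of that, following the documented synopsis (all values are placeholders, and the exact argument order can differ between releases, so check the docs for your version):

$ ceph osd pool create <pool-name> <pg-num> <pgp-num> replicated <crush-ruleset-name> <expected-num-objects>

The idea is that FileStore then pre-creates the directory structure at pool creation time instead of splitting directories later under load.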
Piotr Nowosielski
Senior Systems Engineer
Zespół Infrastruktury 5
Grupa Allegro sp. z o.o.
Tel: +48 512 08 55 92
-----Original Message-----
From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
Sent: Wednesday, May 10, 2017 9:19 AM
To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
ceph-users@xxxxxxxxxxxxxx
Subject: Re: All OSD fails after few requests to RGW
How did you solve it? Did you set new split/merge thresholds and then apply them manually on each OSD with:

ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-${osd_num} \
    --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
    --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
    --op apply-layout-settings --pool default.rgw.buckets.data

How can I see in the logs that a split occurs?
On 10.05.2017 10:13, Piotr Nowosielski wrote:
Hey,
We had similar problems. Look for information on "filestore merge and
split".
A short explanation:
After a directory reaches a certain number of files (it depends on the
'filestore merge threshold' and 'filestore split multiple' parameters),
the OSD rebuilds the structure of that directory.
As files arrive, the OSD creates new subdirectories and moves some of
the files into them.
As files disappear, the OSD reduces the number of subdirectories again.
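To check what a given OSD is actually running with, you can ask it over its admin socket, e.g. (osd.0 just as an example):

$ ceph daemon osd.0 config show | grep -E 'filestore_(merge_threshold|split_multiple)'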
--
Piotr Nowosielski
Senior Systems Engineer
Zespół Infrastruktury 5
Grupa Allegro sp. z o.o.
Tel: +48 512 08 55 92
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
Of Anton Dmitriev
Sent: Wednesday, May 10, 2017 8:14 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: All OSD fails after few requests to RGW
Hi!
I increased pg_num and pgp_num for the pool default.rgw.buckets.data from
2048 to 4096, and it seems the situation became a bit better: the cluster
now dies after 20-30 PUTs instead of after the first one. Could someone
please give me some recommendations on how to rescue the cluster?
On 27.04.2017 09:59, Anton Dmitriev wrote:
The cluster had been running well for a long time, but last week OSDs
started to fail.
We use the cluster as image storage for OpenNebula with a small load, and
as object storage with a high load.
Sometimes the disks of some OSDs are utilized at 100 %: iostat shows
avgqu-sz over 1000 while reading or writing only a few kilobytes per
second, the OSDs on these disks become unresponsive, and the cluster marks
them down. We lowered the load on the object storage and the situation
became better.
Yesterday the situation became worse:
If the RGWs are disabled and there are no requests to the object storage,
the cluster performs well, but if I enable the RGWs and make a few PUTs or
GETs, all non-SSD OSDs on all storage nodes end up in the same situation
described above.
iotop shows that xfsaild/<disk> is hammering the disks.
Running trace-cmd record -e xfs\* for 10 seconds produces about 10 million
lines; as I understand it, that means ~360 000 objects to push per OSD over
those 10 seconds:
$ wc -l t.t
10256873 t.t
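For reference, t.t was produced roughly like this ('sleep 10' only bounds the recording window):

$ trace-cmd record -e 'xfs*' sleep 10
$ trace-cmd report > t.t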
Fragmentation on one of these disks is about 3 %.
More information about the cluster:
https://yadi.sk/d/Y63mXQhl3HPvwt
and debug logs for osd.33 while the problem occurs:
https://yadi.sk/d/kiqsMF9L3HPvte
debug_osd = 20/20
debug_filestore = 20/20
debug_tp = 20/20
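For reference, they were raised at runtime with something along the lines of (osd.33 as an example; setting them in ceph.conf and restarting works as well):

$ ceph tell osd.33 injectargs '--debug-osd 20/20 --debug-filestore 20/20 --debug-tp 20/20'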
Ubuntu 14.04
$ uname -a
Linux storage01 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29
20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Ceph 10.2.7
7 storage nodes: Supermicro, 28 OSDs on 4 TB 7200 rpm disks (JBOD) +
journals on a RAID10 of 4 Intel 3510 800 GB SSDs + 2 SSD OSDs on Intel 3710
400 GB for RGW meta and index.
One of these nodes differs only in the number of OSDs: it has 26 OSDs on
4 TB disks instead of 28.
The storage nodes are connected to each other by bonded 2x10 Gbit; clients
connect to the storage nodes by bonded 2x1 Gbit.
5 nodes have 2 x CPU E5-2650v2 and 256 GB RAM, 2 nodes have 2 x CPU
E5-2690v3 and 512 GB RAM.
7 mons
3 rgw
Please help me rescue the cluster.
--
Dmitriev Anton