Using ceph-objectstore-tool apply-layout-settings I applied the new
layout settings on all storage nodes:
filestore_merge_threshold = 40
filestore_split_multiple = 8
I checked some directories on the OSDs: there were 1200-2000 files per
folder. With these settings a split will occur at 5120 files per folder.
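(That 5120 figure follows from the usual filestore formula, the same one
behind the 32*40*16 = 20480 calculation quoted further down in this
thread; the factor of 16 is a constant built into filestore:

    split point = filestore_split_multiple * abs(filestore_merge_threshold) * 16
                = 8 * 40 * 16 = 5120 files per folder)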
But the problem still exists: after I PUT 25-50 objects to RGW, one of
the OSD disks becomes 100% busy, iotop shows that xfsaild keeps it busy,
and the number of slow requests grows to 800-1000. After some time the
busy OSD hits the suicide timeout and restarts, and the cluster works
well until the next write to RGW.
0> 2017-05-21 12:02:26.105597 7f23bd1fa700 -1
common/HeartbeatMap.cc: In function 'bool
ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const
char*, time_t)' thread 7f23bd1fa700 time 2017-05-21
12:02:26.050994
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide
timeout")
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0x5620cbbb56db]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
char const*, long)+0x2b1) [0x5620cbafba91]
3: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x5620cbafc256]
4: (OSD::handle_osd_ping(MOSDPing*)+0x8e2) [0x5620cb557752]
5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x5620cb55899b]
6: (DispatchQueue::fast_dispatch(Message*)+0x76) [0x5620cbc71906]
7: (Pipe::reader()+0x1d38) [0x5620cbcaceb8]
8: (Pipe::Reader::entry()+0xd) [0x5620cbcb4a0d]
9: (()+0x8184) [0x7f244a0ea184]
10: (clone()+0x6d) [0x7f2448215bed]
NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
2017-05-21 12:02:26.161763 7f23bd1fa700 -1 *** Caught signal
(Aborted) **
in thread 7f23bd1fa700 thread_name:ms_pipe_read
On 11.05.2017 20:11, David Turner wrote:
I honestly haven't investigated the command line
structure that it would need, but that looks about what I'd
expect.
I'm on
Jewel 10.2.7
Do you mean this:
ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-${osd_num} --journal-path
/var/lib/ceph/osd/ceph-${osd_num}/journal
--log-file=/var/log/ceph/objectstore_tool.${osd_num}.log
--op apply-layout-settings --pool default.rgw.buckets.data
--debug
?
And before running it I need to stop the OSD and flush its journal?
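In other words, the full per-OSD sequence I have in mind is roughly the
following (a sketch only - we are on Ubuntu 14.04 with upstart, and the
noout flag is my assumption about good practice, not something anyone
confirmed here):

    ceph osd set noout                      # avoid rebalancing while the OSD is down
    stop ceph-osd id=${osd_num}             # 'systemctl stop ceph-osd@${osd_num}' on systemd hosts
    ceph-osd -i ${osd_num} --flush-journal  # flush the journal before offline work

    ceph-objectstore-tool \
        --data-path /var/lib/ceph/osd/ceph-${osd_num} \
        --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
        --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
        --op apply-layout-settings --pool default.rgw.buckets.data

    start ceph-osd id=${osd_num}            # bring the OSD back in
    ceph osd unset noout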
On 11.05.2017 14:52, David Turner wrote:
If you are on the current release of Ceph,
Hammer 0.94.10 or Jewel 10.2.7, you have it already. I
don't remember which release it came out in, but it's
definitely in the current releases.
"recent
enough version of the ceph-objectstore-tool" - that
sounds very interesting. Will it be released in
one of the next Jewel minor releases?
On 10.05.2017 19:03, David Turner wrote:
PG subfolder splitting is the
primary reason people are going to be deploying
Luminous and Bluestore much faster than any
other major release of Ceph. Bluestore removes
the concept of subfolders in PGs.
I have had clusters that reached what
seemed to be a hardcoded maximum of 12,800 objects
in a subfolder. It would take an
osd_heartbeat_grace of 240 or 300 to let them
finish splitting their subfolders without
being marked down. Recently I came across a
cluster that had a setting of 240 objects per
subfolder before splitting, so it was
splitting all the time, and several of the
OSDs took longer than 30 seconds to finish
splitting into subfolders. That led to more
problems as we started adding backfilling on
top of everything, and we lost a significant
amount of throughput on the cluster.
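(If anyone needs to raise that grace, the setting itself is nothing
exotic - a minimal sketch; whether you put it in ceph.conf or inject it
at runtime, and the exact value, are judgement calls rather than
anything tested in this thread:

    # ceph.conf, in [global] so both the OSDs and the mons honour the
    # longer grace period
    [global]
        osd heartbeat grace = 300

    # or temporarily at runtime
    ceph tell osd.* injectargs '--osd_heartbeat_grace 300'
)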
I have yet to manage a cluster with a
recent enough version of the
ceph-objectstore-tool (hopefully I'll have one
this month) that includes the ability to take
an OSD offline, split the subfolders, then
bring it back online. If you set up a way to
monitor how big your subfolders are getting,
you can leave the ceph settings as high as you
want, and then go in and perform maintenance
on your cluster one failure domain at a time,
splitting all of the PG subfolders on the
OSDs. This approach would prevent this from
ever happening in the wild.
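A crude way to do that monitoring, just as a sketch (the path layout is
the standard filestore one; how you report on it is up to you):

    # count the files directly contained in each DIR_* subfolder of one
    # OSD's filestore and show the fullest ones
    find /var/lib/ceph/osd/ceph-${osd_num}/current -mindepth 2 -type d -name 'DIR_*' |
    while read -r d; do
        printf '%s %s\n' "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"
    done | sort -rn | head

Compare those counts against whatever your configured split point works
out to.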
It is difficult for me to say clearly why some PGs have not been
migrated. Crushmap settings? The weight of the OSDs?
One thing is certain: you will not find any information about the split
process in the logs ...
pn
-----Original Message-----
From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
Sent: Wednesday, May 10, 2017 10:14 AM
To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
ceph-users@xxxxxxxxxxxxxx
Subject: Re: All OSD fails after few requests to RGW
When I created the cluster I made a mistake in the configuration and
set the split parameter to 32 and merge to 40, so 32*40*16 = 20480
files per folder. After that I changed split to 8 and increased the
number of PGs and PGPs from 2048 to 4096 for the pool where the problem
occurs. While it was backfilling I observed that placement groups were
backfilling from one set of 3 OSDs to another set of 3 OSDs (replicated
size = 3), so I concluded that PGs are completely recreated when pg_num
and pgp_num are increased for a pool, and that after this process the
number of files per directory should be OK. But when backfilling
finished I found many directories in this pool with ~20 000 files. Why
did increasing the PG number not help? Or maybe after this process some
files will be deleted with some delay?
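(The increase itself was just the usual pool resize, done roughly like
this - quoting from memory rather than pasting my shell history:

    ceph osd pool set default.rgw.buckets.data pg_num 4096
    # wait for the new PGs to be created, then
    ceph osd pool set default.rgw.buckets.data pgp_num 4096
)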
I couldn't find any information about the directory split process in
the logs, even with osd and filestore debug at 20. What pattern, and in
which log, do I need to grep for to find it?
On 10.05.2017 10:36, Piotr Nowosielski wrote:
> You can:
> - change these parameters and use ceph-objectstore-tool
> - add an OSD host - rebuilding the cluster will reduce the number of
>   files in the directories
> - wait until the "split" operations are over ;-)
>
> In our case, we could afford to wait until the "split" operation is
> over (we have 2 clusters in slightly different configurations storing
> the same data)
>
> hint:
> When creating a new pool, use the parameter "expected_num_objects"
> https://www.suse.com/documentation/ses-4/book_storage_admin/data/ceph_pools_operate.html
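> For illustration only, a sketch of what that looks like - the pg
> counts, ruleset name and object count below are placeholders to be
> replaced with your own values:
>
>     ceph osd pool create default.rgw.buckets.data 4096 4096 replicated \
>         replicated_ruleset 400000000
>
> so the pool's directories can be pre-split at creation time instead of
> being split later under load.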
>
> Piotr Nowosielski
> Senior Systems Engineer
> Zespół Infrastruktury 5
> Grupa Allegro sp. z o.o.
> Tel: +48 512 08 55 92
>
>
> -----Original Message-----
> From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
> Sent: Wednesday, May 10, 2017 9:19 AM
> To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
> ceph-users@xxxxxxxxxxxxxx
> Subject: Re: All OSD fails after few requests to RGW
>
> How did you solve it? Did you set new split/merge thresholds and then
> manually apply them by running
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num}
> --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal
> --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log
> --op apply-layout-settings --pool default.rgw.buckets.data
>
> on each OSD?
>
> How can I see in the logs that a split occurs?
>
> On 10.05.2017 10:13, Piotr Nowosielski wrote:
>> Hey,
>> We had similar problems. Look for information on "Filestore merge and
>> split".
>>
>> Some explanation:
>> After reaching a certain number of files in a directory (it depends
>> on the 'filestore merge threshold' and 'filestore split multiple'
>> parameters), the OSD rebuilds the structure of that directory.
>> If files keep arriving, the OSD creates new subdirectories and moves
>> some of the files there.
>> If files are removed, the OSD will reduce the number of
>> subdirectories.
>>
>>
>> --
>> Piotr Nowosielski
>> Senior Systems Engineer
>> Zespół Infrastruktury 5
>> Grupa Allegro sp. z o.o.
>> Tel: +48 512 08 55 92
>>
>> Grupa Allegro Sp. z o.o., with its registered office in Poznań,
>> 60-166 Poznań, ul. Grunwaldzka 182, entered in the register of
>> entrepreneurs kept by the District Court Poznań - Nowe Miasto i
>> Wilda, 8th Commercial Division of the National Court Register, under
>> KRS number 0000268796, with share capital of PLN 33,976,500.00, tax
>> identification number NIP: 5272525995.
>>
>>
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> Of Anton Dmitriev
>> Sent: Wednesday, May 10, 2017 8:14 AM
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: All OSD fails after few requests to RGW
>>
>> Hi!
>>
>> I increased pg_num and pgp_num for pool default.rgw.buckets.data
>> from 2048 to 4096, and it seems that the situation became a bit
>> better: the cluster dies after 20-30 PUTs, not after the first one.
>> Could someone please give me some recommendations on how to rescue
>> the cluster?
>>
>> On 27.04.2017 09:59, Anton Dmitriev wrote:
>>> The cluster was running well for a long time, but last week OSDs
>>> started to fail.
>>> We use the cluster as image storage for OpenNebula with a small
>>> load, and as object storage with a high load.
>>> Sometimes the disks of some OSDs are utilized at 100%, iostat shows
>>> avgqu-sz over 1000 while reading or writing only a few kilobytes per
>>> second, and the OSDs on those disks become unresponsive, so the
>>> cluster marks them down.
>>> We lowered the load on the object storage and the situation became
>>> better.
>>>
>>> Yesterday the situation became worse:
>>> If the RGWs are disabled and there are no requests to the object
>>> storage, the cluster performs well; but if we enable the RGWs and
>>> make a few PUTs or GETs, all non-SSD OSDs on all storage nodes end
>>> up in the same situation described above.
>>> iotop shows that xfsaild/<disk> is burning the disks.
>>>
>>> trace-cmd record -e xfs\* for 10 seconds shows 10 million objects;
>>> as I understand it, that means ~360 000 objects to push per OSD in
>>> 10 seconds.
>>> $ wc -l t.t
>>> 10256873 t.t
>>>
>>> Fragmentation on one of these disks is about 3%.
>>>
>>> More information about the cluster:
>>>
>>> https://yadi.sk/d/Y63mXQhl3HPvwt
>>>
>>> Also debug logs for osd.33 while the problem occurs:
>>>
>>> https://yadi.sk/d/kiqsMF9L3HPvte
>>>
>>> debug_osd = 20/20
>>> debug_filestore = 20/20
>>> debug_tp = 20/20
>>>
>>>
>>>
>>> Ubuntu 14.04
>>> $ uname -a
>>> Linux storage01 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29
>>> 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Ceph 10.2.7
>>>
>>> 7 storage nodes: Supermicro, 28 OSDs on 4 TB 7200 rpm JBOD disks +
>>> journals on a RAID 10 of 4 Intel S3510 800 GB SSDs + 2 SSD OSDs on
>>> Intel S3710 400 GB for RGW meta and index.
>>> One of these storage nodes differs only in the number of OSDs: it
>>> has 26 OSDs on 4 TB disks instead of 28 like the others.
>>>
>>> Storage nodes connect to each other over bonded 2x10 Gbit; clients
>>> connect to the storage nodes over bonded 2x1 Gbit.
>>>
>>> 5 storage nodes have 2 x CPU E5-2650v2 and 256 GB RAM, and 2 storage
>>> nodes have 2 x CPU E5-2690v3 and 512 GB RAM.
>>>
>>> 7 mons
>>> 3 rgw
>>>
>>> Please help me rescue the cluster.
>>>
>>>
>> --
>> Dmitriev Anton
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Dmitriev Anton
--
Dmitriev Anton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Dmitriev Anton