Re: All OSD fails after few requests to RGW

David Turner <drakonstein@xxxxxxxxx> · Thu, 11 May 2017 11:52:25 +0000

If you are on the current release of Ceph Hammer 0.94.10 or Jewel 10.2.7, you have it already. I don't remember which release it came out in, but it's definitely in the current releases.

On Thu, May 11, 2017, 12:24 AM Anton Dmitriev <tech@xxxxxxxxxx> wrote:

    "recent enough version of the
      ceph-objectstore-tool" - sounds very interesting. Would it be
      released in one of the next Jewel minor releases?

      On 10.05.2017 19:03, David Turner wrote:

      PG subfolder splitting is the primary reason people
        are going to be deploying Luminous and Bluestore much faster
        than any other major release of Ceph.  Bluestore removes the
        concept of subfolders in PGs.

        I have had clusters that reached what seemed a hardcoded
          maximum of 12,800 objects in a subfolder.  It would take an
          osd_heartbeat_grace of 240 or 300 to let them finish splitting
          their subfolders without being marked down.  Recently I came
          across a cluster that had a setting of 240 objects per
          subfolder before splitting, so it was splitting all the time,
          and several of the OSDs took longer than 30 seconds to finish
          splitting into subfolders.  That led to more problems as we
          started adding backfilling to everything and we lost a
          significant amount of throughput on the cluster.

        I have yet to manage a cluster with a recent enough version
          of the ceph-objectstore-tool (hopefully I'll have one this
          month) that includes the ability to take an osd offline, split
          the subfolders, then bring it back online.  If you set up a
          way to monitor how big your subfolders are getting, you can
          leave the ceph settings as high as you want, and then go in
          and perform maintenance on your cluster 1 failure domain at a
          time, splitting all of the PG subfolders on the OSDs.  This
          approach would keep this from ever happening in the wild.
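
A minimal sketch of the kind of subfolder monitoring described above, assuming a FileStore OSD whose PG directories live under a path such as /var/lib/ceph/osd/ceph-0/current (that path and the function names are illustrative assumptions, not something from this thread). It only counts files per directory and reports the fullest ones:

```python
import os
from typing import Iterator, List, Tuple


def subfolder_file_counts(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (directory, number of files directly inside it) for every directory under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        yield dirpath, len(filenames)


def largest_subfolders(root: str, limit: int = 5) -> List[Tuple[str, int]]:
    """Return the `limit` directories under root that hold the most files, fullest first."""
    counts = sorted(subfolder_file_counts(root), key=lambda item: item[1], reverse=True)
    return counts[:limit]
```

Calling largest_subfolders("/var/lib/ceph/osd/ceph-0/current") on an OSD host and comparing the top counts against the configured split point would show how close the subfolders are to splitting. Walking a large OSD is itself I/O heavy, so this would be run sparingly.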

            On Wed, May 10, 2017 at 5:37 AM Piotr
              Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>
              wrote:

            It is difficult for me to clearly state why some PGs have not
              been migrated.

              crushmap settings? Weight of OSD?

              One thing is certain - you will not find any information
              about the split process in the logs ...

              pn

              -----Original Message-----
              From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
              Sent: Wednesday, May 10, 2017 10:14 AM
              To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
              ceph-users@xxxxxxxxxxxxxx
              Subject: Re: All OSD fails after few requests to RGW

              When I created the cluster, I made a mistake in the configuration
              and set the split parameter to 32 and merge to 40, so
              32*40*16 = 20480 files per folder.

              After that I changed split to 8 and increased pg_num and pgp_num
              from 2048 to 4096 for the pool where the problem occurs. While it
              was backfilling, I observed that placement groups were backfilling
              from one set of 3 OSDs to another set of 3 OSDs (replicated size = 3),
              so I concluded that PGs are completely recreated when PG and PGP
              are increased for a pool, and that after this process the number of
              files per directory must be OK. But when backfilling finished I
              found many directories in this pool with ~20 000 files. Why did
              increasing the PG num not help? Or maybe some files will be deleted
              with some delay after this process?

              I couldn't find any information about the directory split process
              in the logs, even with osd and filestore debug 20. What pattern
              should I grep for, and in which log, to find it?
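
The arithmetic above can be sketched as follows, assuming the FileStore rule that a subfolder splits at filestore_split_multiple * abs(filestore_merge_threshold) * 16 files (the function name is made up for illustration):

```python
def filestore_split_threshold(split_multiple: int, merge_threshold: int) -> int:
    """Files per subfolder at which FileStore is expected to split the directory."""
    # abs() because a negative merge threshold only disables merging;
    # its magnitude still feeds the split calculation.
    return split_multiple * abs(merge_threshold) * 16


print(filestore_split_threshold(32, 40))  # 20480, the original misconfiguration
print(filestore_split_threshold(8, 40))   # 5120, after changing split to 8
```

With merge left at 40, lowering split to 8 would put the split point at 5120 files per folder, so directories still holding ~20 000 files are well past it.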

              On 10.05.2017 10:36, Piotr Nowosielski wrote:

              > You can:
              > - change these parameters and use ceph-objectstore-tool
              > - add an OSD host - rebuilding the cluster will reduce the number of
              >   files in the directories
              > - wait until the "split" operations are over ;-)
              >
              > In our case, we could afford to wait until the "split" operation is
              > over (we have 2 clusters in slightly different configurations storing
              > the same data)
              >
              > hint:
              > When creating a new pool, use the parameter "expected_num_objects"
              > https://www.suse.com/documentation/ses-4/book_storage_admin/data/ceph_pools_operate.html

              >
              > Piotr Nowosielski
              > Senior Systems Engineer
              > Zespół Infrastruktury 5
              > Grupa Allegro sp. z o.o.
              > Tel: +48 512 08 55 92
              >
              >
              > -----Original Message-----
              > From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
              > Sent: Wednesday, May 10, 2017 9:19 AM
              > To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
              > ceph-users@xxxxxxxxxxxxxx
              > Subject: Re: All OSD fails after few requests to RGW
              >

              > How did you solve it? Did you set new split/merge thresholds and
              > apply them manually with
              >
              > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} \
              >   --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
              >   --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
              >   --op apply-layout-settings --pool default.rgw.buckets.data
              >
              > on each OSD?
              >
              > How can I see in the logs that a split is occurring?
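
A small sketch of how the per-OSD invocation quoted above could be generated for several OSDs. It only builds and prints the command line, using exactly the flags from the message. The helper name and the OSD ids in the loop are hypothetical. Elsewhere in this thread the tool is described as operating on an OSD that has been taken offline, so the printed command is meant to be reviewed and run by hand, not executed blindly:

```python
from typing import List


def apply_layout_command(osd_num: int, pool: str) -> List[str]:
    """Build the ceph-objectstore-tool argv for one OSD, mirroring the command in the message."""
    osd_dir = f"/var/lib/ceph/osd/ceph-{osd_num}"
    return [
        "ceph-objectstore-tool",
        "--data-path", osd_dir,
        "--journal-path", f"{osd_dir}/journal",
        f"--log-file=/var/log/ceph/objectstore_tool.{osd_num}.log",
        "--op", "apply-layout-settings",
        "--pool", pool,
    ]


# Hypothetical OSD ids; substitute the OSDs of the failure domain under maintenance.
for osd_num in (33, 34):
    print(" ".join(apply_layout_command(osd_num, "default.rgw.buckets.data")))
```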

              >

              > On 10.05.2017 10:13, Piotr Nowosielski wrote:

              >> Hey,
              >> We had similar problems. Look for information on "Filestore merge and
              >> split".
              >>
              >> Some explanation:
              >> After a directory reaches a certain number of files (this depends on
              >> the 'filestore merge threshold' and 'filestore split multiple'
              >> parameters), the OSD rebuilds the structure of that directory.
              >> As files arrive, the OSD creates new subdirectories and moves
              >> some of the files there.
              >> As files are removed, the OSD reduces the number of
              >> subdirectories.

              >>
              >>
              >> --
              >> Piotr Nowosielski
              >> Senior Systems Engineer
              >> Zespół Infrastruktury 5
              >> Grupa Allegro sp. z o.o.
              >> Tel: +48 512 08 55 92
              >>
              >> Grupa Allegro Sp. z o.o., with its registered office in Poznań, 60-166
              >> Poznań, ul. Grunwaldzka 182, entered in the register of entrepreneurs
              >> kept by the District Court Poznań - Nowe Miasto i Wilda, 8th Commercial
              >> Division of the National Court Register, under KRS number 0000268796,
              >> with share capital of PLN 33,976,500.00 and tax identification number
              >> NIP: 5272525995.
              >>
              >>
              >>
              >> -----Original Message-----
              >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
              >> Of Anton Dmitriev
              >> Sent: Wednesday, May 10, 2017 8:14 AM
              >> To: ceph-users@xxxxxxxxxxxxxx
              >> Subject: Re: All OSD fails after few requests to RGW
              >>

              >> Hi!
              >>
              >> I increased pg_num and pgp_num for pool default.rgw.buckets.data from
              >> 2048 to 4096, and it seems that the situation became a bit better:
              >> the cluster dies after 20-30 PUTs, not after 1. Could someone please
              >> give me some recommendations on how to rescue the cluster?

              >>

              >> On 27.04.2017 09:59, Anton Dmitriev wrote:

              >>> The cluster was running well for a long time, but last week
              >>> OSDs started to fail.
              >>> We use the cluster as image storage for OpenNebula with a small load
              >>> and as object storage with a high load.
              >>> Sometimes the disks of some OSDs are 100% utilized, iostat shows
              >>> avgqu-sz over 1000 while reading or writing a few kilobytes per
              >>> second, the OSDs on these disks become unresponsive, and the cluster
              >>> marks them down.
              >>> We lowered the load on object storage and the situation became better.
              >>>
              >>> Yesterday the situation became worse:
              >>> If the RGWs are disabled and there are no requests to object storage,
              >>> the cluster performs well, but if we enable the RGWs and make a few
              >>> PUTs or GETs, all non-SSD OSDs on all storage servers end up in the
              >>> situation described above.
              >>> iotop shows that xfsaild/<disk> burns the disks.

              >>>

              >>> trace-cmd record -e xfs\* for 10 seconds shows 10 million objects;
              >>> as I understand it, that means ~360 000 objects to push per OSD
              >>> in 10 seconds
              >>>      $ wc -l t.t
              >>> 10256873 t.t
              >>>
              >>> fragmentation on one of these disks is about 3%

              >>>

              >>> more information about the cluster:
              >>>
              >>> https://yadi.sk/d/Y63mXQhl3HPvwt
              >>>
              >>> also debug logs for osd.33 while the problem occurs
              >>>
              >>> https://yadi.sk/d/kiqsMF9L3HPvte
              >>>
              >>> debug_osd = 20/20
              >>> debug_filestore = 20/20
              >>> debug_tp = 20/20

              >>>

              >>>

              >>>

              >>> Ubuntu 14.04
              >>> $ uname -a
              >>> Linux storage01 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
              >>>
              >>> Ceph 10.2.7
              >>>
              >>> 7 storage servers: Supermicro, 28 OSDs (4 TB, 7200, JBOD) + journal on
              >>> RAID10 of 4 SSDs (Intel 3510 800 GB) + 2 SSD OSDs (Intel 3710 400 GB)
              >>> for rgw meta and index
              >>> One of these servers differs only in the number of OSDs: it has 26 OSDs
              >>> on 4 TB disks instead of 28 like the others
              >>>
              >>> Storage servers connect to each other via bonded 2x10 Gbit
              >>> Clients connect to the storage servers via bonded 2x1 Gbit
              >>>
              >>> 5 servers have 2 x CPU E5-2650v2 and 256 GB RAM
              >>> 2 servers have 2 x CPU E5-2690v3 and 512 GB RAM
              >>>
              >>> 7 mons
              >>> 3 rgw

              >>>

              >>> Help me please to rescue the cluster.

              >>>

              >>>

              >> --

              >> Dmitriev Anton

              >>

              >> _______________________________________________

              >> ceph-users mailing list

              >> ceph-users@xxxxxxxxxxxxxx

              >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

              >

              > --

              > Dmitriev Anton

              --

              Dmitriev Anton

-- 
Dmitriev Anton
