I'm on Jewel 10.2.7
Do you mean this?

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} \
    --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
    --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
    --op apply-layout-settings --pool default.rgw.buckets.data --debug
And before running it, do I need to stop the OSD and flush its journal?
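In other words, something like this for each OSD? This is just my sketch of the sequence; it assumes upstart on Ubuntu 14.04 (adjust for your init system), and ${osd_num} is a placeholder:

    # stop the OSD so the object store is quiet
    stop ceph-osd id=${osd_num}
    # flush the journal before touching the store offline
    ceph-osd -i ${osd_num} --flush-journal
    # apply the new split/merge layout offline
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} \
        --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
        --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
        --op apply-layout-settings --pool default.rgw.buckets.data
    # bring the OSD back
    start ceph-osd id=${osd_num}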
On 11.05.2017 14:52, David Turner wrote:
If you are on the current release of Ceph Hammer
0.94.10 or Jewel 10.2.7, you have it already. I don't remember
which release it came out in, but it's definitely in the current
releases.
"recent
enough version of the ceph-objectstore-tool" - sounds very
interesting. Would it be released in one of next Jewel
minor releases?
On 10.05.2017 19:03, David Turner wrote:
PG subfolder splitting is the primary
reason people are going to be deploying Luminous and
Bluestore much faster than any other major release of
Ceph. Bluestore removes the concept of subfolders in
PGs.
I have had clusters that reached what seemed to be a hard-coded maximum
of 12,800 objects in a subfolder. It would take an osd_heartbeat_grace of
240 or 300 to let them finish splitting their subfolders without being
marked down. Recently I came across a cluster that had a setting of 240
objects per subfolder before splitting, so it was splitting all the time,
and several of the OSDs took longer than 30 seconds to finish splitting
into subfolders. That led to more problems as we started adding
backfilling on top of everything, and we lost a significant amount of
throughput on the cluster.
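For what it's worth, the grace can be raised on the fly while the splitting runs; a rough sketch (300 is just the value mentioned above, and injectargs changes do not survive a daemon restart):

    # raise the heartbeat grace on OSDs and mons while subfolders split
    ceph tell osd.* injectargs '--osd_heartbeat_grace 300'
    ceph tell mon.* injectargs '--osd_heartbeat_grace 300'

    # to make it permanent, set it in ceph.conf instead:
    # [global]
    # osd heartbeat grace = 300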
I have yet to manage a cluster with a recent enough
version of the ceph-objectstore-tool (hopefully I'll
have one this month) that includes the ability to take
an osd offline, split the subfolders, then bring it
back online. If you set up a way to monitor how big
your subfolders are getting, you can leave the ceph
settings as high as you want, and then go in and
perform maintenance on your cluster 1 failure domain
at a time, splitting all of the PG subfolders on the OSDs. This approach
would keep it from ever happening in the wild.
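A rough way to do that monitoring (only a sketch, assuming FileStore OSDs under the default /var/lib/ceph/osd path; ${osd_num} is a placeholder):

    # print the 20 fullest PG subfolders on one OSD, biggest first
    find /var/lib/ceph/osd/ceph-${osd_num}/current -mindepth 1 -type d \
        -exec sh -c 'printf "%s %s\n" "$(find "$1" -maxdepth 1 -type f | wc -l)" "$1"' _ {} \; \
        | sort -rn | head -20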
It is difficult for me to say clearly why some PGs have not been migrated.
CRUSH map settings? OSD weights? One thing is certain - you will not find
any information about the split process in the logs ...
pn
-----Original Message-----
From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
Sent: Wednesday, May 10, 2017 10:14 AM
To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
ceph-users@xxxxxxxxxxxxxx
Subject: Re: All OSD fails after few requests to RGW
When I created the cluster I made a mistake in the configuration and set
the split parameter to 32 and merge to 40, so 32 * 40 * 16 = 20480 files
per folder. After that I changed split to 8, and increased pg_num and
pgp_num from 2048 to 4096 for the pool where the problem occurs. While it
was backfilling I observed that placement groups were backfilling from one
set of 3 OSDs to another set of 3 OSDs (replicated size = 3), so I
concluded that PGs are completely recreated when increasing PG and PGP for
a pool, and that after this process the number of files per directory
should be OK. But when backfilling finished I found many directories in
this pool with ~20 000 files. Why did increasing pg_num not help? Or maybe
after this process some files will be deleted with some delay?
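Just to write out the arithmetic I am relying on (my understanding of the split threshold formula, so please correct me if it is wrong):

    a subfolder splits above: filestore_split_multiple * abs(filestore_merge_threshold) * 16
    old settings: 32 * 40 * 16 = 20480 files per subfolder
    new settings:  8 * 40 * 16 =  5120 files per subfolder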
I couldn't find any information about the directory split process in the
logs, even with osd and filestore debug at 20. What pattern, and in which
log, do I need to grep for to find it?
On 10.05.2017 10:36, Piotr Nowosielski wrote:
> You can:
> - change these parameters and use ceph-objectstore-tool
> - add an OSD host - the resulting rebalancing will reduce the number of
> files in the directories
> - wait until the "split" operations are over ;-)
>
> In our case, we could afford to wait until
the "split" operation is
> over (we have 2 clusters in slightly
different configurations storing
> the same data)
>
> hint:
> When creating a new pool, use the parameter
"expected_num_objects"
> https://www.suse.com/documentation/ses-4/book_storage_admin/data/ceph_pools_operate.html
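> A minimal sketch of that (the pool name, PG counts, ruleset and object
> count below are only placeholders):
>
> ceph osd pool create default.rgw.buckets.data 4096 4096 replicated \
>     replicated_ruleset 500000000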
>
> Piotr Nowosielski
> Senior Systems Engineer
> Zespół Infrastruktury 5
> Grupa Allegro sp. z o.o.
> Tel: +48 512 08 55 92
>
>
> -----Original Message-----
> From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
> Sent: Wednesday, May 10, 2017 9:19 AM
> To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
> ceph-users@xxxxxxxxxxxxxx
> Subject: Re: All OSD fails after few requests to RGW
>
> How did you solve it? Did you set new split/merge thresholds and manually
> apply them with
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} \
>     --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
>     --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
>     --op apply-layout-settings --pool default.rgw.buckets.data
>
> on each OSD?
>
> How can I see in the logs that a split occurs?
>
> On 10.05.2017 10:13, Piotr Nowosielski wrote:
>> Hey,
>> We had similar problems. Look for information on "filestore merge and
>> split".
>>
>> A short explanation:
>> After reaching a certain number of files in a directory (it depends on
>> the 'filestore merge threshold' and 'filestore split multiple'
>> parameters), the OSD rebuilds the structure of that directory.
>> If files keep arriving, the OSD creates new subdirectories and moves
>> some of the files there.
>> If files are removed, the OSD will reduce the number of subdirectories.
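>> If you want to check what an OSD is currently running with, you can ask
>> the daemon directly (just an example; run it on the host where osd.0
>> lives):
>>
>> ceph daemon osd.0 config get filestore_merge_threshold
>> ceph daemon osd.0 config get filestore_split_multiple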
>>
>>
>> --
>> Piotr Nowosielski
>> Senior Systems Engineer
>> Zespół Infrastruktury 5
>> Grupa Allegro sp. z o.o.
>> Tel: +48 512 08 55 92
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
On Behalf
>> Of Anton Dmitriev
>> Sent: Wednesday, May 10, 2017 8:14 AM
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: All OSD fails after few requests to RGW
>>
>> Hi!
>>
>> I increased pg_num and pgp_num for pool default.rgw.buckets.data from
>> 2048 to 4096, and it seems that the situation became a bit better: the
>> cluster dies after 20-30 PUTs, not after 1. Could someone please give
>> me some recommendations on how to rescue the cluster?
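>> For reference, the increase itself was done with the usual pool-set
>> commands, roughly:
>>
>> ceph osd pool set default.rgw.buckets.data pg_num 4096
>> ceph osd pool set default.rgw.buckets.data pgp_num 4096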
>>
>> On 27.04.2017 09:59, Anton Dmitriev wrote:
>>> The cluster was doing well for a long time, but last week OSDs started
>>> to fail.
>>> We use the cluster as image storage for OpenNebula with a small load,
>>> and as object storage with a high load.
>>> Sometimes the disks of some OSDs are utilized at 100%: iostat shows
>>> avgqu-sz over 1000 while reading or writing only a few kilobytes per
>>> second, the OSDs on these disks become unresponsive, and the cluster
>>> marks them down.
>>> We lowered the load on the object storage and the situation became
>>> better.
>>>
>>> Yesterday the situation became worse:
>>> If the RGWs are disabled and there are no requests to the object
>>> storage, the cluster performs well, but if I enable the RGWs and make
>>> a few PUTs or GETs, all non-SSD OSDs on all storage nodes end up in
>>> the same situation described above.
>>> iotop shows that xfsaild/<disk> is burning the disks.
>>>
>>> trace-cmd record -e xfs\* for 10 seconds shows 10 million objects; as
>>> I understand it, that means ~360 000 objects to push per OSD in 10
>>> seconds:
>>> $ wc -l t.t
>>> 10256873 t.t
>>>
>>> fragmentation on one of these disks is about 3%
>>>
>>> more information about the cluster:
>>>
>>> https://yadi.sk/d/Y63mXQhl3HPvwt
>>>
>>> also debug logs for osd.33 while the problem occurs:
>>>
>>> https://yadi.sk/d/kiqsMF9L3HPvte
>>>
>>> debug_osd = 20/20
>>> debug_filestore = 20/20
>>> debug_tp = 20/20
>>>
>>>
>>>
>>> Ubuntu 14.04
>>> $ uname -a
>>> Linux storage01 4.2.0-42-generic
#49~14.04.1-Ubuntu SMP Wed Jun 29
>>> 20:22:11 UTC 2016 x86_64 x86_64
x86_64 GNU/Linux
>>>
>>> Ceph 10.2.7
>>>
>>> 7 storage nodes: Supermicro with 28 OSDs on 4 TB 7200 rpm JBOD disks +
>>> journals on a RAID10 of 4 Intel 3510 800 GB SSDs + 2 SSD OSDs on Intel
>>> 3710 400 GB for RGW meta and index.
>>> One of these nodes differs only in the number of OSDs: it has 26 OSDs
>>> on 4 TB instead of 28 like the others.
>>>
>>> Storage nodes connect to each other via bonded 2x10Gbit; clients
>>> connect to the storage nodes via bonded 2x1Gbit.
>>>
>>> 5 nodes have 2 x E5-2650v2 CPUs and 256 GB RAM; 2 nodes have 2 x
>>> E5-2690v3 CPUs and 512 GB RAM.
>>>
>>> 7 mons
>>> 3 rgw
>>>
>>> Please help me rescue the cluster.
>>>
>>>
>> --
>> Dmitriev Anton
>>
>
> --
> Dmitriev Anton
--
Dmitriev Anton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com