Using ceph-objectstore-tool apply-layout-settings I applied the new
layout settings on all storage nodes:
filestore_merge_threshold = 40
filestore_split_multiple = 8
I checked some directories on the OSDs: there were 1200-2000 files per
folder. With these settings a split will occur at 5120 files per folder.
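(That 5120 figure follows from the usual filestore formula, the same one
behind the 32*40*16 = 20480 calculation quoted further down in this
thread; the factor of 16 is a constant built into filestore:

    split point = filestore_split_multiple * abs(filestore_merge_threshold) * 16
                = 8 * 40 * 16 = 5120 files per folder)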
But the problem still exists: after I PUT 25-50 objects to RGW, one of
the OSD disks becomes 100% busy, iotop shows that xfsaild keeps it busy,
and the number of slow requests grows to 800-1000. After some time the
busy OSD hits the suicide timeout and restarts, and the cluster works
well until the next write to RGW.
0> 2017-05-21 12:02:26.105597 7f23bd1fa700 -1
common/HeartbeatMap.cc: In function 'bool
ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const
char*, time_t)' thread 7f23bd1fa700 time 2017-05-21
12:02:26.050994
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide
timeout")
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0x5620cbbb56db]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*,
char const*, long)+0x2b1) [0x5620cbafba91]
3: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x5620cbafc256]
4: (OSD::handle_osd_ping(MOSDPing*)+0x8e2) [0x5620cb557752]
5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x5620cb55899b]
6: (DispatchQueue::fast_dispatch(Message*)+0x76) [0x5620cbc71906]
7: (Pipe::reader()+0x1d38) [0x5620cbcaceb8]
8: (Pipe::Reader::entry()+0xd) [0x5620cbcb4a0d]
9: (()+0x8184) [0x7f244a0ea184]
10: (clone()+0x6d) [0x7f2448215bed]
NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
2017-05-21 12:02:26.161763 7f23bd1fa700 -1 *** Caught signal
(Aborted) **
in thread 7f23bd1fa700 thread_name:ms_pipe_read
On 11.05.2017 20:11, David Turner wrote:
I honestly haven't investigated the command line
structure that it would need, but that looks about what I'd
expect.
I'm on
Jewel 10.2.7
Do you mean this:
ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-${osd_num} --journal-path
/var/lib/ceph/osd/ceph-${osd_num}/journal
--log-file=/var/log/ceph/objectstore_tool.${osd_num}.log
--op apply-layout-settings --pool default.rgw.buckets.data
--debug
?
And before running it I need to stop the OSD and flush its journal?
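In other words, the full per-OSD sequence I have in mind is roughly the
following (a sketch only - we are on Ubuntu 14.04 with upstart, and the
noout flag is my assumption about good practice, not something anyone
confirmed here):

    ceph osd set noout                      # avoid rebalancing while the OSD is down
    stop ceph-osd id=${osd_num}             # 'systemctl stop ceph-osd@${osd_num}' on systemd hosts
    ceph-osd -i ${osd_num} --flush-journal  # flush the journal before offline work

    ceph-objectstore-tool \
        --data-path /var/lib/ceph/osd/ceph-${osd_num} \
        --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
        --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
        --op apply-layout-settings --pool default.rgw.buckets.data

    start ceph-osd id=${osd_num}            # bring the OSD back in
    ceph osd unset noout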
On 11.05.2017 14:52, David Turner wrote:
If you are on the current release of Ceph,
Hammer 0.94.10 or Jewel 10.2.7, you have it already. I
don't remember which release it came out in, but it's
definitely in the current releases.
"recent
enough version of the ceph-objectstore-tool" - that
sounds very interesting. Will it be released in
one of the next Jewel minor releases?
On 10.05.2017 19:03, David Turner wrote:
PG subfolder splitting is the
primary reason people are going to be deploying
Luminous and Bluestore much faster than any
other major release of Ceph. Bluestore removes
the concept of subfolders in PGs.
I have had clusters that reached what
seemed to be a hardcoded maximum of 12,800 objects
in a subfolder. It would take an
osd_heartbeat_grace of 240 or 300 to let them
finish splitting their subfolders without
being marked down. Recently I came across a
cluster that had a setting of 240 objects per
subfolder before splitting, so it was
splitting all the time, and several of the
OSDs took longer than 30 seconds to finish
splitting into subfolders. That led to more
problems as we started adding backfilling on
top of everything, and we lost a significant
amount of throughput on the cluster.
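(If anyone needs to raise that grace, the setting itself is nothing
exotic - a minimal sketch; whether you put it in ceph.conf or inject it
at runtime, and the exact value, are judgement calls rather than
anything tested in this thread:

    # ceph.conf, in [global] so both the OSDs and the mons honour the
    # longer grace period
    [global]
        osd heartbeat grace = 300

    # or temporarily at runtime
    ceph tell osd.* injectargs '--osd_heartbeat_grace 300'
)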
I have yet to manage a cluster with a
recent enough version of the
ceph-objectstore-tool (hopefully I'll have one
this month) that includes the ability to take
an OSD offline, split the subfolders, then
bring it back online. If you set up a way to
monitor how big your subfolders are getting,
you can leave the ceph settings as high as you
want, and then go in and perform maintenance
on your cluster one failure domain at a time,
splitting all of the PG subfolders on the
OSDs. This approach would prevent this from
ever happening in the wild.
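A crude way to do that monitoring, just as a sketch (the path layout is
the standard filestore one; how you report on it is up to you):

    # count the files directly contained in each DIR_* subfolder of one
    # OSD's filestore and show the fullest ones
    find /var/lib/ceph/osd/ceph-${osd_num}/current -mindepth 2 -type d -name 'DIR_*' |
    while read -r d; do
        printf '%s %s\n' "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"
    done | sort -rn | head

Compare those counts against whatever your configured split point works
out to.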
It is difficult for me to say clearly why some PGs have not been
migrated. Crushmap settings? The weight of the OSDs?
One thing is certain: you will not find any information about the split
process in the logs ...
pn
-----Original Message-----
From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
Sent: Wednesday, May 10, 2017 10:14 AM
To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
ceph-users@xxxxxxxxxxxxxx
Subject: Re: All OSD fails after few requests to RGW
When I created the cluster I made a mistake in the configuration and
set the split parameter to 32 and merge to 40, so 32*40*16 = 20480
files per folder. After that I changed split to 8 and increased the
number of PGs and PGPs from 2048 to 4096 for the pool where the problem
occurs. While it was backfilling I observed that placement groups were
backfilling from one set of 3 OSDs to another set of 3 OSDs (replicated
size = 3), so I concluded that PGs are completely recreated when pg_num
and pgp_num are increased for a pool, and that after this process the
number of files per directory should be OK. But when backfilling
finished I found many directories in this pool with ~20 000 files. Why
did increasing the PG number not help? Or maybe after this process some
files will be deleted with some delay?
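(The increase itself was just the usual pool resize, done roughly like
this - quoting from memory rather than pasting my shell history:

    ceph osd pool set default.rgw.buckets.data pg_num 4096
    # wait for the new PGs to be created, then
    ceph osd pool set default.rgw.buckets.data pgp_num 4096
)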
I couldn't find any information about the directory split process in
the logs, even with osd and filestore debug at 20. What pattern, and in
which log, do I need to grep for to find it?
On 10.05.2017 10:36, Piotr Nowosielski wrote:
> You can:
> - change these parameters and use ceph-objectstore-tool
> - add an OSD host - rebuilding the cluster will reduce the number of
>   files in the directories
> - wait until the "split" operations are over ;-)
>
> In our case, we could afford to wait until the "split" operation is
> over (we have 2 clusters in slightly different configurations storing
> the same data)
>
> hint:
> When creating a new pool, use the parameter "expected_num_objects"
> https://www.suse.com/documentation/ses-4/book_storage_admin/data/ceph_pools_operate.html
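> For illustration only, a sketch of what that looks like - the pg
> counts, ruleset name and object count below are placeholders to be
> replaced with your own values:
>
>     ceph osd pool create default.rgw.buckets.data 4096 4096 replicated \
>         replicated_ruleset 400000000
>
> so the pool's directories can be pre-split at creation time instead of
> being split later under load.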
>
> Piotr Nowosielski
> Senior Systems Engineer
> Zespół Infrastruktury 5
> Grupa Allegro sp. z o.o.
> Tel: +48 512 08 55 92
>
>
> -----Original Message-----
> From: Anton Dmitriev [mailto:tech@xxxxxxxxxx]
> Sent: Wednesday, May 10, 2017 9:19 AM
> To: Piotr Nowosielski <piotr.nowosielski@xxxxxxxxxxxxxxxx>;
> ceph-users@xxxxxxxxxxxxxx
> Subject: Re: All OSD fails after few requests to RGW
>
> How did you solve it? Did you set new split/merge thresholds and then
> manually apply them by running
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num}
> --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal
> --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log
> --op apply-layout-settings --pool default.rgw.buckets.data
>
> on each OSD?
>
> How can I see in the logs that a split occurs?
>
> On 10.05.2017 10:13, Piotr Nowosielski wrote:
>> Hey,
>> We had similar problems. Look for information on "Filestore merge and
>> split".
>>
>> Some explanation:
>> After reaching a certain number of files in a directory (it depends
>> on the 'filestore merge threshold' and 'filestore split multiple'
>> parameters), the OSD rebuilds the structure of that directory.
>> If files keep arriving, the OSD creates new subdirectories and moves
>> some of the files there.
>> If files are removed, the OSD will reduce the number of
>> subdirectories.
>>
>>
>> --
>> Piotr Nowosielski
>> Senior Systems Engineer
>> Zespół Infrastruktury 5
>> Grupa Allegro sp. z o.o.
>> Tel: +48 512 08 55 92
>>
>> Grupa Allegro Sp. z o.o., with its registered office in Poznań,
>> 60-166 Poznań, ul. Grunwaldzka 182, entered in the register of
>> entrepreneurs kept by the District Court Poznań - Nowe Miasto i
>> Wilda, 8th Commercial Division of the National Court Register, under
>> KRS number 0000268796, with share capital of PLN 33,976,500.00, tax
>> identification number NIP: 5272525995.
>>
>>
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> Of Anton Dmitriev
>> Sent: Wednesday, May 10, 2017 8:14 AM
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: All OSD fails after few requests to RGW
>>
>> Hi!
>>
>> I increased pg_num and pgp_num for pool default.rgw.buckets.data
>> from 2048 to 4096, and it seems that the situation became a bit
>> better: the cluster dies after 20-30 PUTs, not after the first one.
>> Could someone please give me some recommendations on how to rescue
>> the cluster?
>>
>> On 27.04.2017 09:59, Anton Dmitriev wrote:
>>> The cluster was running well for a long time, but last week OSDs
>>> started to fail.
>>> We use the cluster as image storage for OpenNebula with a small
>>> load, and as object storage with a high load.
>>> Sometimes the disks of some OSDs are utilized at 100%, iostat shows
>>> avgqu-sz over 1000 while reading or writing only a few kilobytes per
>>> second, and the OSDs on those disks become unresponsive, so the
>>> cluster marks them down.
>>> We lowered the load on the object storage and the situation became
>>> better.
>>>
>>> Yesterday the situation became worse:
>>> If the RGWs are disabled and there are no requests to the object
>>> storage, the cluster performs well; but if we enable the RGWs and
>>> make a few PUTs or GETs, all non-SSD OSDs on all storage nodes end
>>> up in the same situation described above.
>>> iotop shows that xfsaild/<disk> is burning the disks.
>>>
>>> trace-cmd record -e xfs\* for 10 seconds shows 10 million objects;
>>> as I understand it, that means ~360 000 objects to push per OSD in
>>> 10 seconds.
>>> $ wc -l t.t
>>> 10256873 t.t
>>>
>>> Fragmentation on one of these disks is about 3%.
>>>
>>> More information about the cluster:
>>>
>>> https://yadi.sk/d/Y63mXQhl3HPvwt
>>>
>>> Also debug logs for osd.33 while the problem occurs:
>>>
>>> https://yadi.sk/d/kiqsMF9L3HPvte
>>>
>>> debug_osd = 20/20
>>> debug_filestore = 20/20
>>> debug_tp = 20/20
>>>
>>>
>>>
>>> Ubuntu 14.04
>>> $ uname -a
>>> Linux storage01 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29
>>> 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Ceph 10.2.7
>>>
>>> 7 storage nodes: Supermicro, 28 OSDs on 4 TB 7200 rpm JBOD disks +
>>> journals on a RAID 10 of 4 Intel S3510 800 GB SSDs + 2 SSD OSDs on
>>> Intel S3710 400 GB for RGW meta and index.
>>> One of these storage nodes differs only in the number of OSDs: it
>>> has 26 OSDs on 4 TB disks instead of 28 like the others.
>>>
>>> Storage nodes connect to each other over bonded 2x10 Gbit; clients
>>> connect to the storage nodes over bonded 2x1 Gbit.
>>>
>>> 5 storage nodes have 2 x CPU E5-2650v2 and 256 GB RAM, and 2 storage
>>> nodes have 2 x CPU E5-2690v3 and 512 GB RAM.
>>>
>>> 7 mons
>>> 3 rgw
>>>
>>> Please help me rescue the cluster.
>>>
>>>
>> --
>> Dmitriev Anton
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Dmitriev Anton
--
Dmitriev Anton
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Dmitriev Anton