Re: ONE pg deep-scrub blocks cluster

On 2016-07-30 14:04, Marius Vaitiekunas wrote:
Hi,

We had a similar issue. If you use radosgw and have large buckets,
this PG could hold a bucket index.

Hello Marius,

thanks for your hint.

But it seems I forgot to mention that, for now, we are using Ceph only as RBD storage for our virtual machines.
So no radosgw.

- Mehmet


On Friday, 29 July 2016, c <ceph@xxxxxxxxxx> wrote:

Hi Christian,
Hello Bill,

thank you very much for your posts.

For the record, this ONLY happens with this PG and no others that
share the same OSDs, right?

Yes, right.

If so then we're looking at something (HDD or FS wise) that's
specific to the data of this PG.

When doing the deep-scrub, monitor (atop, etc.) all 3 nodes and
see if a particular OSD (HDD) stands out, as I would expect it to.

I have now logged all disks via atop every 2 seconds while the deep-scrub
was running ( atop -w osdXX_atop 2 ).
As you expected, all three disks were 100% busy, with a constant
~150 MB/s (osd.4), ~130 MB/s (osd.28) and ~170 MB/s (osd.16)...

- osd.4 (/dev/sdf): http://slexy.org/view/s21emd2u6j [1]
- osd.16 (/dev/sdm): http://slexy.org/view/s20vukWz5E [2]
- osd.28 (/dev/sdh): http://slexy.org/view/s20YX0lzZY [3]

You can have a look at these logs via "atop -r FILE" and jump to the
relevant time by pressing "b" and typing "17:12:31".
With "t" you can step forward and with "T" backward through the
logfile.

But what is causing this? A deep-scrub on all the other disks - same
model, ordered at the same time - does not seem to show this issue.

Bill wrote:
Removing osd.4 and still getting the scrub problems removes its
drive from consideration as the culprit.  Try the same thing
again for osd.16 and then osd.28.

Christian wrote:
Since you already removed osd.4 with the same result, continue to
cycle through the other OSDs.
Running a fsck on the (out) OSDs might be helpful, too.

Next week, I will do the following:

1.1 Remove osd.4 completely from Ceph - again (the current primary
for PG 0.223)
1.2 xfs_repair -n /dev/sdf1 (osd.4): to check for possible errors
1.3 ceph pg deep-scrub 0.223
- Log with: ceph tell osd.4,16,28 injectargs '--debug_osd 5/5'

2.1 Remove osd.16 completely from Ceph
2.2 xfs_repair -n /dev/sdm1
2.3 ceph pg deep-scrub 0.223
- Log with: ceph tell osd.4,16,28 injectargs '--debug_osd 5/5'

3.1 Remove osd.28 completely from Ceph
3.2 xfs_repair -n /dev/sdh1
3.3 ceph pg deep-scrub 0.223
- Log with: ceph tell osd.4,16,28 injectargs '--debug_osd 5/5'
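The per-OSD cycle above could be scripted roughly like this. A sketch only: it assumes the Jewel CLI and systemd unit names, nothing here is verified against this cluster, and you would wait for recovery between steps:

```shell
# Sketch of one test cycle, here for osd.4 (/dev/sdf1); repeat with
# osd.16 (/dev/sdm1) and osd.28 (/dev/sdh1).
ceph osd out 4                    # let data rebalance away from osd.4
# ...wait until "ceph -s" shows HEALTH_OK again...
systemctl stop ceph-osd@4
xfs_repair -n /dev/sdf1           # -n = no modify: only report problems
ceph tell osd.16 injectargs '--debug_osd 5/5'   # same for osd.28
ceph pg deep-scrub 0.223          # then watch the OSD logs
```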

smartctl may not show anything out of sorts until the marginally
bad sector or sectors finally go bad and get remapped.  The
only hint may be buried in the raw read error rate, seek error
rate or other error counts like ecc or crc errors.  The long test
you are running may or may not show any new information.
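To watch those raw counters over time, the relevant attributes can be filtered out of `smartctl -A` output. A small sketch - the device paths are taken from this thread, and the attribute names are the common ATA ones, which may differ per drive model:

```shell
# Keep only the error-rate attributes Bill mentioned from `smartctl -A`
# output: column 2 is the attribute name, the last column its raw value.
filter_smart() {
  awk '/Raw_Read_Error_Rate|Seek_Error_Rate|Reallocated_Sector_Ct|UDMA_CRC_Error_Count/ {print $2, $NF}'
}

# Intended use (needs root; /dev/sdf, /dev/sdm, /dev/sdh from this thread):
#   for dev in /dev/sdf /dev/sdm /dev/sdh; do
#     echo "== $dev =="; smartctl -A "$dev" | filter_smart
#   done
```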

I will write you again next week when I have done the tests above.

- Mehmet

On 2016-07-29 03:05, Christian Balzer wrote:
Hello,

On Thu, 28 Jul 2016 14:46:58 +0200 c wrote:

Hello Ceph alikes :)

I have a strange issue with one PG (0.223) in combination with
"deep-scrub".

Whenever Ceph - or I manually - run "ceph pg deep-scrub 0.223",
this leads to so many slow/blocked requests that nearly all of my
VMs stop working for a while.
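(Not part of this thread, but as a stop-gap while debugging: Jewel has options to throttle scrub I/O so client requests are not starved. The option names exist in Jewel; the values below are only illustrative.)

```shell
# Sleep briefly between scrub chunks, and run the OSD disk thread at
# idle I/O priority (the ioprio options take effect with CFQ scheduling).
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
```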

For the record, this ONLY happens with this PG and no others that
share the same OSDs, right?

If so then we're looking at something (HDD or FS wise) that's
specific to the data of this PG.

When doing the deep-scrub, monitor (atop, etc.) all 3 nodes and see
if a particular OSD (HDD) stands out, as I would expect it to.

Since you already removed osd.4 with the same result, continue to
cycle through the other OSDs.
Running a fsck on the (out) OSDs might be helpful, too.

Christian

This happens only to this one PG 0.223 and in combination with
deep-scrub (!). All other placement groups where a deep-scrub
occurs are fine. The mentioned PG also works fine when a "normal
scrub" occurs.

These OSDs are involved:

#> ceph pg map 0.223
osdmap e7047 pg 0.223 (0.223) -> up [4,16,28] acting [4,16,28]
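Since the trouble is specific to this PG's data, it may also be worth looking at the PG's objects directly on disk. A sketch, assuming FileStore (the Jewel default) and the standard OSD mount point - the path is an assumption, adjust it to your layout:

```shell
# List the five largest object files of PG 0.223 on the primary (osd.4);
# a single huge or damaged object could explain a slow deep-scrub.
find /var/lib/ceph/osd/ceph-4/current/0.223_head -type f \
  -printf '%s %p\n' | sort -n | tail -5
```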

*The LogFiles*

"deep-scrub" starts @ 2016-07-28 12:44:00.588542 and takes
approximately 12 minutes (end: 2016-07-28 12:56:31.891165)
- ceph.log: http://pastebin.com/FSY45VtM [4]

I have done " ceph tell osd.N injectargs '--debug_osd 5/5' " for the
related OSDs 4, 16 and 28

LogFile - osd.4
- ceph-osd.4.log: http://slexy.org/view/s20zzAfxFH [5]

LogFile - osd.16
- ceph-osd.16.log: http://slexy.org/view/s25H3Zvkb0 [6]

LogFile - osd.28
- ceph-osd.28.log: http://slexy.org/view/s21Ecpwd70 [7]

I have checked the disks of OSDs 4, 16 and 28 with smartctl and could
not find any issues - also there are no odd "dmesg" messages.

*ceph -s*
     cluster 98a410bf-b823-47e4-ad17-4543afa24992
      health HEALTH_OK
      monmap e2: 3 mons at {monitor1=172.16.0.2:6789/0,monitor3=172.16.0.4:6789/0,monitor2=172.16.0.3:6789/0}
             election epoch 38, quorum 0,1,2 monitor1,monitor2,monitor3
      osdmap e7047: 30 osds: 30 up, 30 in
             flags sortbitwise
       pgmap v3253519: 1024 pgs, 1 pools, 2858 GB data, 692 kobjects
             8577 GB used, 96256 GB / 102 TB avail
                 1024 active+clean
   client io 396 kB/s rd, 3141 kB/s wr, 55 op/s rd, 269 op/s wr

This is my Setup:

*Software/OS*

- Jewel
#> ceph tell osd.* version | grep version | uniq
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
#> ceph tell mon.* version
[...] ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

- Ubuntu 16.04 LTS on all OSD and MON servers
#> uname -a
Linux galawyn 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

*Server*

3x OSD Server, each with
- 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 cores, no Hyper-Threading
- 64GB RAM
- 10x 4TB HGST 7K4000 SAS2 (6 Gb/s) disks as OSDs
- 1x Intel SSDPEDMD400G4 (Intel DC P3700 NVMe) as journaling device for the 10 disks
- 1x Samsung SSD 840/850 Pro only for the OS

3x MON Server
- Two of them with 1x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz (4 cores, 8 threads)
- The third one has 2x Intel(R) Xeon(R) CPU L5430 @ 2.66GHz ==> 8 cores, no Hyper-Threading
- 32 GB RAM
- 1x RAID 10 (4 disks)

*Network*
- Each server and client has one active 10 GbE connection; a second
10 GbE link is connected but serves only as a backup path when the
active switch fails - no LACP possible.
- We do not use jumbo frames yet.
- Both public and cluster-network Ceph traffic goes through this one
active 10 GbE interface on each server.

Any ideas what is going on?
Can I provide more input to find a solution?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [9]

--

Marius Vaitiekūnas


Links:
------
[1] http://slexy.org/view/s21emd2u6j
[2] http://slexy.org/view/s20vukWz5E
[3] http://slexy.org/view/s20YX0lzZY
[4] http://pastebin.com/FSY45VtM
[5] http://slexy.org/view/s20zzAfxFH
[6] http://slexy.org/view/s25H3Zvkb0
[7] http://slexy.org/view/s21Ecpwd70
[9] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



