Hello Alexey,
Thank you for your mail - my answers are inline :)
On 2016-09-08 16:24, Alexey Sheplyakov wrote:
Hi,
> root@:~# ceph-osd -i 12 --flush-journal
> SG_IO: questionable sense data, results may be incorrect
> SG_IO: questionable sense data, results may be incorrect
As far as I understand, these lines are an hdparm warning (the OSD uses
the hdparm command to query the journal device's write cache state).
The message means hdparm is unable to reliably figure out whether the
drive write cache is enabled. This might indicate a hardware problem.
I guess this has to do with the NVMe device (Intel DC P3700 NVMe)
which is used for journaling.
So... is this normal behavior for "ceph-osd -i 12 --flush-journal"?
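(As a side note: if it helps, the write cache state can also be checked
by hand. A rough sketch - assuming hdparm and nvme-cli are installed;
the device names below are placeholders, not my actual devices:

root@:~# hdparm -W /dev/sdX                       # SATA/SAS: prints "write-caching = 0/1"
root@:~# nvme get-feature /dev/nvme0 -f 0x06 -H   # NVMe: Volatile Write Cache feature

Since hdparm only speaks ATA, getting questionable sense data from an
NVMe journal device does not seem too surprising to me.)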
I think it's a good idea to
a) check the journal drive (smartctl),
The disks are all fine - I checked them 2-3 weeks ago.
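(For the NVMe journal device itself a quick re-check could look roughly
like this - assuming smartmontools with NVMe support (>= 6.5) and that
/dev/nvme0 is the journal device, which is just my guess at the path:

root@:~# smartctl -x -d nvme /dev/nvme0
)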
b) capture a more verbose log,
i.e. add this to ceph.conf
[osd]
debug filestore = 20/20
debug journal = 20/20
and try flushing the journal once more (note: this won't fix the
problem, the point is to get a useful log)
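(If I am not mistaken, the same debug settings can also be passed
directly on the command line for a one-off run, since the OSD is
stopped anyway - a sketch, not something I have verified:

root@:~# ceph-osd -i 10 --flush-journal --debug_filestore=20/20 --debug_journal=20/20
)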
I flushed the journal at ~09:55:26 today and got these lines:
root@:~# ceph-osd -i 10 --flush-journal
SG_IO: questionable sense data, results may be incorrect
SG_IO: questionable sense data, results may be incorrect
*** Caught signal (Segmentation fault) **
in thread 7f38a2ecf700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x560356296dde]
2: (()+0x113d0) [0x7f38a81b03d0]
3: [0x560360f79f00]
2016-09-09 09:55:26.446925 7f38a2ecf700 -1 *** Caught signal
(Segmentation fault) **
in thread 7f38a2ecf700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x560356296dde]
2: (()+0x113d0) [0x7f38a81b03d0]
3: [0x560360f79f00]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
0> 2016-09-09 09:55:26.446925 7f38a2ecf700 -1 *** Caught signal
(Segmentation fault) **
in thread 7f38a2ecf700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x560356296dde]
2: (()+0x113d0) [0x7f38a81b03d0]
3: [0x560360f79f00]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Segmentation fault
This is the current logfile for osd.10:
- http://slexy.org/view/s21lhpkLGQ
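(In case it helps with decoding the backtrace: my understanding is that
the offsets, e.g. 0x96bdde, can be resolved once the matching debug
symbols are installed - roughly like this, assuming the ceph-osd-dbg
package for 10.2.2 is available from the repo; I have not tried it yet:

root@:~# apt-get install ceph-osd-dbg
root@:~# addr2line -Cfe /usr/bin/ceph-osd 0x96bdde
)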
By the way:
I ran "ceph osd set noout" before stopping the OSD and flushing the journal.
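(For completeness, the full bracket per OSD looks roughly like this,
with noout unset again once the OSD is back up:

root@:~# ceph osd set noout
root@:~# systemctl stop ceph-osd@10
root@:~# ceph-osd -i 10 --flush-journal
root@:~# systemctl start ceph-osd@10
root@:~# ceph osd unset noout
)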
Hope this is useful for you!
- Mehmet
Best regards,
Alexey
On Wed, Sep 7, 2016 at 6:48 PM, Mehmet <ceph@xxxxxxxxxx> wrote:
Hey again,
now I have stopped my osd.12 via
root@:~# systemctl stop ceph-osd@12
and when I flush the journal...
root@:~# ceph-osd -i 12 --flush-journal
SG_IO: questionable sense data, results may be incorrect
SG_IO: questionable sense data, results may be incorrect
*** Caught signal (Segmentation fault) **
in thread 7f421d49d700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x564545e65dde]
2: (()+0x113d0) [0x7f422277e3d0]
3: [0x56455055a3c0]
2016-09-07 17:42:58.128839 7f421d49d700 -1 *** Caught signal
(Segmentation fault) **
in thread 7f421d49d700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x564545e65dde]
2: (()+0x113d0) [0x7f422277e3d0]
3: [0x56455055a3c0]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
0> 2016-09-07 17:42:58.128839 7f421d49d700 -1 *** Caught signal
(Segmentation fault) **
in thread 7f421d49d700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x564545e65dde]
2: (()+0x113d0) [0x7f422277e3d0]
3: [0x56455055a3c0]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Segmentation fault
The logfile with further information
- http://slexy.org/view/s2T8AohMfU [4]
I guess I will get the same message when I flush the other journals.
- Mehmet
On 2016-09-07 13:23, Mehmet wrote:
Hello ceph people,
yesterday I stopped one of my OSDs via
root@:~# systemctl stop ceph-osd@10
and tried to flush the journal for this OSD via
root@:~# ceph-osd -i 10 --flush-journal
but got this output on the screen:
SG_IO: questionable sense data, results may be incorrect
SG_IO: questionable sense data, results may be incorrect
*** Caught signal (Segmentation fault) **
in thread 7fd846333700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x55f33b862dde]
2: (()+0x113d0) [0x7fd84b6143d0]
3: [0x55f345bbff80]
2016-09-06 22:12:51.850739 7fd846333700 -1 *** Caught signal
(Segmentation fault) **
in thread 7fd846333700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x55f33b862dde]
2: (()+0x113d0) [0x7fd84b6143d0]
3: [0x55f345bbff80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
0> 2016-09-06 22:12:51.850739 7fd846333700 -1 *** Caught signal
(Segmentation fault) **
in thread 7fd846333700 thread_name:ceph-osd
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x55f33b862dde]
2: (()+0x113d0) [0x7fd84b6143d0]
3: [0x55f345bbff80]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Segmentation fault
This is the logfile from my osd.10 with further information
- http://slexy.org/view/s21tfwQ1fZ [1]
Today I stopped another OSD (osd.11)
root@:~# systemctl stop ceph-osd@11
I did not get the above-mentioned error - but this:
root@:~# ceph-osd -i 11 --flush-journal
SG_IO: questionable sense data, results may be incorrect
SG_IO: questionable sense data, results may be incorrect
2016-09-07 13:19:39.729894 7f3601a298c0 -1 flushed journal
/var/lib/ceph/osd/ceph-11/journal for object store
/var/lib/ceph/osd/ceph-11
This is the logfile from my osd.11 with further information
- http://slexy.org/view/s2AlEhV38m [2]
This is not really an issue for me, because I will set up the journal
partitions again with 20GB (instead of the current 5GB) and then bring
the OSDs back up (roughly as sketched below).
But I thought I should mail this error to the mailing list.
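(Roughly, the resize per OSD should look like this - a sketch, I have
not done it yet:

root@:~# systemctl stop ceph-osd@11
root@:~# ceph-osd -i 11 --flush-journal
# recreate the journal partition with 20GB, then:
root@:~# ceph-osd -i 11 --mkjournal
root@:~# systemctl start ceph-osd@11
)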
This is my Setup:
*Software/OS*
- Jewel
#> ceph tell osd.* version | grep version | uniq
"version": "ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374)"
#> ceph tell mon.* version
[...] ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374)
- Ubuntu 16.04 LTS on all OSD and MON Server
#> uname -a
31.08.2016: Linux reilif 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11
18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
*Server*
3x OSD Server, each with
- 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 cores, no Hyper-Threading
- 64GB RAM
- 10x 4TB HGST 7K4000 SAS2 (6Gb/s) disks as OSDs
- 1x Intel SSDPEDMD400G4 (Intel DC P3700 NVMe) as journaling device for 10-12 disks
- 1x Samsung SSD 840/850 Pro only for the OS
3x MON Server
- Two of them with 1x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz (4 cores, 8 threads); the third one has 2x Intel(R) Xeon(R) CPU L5430 @ 2.66GHz ==> 8 cores, no Hyper-Threading
- 32 GB RAM
- 1x RAID 10 (4 disks)
*Network*
- Currently each server and client has one active connection @ 1x 1GbE;
in the short term this will be changed to 2x 10GbE fibre, perhaps with
LACP when possible.
- We do not use jumbo frames yet.
- Public and cluster-network Ceph traffic currently goes through this
one active 1GbE interface on each server.
hf
- Mehmet
Links:
------
[1] http://slexy.org/view/s21tfwQ1fZ
[2] http://slexy.org/view/s2AlEhV38m
[3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[4] http://slexy.org/view/s2T8AohMfU
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com