Re: Jewel upgrade and feature set mismatch

Shain Miley <smiley@xxxxxxx> · Wed, 24 May 2017 10:27:54 -0400

Hi,

Thanks for all your help so far...very useful information indeed.

Here is the debug output from the file you referenced below:

root@rbd1:/sys/kernel/debug/ceph/504b5794-34bd-44e7-a8c3-0494cf800c23.client67751889# cat osdc
2311    osd144  3.1347f3bc      rb.0.25f2ab0.238e1f29.000000000000      read
14216   osd65   3.bd82049c      rb.0.1ae3061.238e1f29.000000000000      read
14391   osd44   3.875890a0      rb.0.fe307e.238e1f29.000000393889       set-alloc-hint,write
14560   osd61   3.1ab27784      rb.0.17d451c.238e1f29.000000131308      set-alloc-hint,write
14561   osd33   3.cc377593      rb.0.e411a0.238e1f29.0000001e007b       set-alloc-hint,write
14568   osd192  3.1b4f6fbd      rb.0.113e639.238e1f29.000000393a11      set-alloc-hint,write
15319   osd192  3.b61f59fd      npr_archive_library_img.rbd     942122'299183126872064  watch
15320   osd100  3.2d0fc3c8      npr_archive_music_img.rbd       365920'299183126872064  watch
15321   osd108  3.93b6741d      npr_archive_multimedia_img.rbd  836232'299183126872064  watch
15322   osd64   3.27bf5fe       npr_archive_online_production_img.rbd   945218'299183126872064  watch
15323   osd154  3.1ca3def1      npr_archive_design_img.rbd      359827'299183126872064  watch
15324   osd161  3.edeaca14      npr_archive_orpheus_img.rbd     871904'299183126872064  watch

Do you think those 4 write operations are enough to make me think twice 
about a reboot?
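
For anyone who wants to tally a listing like the one above rather than count by eye, here is a minimal Python sketch. It assumes the column layout shown in this message (numeric tid first, comma-separated op names last); the osdc format differs between kernel versions, so treat the parsing as an assumption. The function name summarize_osdc is hypothetical and is not part of any Ceph tooling.

```python
from collections import Counter

# Sample taken from the osdc listing in this message, one request per line.
SAMPLE = """\
2311    osd144  3.1347f3bc      rb.0.25f2ab0.238e1f29.000000000000      read
14216   osd65   3.bd82049c      rb.0.1ae3061.238e1f29.000000000000      read
14391   osd44   3.875890a0      rb.0.fe307e.238e1f29.000000393889       set-alloc-hint,write
14560   osd61   3.1ab27784      rb.0.17d451c.238e1f29.000000131308      set-alloc-hint,write
14561   osd33   3.cc377593      rb.0.e411a0.238e1f29.0000001e007b       set-alloc-hint,write
14568   osd192  3.1b4f6fbd      rb.0.113e639.238e1f29.000000393a11      set-alloc-hint,write
15319   osd192  3.b61f59fd      npr_archive_library_img.rbd     942122'299183126872064  watch
15320   osd100  3.2d0fc3c8      npr_archive_music_img.rbd       365920'299183126872064  watch
15321   osd108  3.93b6741d      npr_archive_multimedia_img.rbd  836232'299183126872064  watch
15322   osd64   3.27bf5fe       npr_archive_online_production_img.rbd   945218'299183126872064  watch
15323   osd154  3.1ca3def1      npr_archive_design_img.rbd      359827'299183126872064  watch
15324   osd161  3.edeaca14      npr_archive_orpheus_img.rbd     871904'299183126872064  watch
"""


def summarize_osdc(text):
    """Count outstanding requests by kind: write, watch, read, or other."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with a numeric tid.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        # Assumption: the last column holds the comma-separated op names.
        ops = fields[-1].split(",")
        if "write" in ops:
            counts["write"] += 1
        elif "watch" in ops:
            counts["watch"] += 1
        elif "read" in ops:
            counts["read"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    for kind, n in sorted(summarize_osdc(SAMPLE).items()):
        print(f"{kind}: {n}")
```

On a live client the same function could be fed the contents of each /sys/kernel/debug/ceph/*/osdc file. The write count is the number to watch, since those are the requests that have not yet been acknowledged by an OSD.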

Thanks again,

Shain

On 05/24/2017 08:31 AM, Ilya Dryomov wrote:
On Wed, May 24, 2017 at 1:47 PM, Shain Miley <SMiley@xxxxxxx> wrote:
Hello,
We just upgraded from Hammer to Jewel, and after the cluster once again
reported a healthy state I set the crush tunables to ‘optimal’ (from
legacy).
12 hours later and the cluster is almost done with the pg remapping under
the new rules.

The issue I am having is the server where we mount the krbd images is
showing errors in the kern.log:

May 24 07:28:14 rbd1 kernel: [5600763.226208] libceph: osd192 10.35.1.235:6844 feature set mismatch, my 2b84a042a42 < server's 40002b84a042a42, missing 400000000000000

And I can no longer list any of the mounted filesystems or unmap the rbd
images, etc.
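
The three hex values in that log line are feature bitmasks: "missing" is the set of bits the server has and the client does not. A minimal Python sketch of that arithmetic follows; the function name missing_feature_bits is hypothetical, and the only inputs are the values quoted from the log. It shows that a single bit, bit 58, is missing. As far as I know that bit corresponds to CRUSH_TUNABLES5 in the Ceph sources, which would fit the 4.5 kernel requirement mentioned below, but treat that mapping as an assumption to verify.

```python
def missing_feature_bits(mine_hex, server_hex):
    """Return the indices of feature bits set on the server but not on the client."""
    mine = int(mine_hex, 16)
    server = int(server_hex, 16)
    missing = server & ~mine  # bits the server has that the client lacks
    return [bit for bit in range(missing.bit_length()) if (missing >> bit) & 1]


if __name__ == "__main__":
    # Values copied from the kern.log line in this message.
    bits = missing_feature_bits("2b84a042a42", "40002b84a042a42")
    for bit in bits:
        print(f"missing feature bit {bit} (mask {1 << bit:x})")
```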

My options seem to be:

1) set the tunables back to legacy and see if the rbd server starts
responding.

2) upgrade the kernel on the rbd server to at least version 4.5 (currently
using 3.18 on Ubuntu 14.04).

As per [1], I'd recommend upgrading to 4.9.z.

3) disable some features on our current images?

This is the "cluster" feature bit, not the image feature.  Don't enable
new image features just because they are there in jewel though ;)

I would like to try option 2 first…but I am wondering if it is safe to reboot
the server with the rbd images still mapped…is there any chance of data loss
from an rbd image getting corrupted?

Take a look at /sys/kernel/debug/ceph/*/osdc.  If it's empty, there are
no in-flight requests and you should be able to cold reboot safely.  If
there are a lot of pending requests, the safest option is to revert the
tunables setting.

[1] http://docs.ceph.com/docs/master/start/os-recommendations/

Thanks,

                 Ilya

--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649
