Re: Luminous RC feedback - device classes and osd df weirdness

On 20/07/17 13:14, Sage Weil wrote:

On Thu, 20 Jul 2017, Mark Kirkwood wrote:
On 20/07/17 10:46, Mark Kirkwood wrote:
On 20/07/17 02:53, Sage Weil wrote:
On Wed, 19 Jul 2017, Mark Kirkwood wrote:

One (I think) new thing compared to 12.1.0 is that restarting the services
blitzes the modified crushmap, and we get back to:

$ sudo ceph osd tree
ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
-1       0.32996 root default
-2       0.08199     host ceph1
   0   hdd 0.02399         osd.0       up  1.00000 1.00000
   4   hdd 0.05699         osd.4       up  1.00000 1.00000
-3       0.08299     host ceph2
   1   hdd 0.02399         osd.1       up  1.00000 1.00000
   5   hdd 0.05899         osd.5       up  1.00000 1.00000
-4       0.08199     host ceph3
   2   hdd 0.02399         osd.2       up  1.00000 1.00000
   6   hdd 0.05699         osd.6       up  1.00000 1.00000
-5       0.08299     host ceph4
   3   hdd 0.02399         osd.3       up  1.00000 1.00000
   7   hdd 0.05899         osd.7       up  1.00000 1.00000

...and all the PGs are remapped again. Now I might have just missed this
happening with 12.1.0 - but I'm (moderately) confident that I did restart
stuff and did not see this happening. For now I've added:

osd crush update on start = false

to my ceph.conf to avoid being caught by this.
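(For reference, that just goes as a plain option in the [osd] section of ceph.conf on each OSD host - [global] would presumably work too:)

[osd]
    # stop OSDs from updating their crush location on start
    osd crush update on start = false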
Actually, setting the above does *not* prevent the crushmap from getting changed.

Can you share the output of 'ceph osd metadata 0' vs 'ceph osd metadata
4'?  I'm not sure why it's getting the class wrong.  I haven't seen this
on my cluster (it's bluestore; maybe that's the difference).


Yes, and it is quite interesting: osd 0 is filestore on hdd, osd 4 is
bluestore on ssd, but (see below) the metadata suggests ceph thinks it is hdd
(the fact that the hosts are VMs might not be helping here):

$ sudo ceph osd metadata 0
{
     "id": 0,
     "arch": "x86_64",
     "back_addr": "192.168.122.21:6806/1712",
     "backend_filestore_dev_node": "unknown",
     "backend_filestore_partition_path": "unknown",
     "ceph_version": "ceph version 12.1.1
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
     "cpu": "QEMU Virtual CPU version 1.7.0",
     "distro": "ubuntu",
     "distro_description": "Ubuntu 16.04.2 LTS",
     "distro_version": "16.04",
     "filestore_backend": "xfs",
     "filestore_f_type": "0x58465342",
     "front_addr": "192.168.122.21:6805/1712",
     "hb_back_addr": "192.168.122.21:6807/1712",
     "hb_front_addr": "192.168.122.21:6808/1712",
     "hostname": "ceph1",
     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
     "kernel_version": "4.4.0-83-generic",
     "mem_swap_kb": "1047548",
     "mem_total_kb": "2048188",
     "os": "Linux",
     "osd_data": "/var/lib/ceph/osd/ceph-0",
     "osd_journal": "/var/lib/ceph/osd/ceph-0/journal",
     "osd_objectstore": "filestore",
     "rotational": "1"
}

$ sudo ceph osd metadata 4
{
     "id": 4,
     "arch": "x86_64",
     "back_addr": "192.168.122.21:6802/1488",
     "bluefs": "1",
     "bluefs_db_access_mode": "blk",
     "bluefs_db_block_size": "4096",
     "bluefs_db_dev": "253:32",
     "bluefs_db_dev_node": "vdc",
     "bluefs_db_driver": "KernelDevice",
     "bluefs_db_model": "",
     "bluefs_db_partition_path": "/dev/vdc2",
     "bluefs_db_rotational": "1",
     "bluefs_db_size": "63244840960",
     "bluefs_db_type": "hdd",
     "bluefs_single_shared_device": "1",
     "bluestore_bdev_access_mode": "blk",
     "bluestore_bdev_block_size": "4096",
     "bluestore_bdev_dev": "253:32",
     "bluestore_bdev_dev_node": "vdc",
     "bluestore_bdev_driver": "KernelDevice",
     "bluestore_bdev_model": "",
     "bluestore_bdev_partition_path": "/dev/vdc2",
     "bluestore_bdev_rotational": "1",
     "bluestore_bdev_size": "63244840960",
     "bluestore_bdev_type": "hdd",
     "ceph_version": "ceph version 12.1.1
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
     "cpu": "QEMU Virtual CPU version 1.7.0",
     "distro": "ubuntu",
     "distro_description": "Ubuntu 16.04.2 LTS",
     "distro_version": "16.04",
     "front_addr": "192.168.122.21:6801/1488",
     "hb_back_addr": "192.168.122.21:6803/1488",
     "hb_front_addr": "192.168.122.21:6804/1488",
     "hostname": "ceph1",
     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
     "kernel_version": "4.4.0-83-generic",
     "mem_swap_kb": "1047548",
     "mem_total_kb": "2048188",
     "os": "Linux",
     "osd_data": "/var/lib/ceph/osd/ceph-4",
     "osd_journal": "/var/lib/ceph/osd/ceph-4/journal",
     "osd_objectstore": "bluestore",
     "rotational": "1"
}


I note that /sys/block/vdc/queue/rotational is 1, so this looks like libvirt
is being dense about the virtual disk... if I cat '0' into that file then the
osd restarts *do not* blitz the crushmap anymore - so it looks like the previous
behaviour is brought on by my use of VMs - I'll try mashing it with a udev
rule to get the 0 in there :-)
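Roughly what I have in mind (untested - the rule file name is just my guess, and it assumes the ssd-backed virtual disk is vdc on every host, which I've only checked on ceph1):

$ # one-off fix for the running system
$ echo 0 | sudo tee /sys/block/vdc/queue/rotational

# /etc/udev/rules.d/99-virtio-nonrotational.rules
# mark the ssd-backed virtio disk as non-rotational so the osd detects it as ssd
ACTION=="add|change", KERNEL=="vdc", ATTR{queue/rotational}="0"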

It is possibly worth a doco note about how this detection works at the
Ceph level, just in case there are some weird SSD firmwares out there that
result in the flag being set wrong in bare-metal environments.
Yeah.  Note that there is also a PR in flight streamlining some of the
device class code that could potentially make the class more difficult to
change, specifically to avoid situations like this.  For example, if the
class is already set, the update on OSD start could be a no-op (do not
change), and only set it if there is no class at all.  To change it, you
would (from the cli) run 'ceph osd crush rm-device-class osd.0' and then
'ceph osd crush set-device-class osd.0' (or restart the osd, or perhaps
pass a --force flag to set-device-class).  Does that seem like a reasonable
path?
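Concretely, the manual override would then look something like this (assuming set-device-class keeps taking the class name followed by the OSD ids):

$ ceph osd crush rm-device-class osd.4
$ ceph osd crush set-device-class ssd osd.4

That way an accidental class change would need an explicit two-step action rather than happening implicitly on OSD restart.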

I'm mostly worried about future changes to our auto-detect class logic.
If we change anything (intentionally or not) we don't want to trigger a
ton of data rebalancing on OSD restart because the class changes from,
say, 'ssd' to 'nvme' due to improved detection logic.



Yeah, some resistance to change once a class is set sounds like a good plan!

regards

Mark



