Hi all,

I'm not sure whether this bug belongs to OpenStack or to Ceph, but I'm writing here in the humble hope that someone else has run into it too.

I have configured a test OpenStack installation with Glance images stored in Ceph 0.94.3; Nova uses local storage. When I try to launch an instance from a large image stored in Ceph, it fails to spawn with the following error in nova-conductor.log:

  2015-09-04 11:52:35.076 3605449 ERROR nova.scheduler.utils [req-c6af3eca-f166-45bd-8edc-b8cfadeb0d0b 82c1f134605e4ee49f65015dda96c79a 448cc6119e514398ac2793d043d4fa02 - - -] [instance: 18c9f1d5-50e8-426f-94d5-167f43129ea6] Error from last host: slpeah005 (node slpeah005.cloud):
  [u'Traceback (most recent call last):\n',
   u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2220, in _do_build_and_run_instance\n    filter_properties)\n',
   u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2363, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n',
   u'RescheduledException: Build of instance 18c9f1d5-50e8-426f-94d5-167f43129ea6 was re-scheduled: [Errno 32] Corrupt image download. Checksum was 625d0686a50f6b64e57b1facbc042248 expected 4a7de2fbbd01be5c6a9e114df145b027\n']

Nova tries three different hosts, gets the same error on every single one, and then fails to spawn the instance. A small CirrOS image boots just fine; the problem only shows up with large images, around 10 GB in size.

I also watched the /var/lib/nova/instances/_base directory and saw that the image actually is being downloaded, but at some point the download is interrupted for some unknown reason and the instance gets deleted. In syslog I found many messages like these:

  Sep 4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735094 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.22 since back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203 (cutoff 2015-09-04 12:51:32.735011)
  Sep 4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735099 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.23 since back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203 (cutoff 2015-09-04 12:51:32.735011)
  Sep 4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735104 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.24 since back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203 (cutoff 2015-09-04 12:51:32.735011)
  Sep 4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735108 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.26 since back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203 (cutoff 2015-09-04 12:51:32.735011)
  Sep 4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735118 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.27 since back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203 (cutoff 2015-09-04 12:51:32.735011)

I have also monitored the number of open file descriptors of the nova-compute process ("echo /proc/NOVA_COMPUTE_PID/fd/* | wc -w", as Jan advised earlier on this list), but it never goes above 102.

It also looks like the problem appeared only with 0.94.3; on 0.94.2 everything worked just fine!

The exact diagnostic commands I'm planning to run next are in the P.S. below, in case anyone wants to compare notes. Would be very grateful for any help!

Vasily.
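P.S. First, I want to download the image through the Glance API by hand, which is the same path nova-compute takes when it populates _base, and compare the result against the checksum Glance has on record. A rough sketch; IMAGE_ID is a placeholder for the Glance image UUID (not the instance UUID from the traceback above):

  IMAGE_ID=<glance-image-uuid>               # placeholder, fill in your own
  glance image-show $IMAGE_ID | grep checksum
  glance image-download --file /tmp/test.img $IMAGE_ID
  md5sum /tmp/test.img                       # should match the Glance checksum

If the checksums disagree here too, the corruption can be reproduced without Nova being involved at all.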
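If that manual download also comes out corrupt, reading the same data straight out of RBD should tell whether the bad bytes come from Ceph itself or from the Glance API layer. This assumes the default Glance RBD layout (pool "images", image named after its UUID, snapshot called "snap"); adjust the names if your setup differs:

  # Stream the image snapshot to stdout ("-") and checksum it.
  rbd -p images export $IMAGE_ID@snap - | md5sum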
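Second, the file-descriptor check from Jan's earlier mail, wrapped in a loop so the count can be watched for the whole time the instance is spawning instead of being sampled once:

  PID=$(pgrep -of nova-compute)    # oldest process matching the name
  while sleep 1; do
      echo "$(date '+%H:%M:%S') $(ls /proc/$PID/fd | wc -l)"
  done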
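Finally, given the heartbeat_check noise and the fact that 0.94.2 worked, I want to rule out a partly upgraded cluster and keep an eye on cluster health while a download is in flight. Something along these lines (the "daemon" command has to be run on the host that carries the OSD in question):

  ceph -s                     # overall cluster state during the download
  ceph health detail          # names the OSDs involved in any flapping
  ceph --version              # version of the locally installed binaries
  ceph daemon osd.3 version   # version of the running daemon, via its admin socket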