On Mon, Oct 15, 2018 at 4:04 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> We turned on all the RBD v2 features while running Jewel; since then all clusters have been updated to Luminous 12.2.2 and additional clusters added that have never run Jewel.
>
> Today I find that a few percent of volumes in each cluster have issues, examples below.
>
> I'm concerned that these issues may present problems when using rbd-mirror to move volumes between clusters. Many instances involve heads or nodes of snapshot trees; it's possible but unverified that those not currently snap-related may have been in the past.
>
> In the Jewel days we retroactively applied fast-diff, object-map to existing volumes but did not bother with tombstones.
>
> Any thoughts on
>
> 1) How this happens?

If you enabled object-map and/or fast-diff on pre-existing images, then the object-map is automatically flagged as invalid, since just enabling the feature doesn't rebuild the object-map. The flag just instructs librbd clients not to trust the object-map, so all optimizations are disabled.

> 2) Is "rbd object-map rebuild" always safe, especially on volumes that are in active use?

Yes, the live rebuild of the HEAD image is just proxied over to the current exclusive-lock owner. Rebuilds of any snapshot object-maps are performed by the rbd CLI.

> 3) The disturbing messages spewed by `rbd ls` -- related or not?

Some of the errors spewed by "rbd ls" are not specifically related to the object-map feature. For example, it appears that you have at least two cloned images where the parent image snapshot is no longer available (librbd::image::RefreshParentRequest: failed to locate snapshot). It also appears that at least two of the images in your RBD directory don't exist (librbd::image::OpenRequest: failed to retreive immutable metadata).
However, for the "librbd::object_map::RefreshRequest: failed to load object map" logs, those are harmless if you enabled the object-map after the snapshot was created and haven't rebuilt the object map yet.

> 4) Would this, as I fear, confound successful rbd-mirror migration?

Nope -- rbd-mirror uses the journal for synchronization.

> I've found http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012137.html that *seems* to indicate that a live rebuild is safe, but I'm still uncertain about the root cause, and if it's still happening. I've never ventured into this dark corner before so I'm being careful.
>
> All clients are QEMU/libvirt; most are 12.2.2 but there are some lingering Jewel, most likely 10.2.6 or perhaps 10.2.3. E.g.:
>
>
> # ceph features
> {
>     "mon": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 5
>         }
>     },
>     "osd": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 983
>         }
>     },
>     "client": {
>         "group": {
>             "features": "0x7fddff8ee84bffb",
>             "release": "jewel",
>             "num": 15
>         },
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 3352
>         }
>     }
> }
>
>
> # rbd ls -l |wc
> 2018-10-05 20:55:17.397288 7f976cff9700 -1 librbd::image::RefreshParentRequest: failed to locate snapshot: Snapshot with this id not found
> 2018-10-05 20:55:17.397334 7f976cff9700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory
> 2018-10-05 20:55:17.397397 7f976cff9700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory
> 2018-10-05 20:55:17.398025 7f976cff9700 -1 librbd::io::AioCompletion: 0x7f978667b570 fail: (2) No such file or directory
> 2018-10-05 20:55:17.398075 7f976cff9700 -1 librbd::image::RefreshParentRequest: failed to locate snapshot: Snapshot with this id not found
> 2018-10-05 20:55:17.398079 7f976cff9700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory
> 2018-10-05 20:55:17.398096 7f976cff9700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory
> 2018-10-05 20:55:17.398659 7f976cff9700 -1 librbd::io::AioCompletion: 0x7f978660c240 fail: (2) No such file or directory
> 2018-10-05 20:55:30.416174 7f976cff9700 -1 librbd::io::AioCompletion: 0x7f9786cd5ee0 fail: (2) No such file or directory
> 2018-10-05 20:55:34.083188 7f976d7fa700 -1 librbd::object_map::RefreshRequest: failed to load object map: rbd_object_map.b18d634146825.0000000000002d8f
> 2018-10-05 20:55:34.084101 7f976cff9700 -1 librbd::object_map::InvalidateRequest: 0x7f97544d11e0 should_complete: r=0
> 2018-10-05 20:55:38.597014 7f976d7fa700 -1 librbd::image::OpenRequest: failed to retreive immutable metadata: (2) No such file or directory
> 2018-10-05 20:55:38.597109 7f976cff9700 -1 librbd::io::AioCompletion: 0x7f9786d3a7c0 fail: (2) No such file or directory
> 2018-10-05 20:55:51.584101 7f976d7fa700 -1 librbd::object_map::RefreshRequest: failed to load object map: rbd_object_map.c447c403109b2.0000000000006a04
> 2018-10-05 20:55:51.592616 7f976cff9700 -1 librbd::object_map::InvalidateRequest: 0x7f975409fee0 should_complete: r=0
> 2018-10-05 20:55:59.414229 7f976d7fa700 -1 librbd::image::OpenRequest: failed to retreive immutable metadata: (2) No such file or directory
> 2018-10-05 20:55:59.414321 7f976cff9700 -1 librbd::io::AioCompletion: 0x7f9786df0760 fail: (2) No such file or directory
> 2018-10-05 20:56:09.029179 7f976d7fa700 -1 librbd::object_map::RefreshRequest: failed to load object map: rbd_object_map.9b28e148b97af.0000000000006a09
> 2018-10-05 20:56:09.035212 7f976cff9700 -1 librbd::object_map::InvalidateRequest: 0x7f9754644030 should_complete: r=0
> 2018-10-05 20:56:09.036087 7f976d7fa700 -1 librbd::object_map::RefreshRequest: failed to load object map: rbd_object_map.9b28e148b97af.0000000000006a0a
> 2018-10-05 20:56:09.042200 7f976cff9700 -1 librbd::object_map::InvalidateRequest: 0x7f97541d2c10 should_complete: r=0
>    6544   22993 1380784
>
> # rbd du
> warning: fast-diff map is invalid for -1037424/950f705d-d575-11e7-acf6-0242ac114406@-1037424/d2600c5e-d83a-11e7-acf6-0242ac114406. operation may be slow.
> warning: fast-diff map is not enabled for -01f5fda8-e57d-11e7-a428-0242ac110705. operation may be slow.
> warning: fast-diff map is not enabled for -069b999b-76b7-11e7-9738-0242ac110704. operation may be slow.
> warning: fast-diff map is not enabled for -19576951-36ad-11e8-9dc3-0242ac11090d. operation may be slow.
> warning: fast-diff map is not enabled for -f519bcc3-3515-11e8-9dc3-0242ac11090d. operation may be slow.
> warning: fast-diff map is not enabled for -f8031a79-1a19-11e8-bf57-0242ac11180a. operation may be slow.
> warning: fast-diff map is invalid for -875915ce-a156-11e6-9216-000f533054e0@8f003d3a-82be-11e7-a90c-0242ac110704. operation may be slow.
> warning: fast-diff map is invalid for -aaf6ca96-9548-11e6-a7f8-000f53304d81@/1b3081ea-af70-11e6-9216-000f533054e0. operation may be slow.
> warning: fast-diff map is invalid for -207a435f-569d-11e7-aad7-0242ac110405@3a300bdb-3df5-11e8-9b2b-0242ac116704. operation may be slow.
> warning: fast-diff map is invalid for 6aa9d531-57e0-11e7-89c0-0242ac110704@e5406f7f-c37b-11e8-acc7-0a58ac14d11f. operation may be slow.
> warning: fast-diff map is invalid for 6aa9d531-57e0-11e7-89c0-0242ac110704@f2b06990-c444-11e8-acc7-0a58ac14d11f. operation may be slow.
> warning: fast-diff map is invalid for 6aa9d531-57e0-11e7-89c0-0242ac110704@1df20bdc-c50e-11e8-acc7-0a58ac14d11f. operation may be slow.
>
> # rbd info rbd/950f705d-d575-11e7-acf6-0242ac114406@-d2600c5e-d83a-11e7-acf6-0242ac114406
> rbd image '950f705d-d575-11e7-acf6-0242ac114406':
>     size 51200 MB in 12800 objects
>     order 22 (4096 kB objects)
>     block_name_prefix: rbd_data.3067d01131bc6e
>     format: 2
>     features: layering, striping, exclusive-lock, object-map, fast-diff, deep-flatten
>     flags: object map invalid, fast diff invalid.
>     protected: True
>     stripe unit: 4096 kB
>     stripe count: 1
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Jason
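
As a follow-up to the rebuild advice in this thread, here is a minimal Python sketch (not part of the original exchange) that scans the warnings printed by `rbd du` and builds one `rbd object-map rebuild` command line per image or snapshot flagged as invalid. The function names, the default pool name `rbd`, the sample image names, and the assumption that the name printed in the warning can be used directly as an image or snapshot spec are all illustrative; the script only prints the commands and does not run anything against a cluster.

```python
import re

# Matches warnings of the form quoted in this thread:
#   warning: fast-diff map is invalid for <spec>. operation may be slow.
# "not enabled" warnings are deliberately ignored: those images lack the
# feature altogether, so there is no object map to rebuild.
_INVALID_RE = re.compile(
    r"warning: fast-diff map is invalid for (?P<spec>\S+)\. operation may be slow\."
)


def invalid_specs(du_output):
    """Return specs flagged invalid, in first-seen order, without duplicates."""
    seen = []
    for match in _INVALID_RE.finditer(du_output):
        spec = match.group("spec")
        if spec not in seen:
            seen.append(spec)
    return seen


def rebuild_commands(du_output, pool="rbd"):
    """Build (but do not run) one argv list per invalid spec.

    Assumption: the spec in the warning is relative to `pool`.
    """
    return [
        ["rbd", "object-map", "rebuild", f"{pool}/{spec}"]
        for spec in invalid_specs(du_output)
    ]


if __name__ == "__main__":
    # Hypothetical sample input; real input would be the stderr of `rbd du`.
    sample = (
        "warning: fast-diff map is not enabled for image-a. operation may be slow.\n"
        "warning: fast-diff map is invalid for image-b@snap-1. operation may be slow.\n"
    )
    for argv in rebuild_commands(sample):
        print(" ".join(argv))
```

Printing the commands instead of executing them leaves room to review the list first, which seems prudent given that some entries in the listing above refer to images or parent snapshots that no longer exist.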