Re: Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing

On 03/25/2013 03:04 AM, Oliver Francke wrote:
Hi Josh,

logfile is attached...

Thanks. It shows nothing out of the ordinary, but I just reproduced the
incorrect rollback locally, so it shouldn't be hard to track down from
here.

I opened http://tracker.ceph.com/issues/4551 to track it.

Josh

On 03/22/2013 08:30 PM, Josh Durgin wrote:
On 03/22/2013 12:09 PM, Oliver Francke wrote:
Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but
perhaps the two issues have something in common.

Today I installed a fresh cluster with mkcephfs, which went fine, imported a
"master" Debian 6.0 image with "format 2", made a snapshot, protected
it, and made some clones.
The clones were mounted with qemu-nbd, I fiddled a bit with
IP/interfaces/hosts/net.rules… etc. and cleanly unmounted them; the VM started,
took 2 seconds and was up and running. Cool.
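For reference, the setup was roughly along these lines (pool/image names are
placeholders here, and the exact rbd flags can vary between versions):

  rbd import --format 2 debian-6.0.img rbd/debian6-master
  rbd snap create rbd/debian6-master@base
  rbd snap protect rbd/debian6-master@base
  rbd clone rbd/debian6-master@base rbd/vm01
  qemu-nbd -c /dev/nbd0 rbd:rbd/vm01
  mount /dev/nbd0p1 /mnt      # fiddle with IP/interfaces/hosts/net.rules
  umount /mnt
  qemu-nbd -d /dev/nbd0       # disconnect cleanly before starting the VM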

Now an ordinary shutdown was performed, and I made a snapshot of this
image. Started again, did some "apt-get update… install something…".
Shutdown -> rbd rollback -> startup again -> login -> install something
else… the filesystem showed "many" ext3 errors, fell into read-only mode,
massive corruption.
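Condensed, the failing sequence is roughly:

  rbd snap create rbd/vm01@pristine     # snapshot after the clean shutdown
  # boot the VM, apt-get update / install something, shut down again
  rbd snap rollback rbd/vm01@pristine
  # boot the VM again -> ext3 errors, filesystem falls back to read-only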

This sounds like it might be a bug in rollback. Could you try cloning
and snapshotting again, but export the image before booting and again after
rolling back, and compare the md5sums?
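Something like the following should do, with placeholder names:

  rbd export rbd/vm01 before.img    # after cloning/snapshotting, before first boot
  # boot the VM, shut it down, run the rollback
  rbd export rbd/vm01 after.img
  md5sum before.img after.img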

Done; the first MD5 mismatch appears after 32 4 MB blocks, checked with dd
and a block size of 4 MB.
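The per-block comparison was roughly the following (file names as above, the
loop bound is arbitrary):

  for i in $(seq 0 255); do
    a=$(dd if=before.img bs=4M count=1 skip=$i 2>/dev/null | md5sum)
    b=$(dd if=after.img  bs=4M count=1 skip=$i 2>/dev/null | md5sum)
    [ "$a" = "$b" ] || { echo "block $i differs"; break; }
  done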


Running the rollback with:

--debug-ms 1 --debug-rbd 20 --log-file rbd-rollback.log

might help too. Does the ceph.conf on the host where you ran the rollback
have anything related to rbd_cache in it?
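To be explicit, the rollback with logging enabled would be something like
(with your image name substituted):

  rbd snap rollback rbd/vm01@pristine --debug-ms 1 --debug-rbd 20 --log-file rbd-rollback.log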

No cache settings in global ceph.conf.

Hope it helps,

Oliver.


The qemu config included ":rbd_cache=false", if it matters. The above scenario
is reproducible, and as I stated, no crash was detected.
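Concretely, the drive definition is of this form (pool/image name is a
placeholder, qemu 1.4 rbd syntax):

  -drive format=raw,file=rbd:rbd/vm01:rbd_cache=false,if=virtio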

Perhaps it is in the same area as the crash thread; otherwise I will
provide logfiles as needed.

It's unrelated; the other thread is an issue with the cache, which does
not cause corruption but triggers a crash.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



