Re: ceph-osd failure following 0.92 -> 0.94 upgrade

I've gone through the ceph-users mailing list and the only suggested fix (by Sage) was to roll back to V0.92, do ceph-osd -i NNN --flush-journal and then upgrade to V0.93 (which was the issue at the time).

However, I've done that and the V0.92 code faults for a different reason, which I suspect is a transaction added when the V0.94 code started to run. Out of 60 OSDs, about 50-55 have this problem.

My three solutions would seem to be:
(1) rebuild the journals losing all the journal transactions (not ideal)
(2) git clone the v0.92 code, modify the journal commit code to not barf on the V0.94 transactions
(3) git clone the v0.94 code, modify the journal commit code to not barf on the V0.92 transactions

Option #1 would lead to data loss but I think not OSD loss (which would be terrible).

Option #3 would seem more sensible than Option #2, but I assume that if #3 were easy to do,
then it would have been included in the V0.94 codebase instead of the errata in the V0.80 upgrade
comments which got me into this fix.

Suggestions on which is the better route, or an alternate fix? Right now, I have ~55 useless OSDs
and a lot of lost data.
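
To get an exact count of which OSDs hit which failure, the two error strings quoted in this thread could be matched against each OSD's log text. The following is only a rough sketch: the labels, the function names, and the idea of passing log text in as a dictionary keyed by OSD id are assumptions made for illustration, not part of any Ceph tool.

```python
# Sketch only: classify OSD log text by the two failure strings quoted in this thread.
# The labels and function names are hypothetical; only the matched strings come from the logs.
SIGNATURES = {
    "v0.94_assert": "FAILED assert(ops == data.ops)",
    "v0.92_malformed_input": "buffer::malformed_input",
}


def classify_log(text):
    """Return the sorted labels of every known failure string found in one log."""
    return sorted(label for label, needle in SIGNATURES.items() if needle in text)


def triage(logs_by_osd):
    """Group OSD ids by the failure strings that appear in their log text."""
    result = {label: [] for label in SIGNATURES}
    result["no_known_signature"] = []
    for osd_id in sorted(logs_by_osd):
        labels = classify_log(logs_by_osd[osd_id])
        if not labels:
            result["no_known_signature"].append(osd_id)
        for label in labels:
            result[label].append(osd_id)
    return result


if __name__ == "__main__":
    # Made-up sample input; real log text would have to be read from wherever
    # the OSD logs live on the host.
    sample = {
        4: "what():  buffer::malformed_input: __PRETTY_FUNCTION__ unknown encoding",
        7: "os/Transaction.cc: 504: FAILED assert(ops == data.ops)",
        9: "journal _open /var/lib/ceph/osd/ceph-9/journal fd 16",
    }
    print(triage(sample))
```

Keeping the matching on plain substrings avoids guessing at the log line format, which wraps differently depending on how the output was captured.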

On Thu, Apr 9, 2015 at 7:13 PM, Dirk Grunwald <Dirk.Grunwald@xxxxxxxxxxxx> wrote:
The step that would have prevented this (now hours-long) fix on my part was buried in material
labeled as "upgrade from 0.80x giant".

To prevent others from having the same issue, it may make sense to move the 0.92
issue to the forefront, like the single 0.93 issue called out.



On Thu, Apr 9, 2015 at 5:34 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
If you dig into the list archives I think somebody else went through
this when the issue was discovered and recovered successfully. But I
don't know the details. :)
-Greg

On Thu, Apr 9, 2015 at 3:38 PM, Dirk Grunwald
<Dirk.Grunwald@xxxxxxxxxxxx> wrote:
> Aha. That would have been useful to see -- I saw the notice about 0.93, but
> not that.
>
> when I roll back to v0.92, I get a different error (see below)
>
> This doesn't seem very happy - any suggestions?
>
>
> root@zfs2:~/XYZZY/v92# ceph-osd -d -i 4 --flush-journal
> 2015-04-09 16:31:44.756113 7f987f822900  0 ceph version 0.92
> (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0), process ceph-osd, pid 12605
> 2015-04-09 16:31:44.758743 7f987f822900  0
> filestore(/var/lib/ceph/osd/ceph-4) backend btrfs (magic 0x9123683e)
> 2015-04-09 16:31:44.807613 7f987f822900  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: FIEMAP
> ioctl is supported and appears to work
> 2015-04-09 16:31:44.807673 7f987f822900  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: FIEMAP
> ioctl is disabled via 'filestore fiemap' config option
> 2015-04-09 16:31:45.148028 7f987f822900  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features: syncfs(2)
> syscall fully supported (by glibc and kernel)
> 2015-04-09 16:31:45.148163 7f987f822900  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: CLONE_RANGE
> ioctl is supported
> 2015-04-09 16:31:45.923009 7f987f822900  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: SNAP_CREATE
> is supported
> 2015-04-09 16:31:45.923673 7f987f822900  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: SNAP_DESTROY
> is supported
> 2015-04-09 16:31:45.923979 7f987f822900  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: START_SYNC
> is supported (transid 372081)
> 2015-04-09 16:31:46.381367 7f987f822900  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: WAIT_SYNC is
> supported
> 2015-04-09 16:31:46.724449 7f987f822900  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature:
> SNAP_CREATE_V2 is supported
> 2015-04-09 16:31:47.473175 7f987f822900  0
> filestore(/var/lib/ceph/osd/ceph-4) mount: enabling PARALLEL journal mode:
> fs, checkpoint is enabled
>  HDIO_DRIVE_CMD(identify) failed: Invalid argument
> 2015-04-09 16:31:47.495711 7f987f822900  1 journal _open
> /var/lib/ceph/osd/ceph-4/journal fd 16: 1072693248 bytes, block size 4096
> bytes, directio = 1, aio = 1
> terminate called after throwing an instance of
> 'ceph::buffer::malformed_input'
>   what():  buffer::malformed_input: __PRETTY_FUNCTION__ unknown encoding
> version > 8
> *** Caught signal (Aborted) **
>  in thread 7f987f822900
>  ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0)
>  1: ceph-osd() [0xac511a]
>  2: (()+0x10340) [0x7f987e4da340]
>  3: (gsignal()+0x39) [0x7f987c979cc9]
>  4: (abort()+0x148) [0x7f987c97d0d8]
>
>
> On Thu, Apr 9, 2015 at 3:22 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Thu, Apr 9, 2015 at 2:05 PM, Dirk Grunwald
>> <Dirk.Grunwald@xxxxxxxxxxxx> wrote:
>> > Ceph cluster, U14.10 base system, OSD's using BTRFS, journal on same
>> > disk as
>> > partition
>> > (done using ceph-deploy)
>> >
>> > I had been running 0.92 without (significant) issue. I upgraded
>> > to Hammer (0.94) by modifying /etc/apt/sources.list, apt-get update,
>> > apt-get
>> > upgrade
>> >
>> > Upgraded and restarted ceph-mon and then ceph-osd
>> >
>> > Most of the 50 OSD's are in a failure cycle with the error
>> > "os/Transaction.cc: 504: FAILED assert(ops == data.ops)"
>> >
>> > Right now, the entire cluster is useless because of this.
>> >
>> > Any suggestions?
>>
>> It looks like maybe it's under the v0.80.x section instead of general
>> upgrading, but the release notes include:
>>
>> * If you are upgrading specifically from v0.92, you must stop all OSD
>>   daemons and flush their journals (``ceph-osd -i NNN
>>   --flush-journal``) before upgrading.  There was a transaction
>>   encoding bug in v0.92 that broke compatibility.  Upgrading from v0.93,
>>   v0.91, or anything earlier is safe.
>>
>
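
For anyone scripting the release-note step quoted above, a minimal dry-run sketch follows. It only builds the `ceph-osd -i NNN --flush-journal` command for each OSD id and executes nothing. The function names and the version-string handling are assumptions made for illustration. Stopping the OSD daemons first, which the release note requires, is left to whatever init system the host uses.

```python
# Sketch only: plan the journal flush described in the quoted release note.
# Nothing here runs ceph-osd; it just builds the argument lists for review.
# Function names and version-string handling are hypothetical.


def needs_journal_flush(from_version):
    """Per the quoted note, only an upgrade starting from v0.92 needs the flush."""
    return from_version.strip().lstrip("vV") == "0.92"


def flush_commands(osd_ids):
    """Build one `ceph-osd -i NNN --flush-journal` argv list per OSD id."""
    return [["ceph-osd", "-i", str(osd_id), "--flush-journal"] for osd_id in osd_ids]


def plan(from_version, osd_ids):
    """Return the commands to run while still on from_version, or [] if none apply."""
    if not needs_journal_flush(from_version):
        return []
    return flush_commands(osd_ids)


if __name__ == "__main__":
    # OSD daemons must already be stopped, and the host must still be on
    # v0.92, before any of these commands are actually run.
    for argv in plan("v0.92", [4, 5, 6]):
        print(" ".join(argv))
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere; per the release note, the flush itself has to happen before the packages are upgraded.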



