Ok, I dug a bit more, and it seems to me that the problem is with the manifest that was
created. I was able to reproduce a similar issue (opened ceph bug #11622), for which I
also have a fix. I created new tests to cover this issue, and we'll get those fixes in
as soon as we can, after we test for any regressions.

Thanks,
Yehuda

----- Original Message -----
> From: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
> To: "Sean Sullivan" <seapasulli@xxxxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Wednesday, May 13, 2015 2:33:07 PM
> Subject: Re: RGW - Can't download complete object
>
> That's another interesting issue. Note that for part 12_80 the manifest
> specifies (I assume, judging by the messenger log) this part:
>
> default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80
> (note the 'tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14')
>
> whereas it seems that you do have the original part:
> default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.12_80
> (note the '2/...')
>
> The part that the manifest specifies does not exist, which makes me think
> that there is some weird upload sequence, something like:
>
> - client uploads a part; the upload finishes, but the client does not get the ack for it
> - client retries (second upload)
> - client gets the ack for the first upload and gives up on the second one
>
> But I'm not sure if that would explain the manifest; I'll need to take a look
> at the code. Could such a sequence happen with the client that you're using
> to upload?
>
> Yehuda
>
> ----- Original Message -----
> > From: "Sean Sullivan" <seapasulli@xxxxxxxxxxxx>
> > To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Sent: Wednesday, May 13, 2015 2:07:22 PM
> > Subject: Re: RGW - Can't download complete object
> >
> > Sorry for the delay. It took me a while to figure out how to do a range
> > request and append the data to a single file. The good news is that the
> > resulting file seems to be 14G in size, which matches the file's manifest size.
> > The bad news is that the file is completely corrupt and the radosgw log has errors.
> > I am using the following code to perform the download::
> >
> > https://raw.githubusercontent.com/mumrah/s3-multipart/master/s3-mp-download.py
> >
> > Here is a clip of the log file::
> > --
> > 2015-05-11 15:28:52.313742 7f570db7d700 1 -- 10.64.64.126:0/1033338 <== osd.11 10.64.64.101:6809/942707 5 ==== osd_op_reply(74566287 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_12 [read 0~858004] v0'0 uv41308 ondisk = 0) v6 ==== 304+0+858004 (1180387808 0 2445559038) 0x7f53d005b1a0 con 0x7f56f8119240
> > 2015-05-11 15:28:52.313797 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12934184960 len=858004
> > 2015-05-11 15:28:52.372453 7f570db7d700 1 -- 10.64.64.126:0/1033338 <== osd.45 10.64.64.101:6845/944590 2 ==== osd_op_reply(74566142 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 302+0+0 (3754425489 0 0) 0x7f53d005b1a0 con 0x7f56f81b1f30
> > 2015-05-11 15:28:52.372494 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12145655808 len=4194304
> >
> > 2015-05-11 15:28:52.372501 7f57067fc700 0 ERROR: got unexpected error when trying to read object: -2
> >
> > 2015-05-11 15:28:52.426079 7f570db7d700 1 -- 10.64.64.126:0/1033338 <== osd.21 10.64.64.102:6856/1133473 16 ==== osd_op_reply(74566144 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.11_12 [read 0~3671316] v0'0 uv41395 ondisk = 0) v6 ==== 304+0+3671316 (1695485150 0 3933234139) 0x7f53d005b1a0 con 0x7f56f81e17d0
> > 2015-05-11 15:28:52.426123 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=10786701312 len=3671316
> > 2015-05-11 15:28:52.504072 7f570db7d700 1 -- 10.64.64.126:0/1033338 <== osd.82 10.64.64.103:6857/88524 2 ==== osd_op_reply(74566283 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_8 [read 0~4194304] v0'0 uv41566 ondisk = 0) v6 ==== 303+0+4194304 (1474509283 0 3209869954) 0x7f53d005b1a0 con 0x7f56f81b1420
> > 2015-05-11 15:28:52.504118 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12917407744 len=4194304
> >
> > I couldn't really find any good documentation on how fragments/files are
> > laid out on the object file system, so I am not sure where the file will
> > be. How could the 4MB object have issues but the cluster be completely
> > healthy? I did do a rados stat of each object inside ceph, and they all
> > appear to be there::
> >
> > http://paste.ubuntu.com/11118561/
> >
> > The sum of all of the objects:: 14584887282
> > The stat of the object inside ceph:: 14577056082
> >
> > So for some reason I have more data in objects than the key manifest.
> > We easily identified this object via the same method as in the other thread I
> > have open::
> >
> > for key in keys:
> >     if key.name == 'b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam':
> >         implicit = key.size
> >         explicit = conn.get_bucket(bucket).get_key(key.name).size
> >         absolute = abs(implicit - explicit)
> >         print key.name
> >         print implicit
> >         print explicit
> >
> > b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam
> > 14578628946
> > 14577056082
> >
> > So it looks like I have 3 different sizes. I figure this may be the network
> > issue that was mentioned in the other thread, but seeing as this is not the
> > first 512k, the overall size still matches, and given the errors I am seeing
> > in the gateway, I feel that this may be a bigger issue.
> >
> > Has anyone seen this before? The only mention of "got unexpected error
> > when trying to read object" is here
> > (http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2014-May/021688.html),
> > but my google skills are pretty poor.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
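
For anyone chasing a similar manifest mismatch, here is a minimal sketch of the check
implied above: stat both candidate shadow objects for part 12_80 directly in RADOS and
see which one actually exists. It assumes the default '.rgw.buckets' data pool and a
readable /etc/ceph/ceph.conf (adjust both for your cluster); the two object names are
copied from the messenger log quoted earlier, and the librados Python bindings are used
so the same loop can be extended to every part of the object if needed.

# Minimal sketch: check which of the two 12_80 shadow objects actually exists in RADOS.
# The pool name and ceph.conf path below are assumptions for a default setup.
import rados

# Object names copied from the osd_op_reply lines in the log above: the first is the
# part the manifest references (the gateway got ENOENT for it), the second is the part
# that is actually present.
candidates = [
    "default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/"
    "28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80",
    "default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/"
    "28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.12_80",
]

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(".rgw.buckets")  # default RGW data pool
    try:
        for name in candidates:
            try:
                size, mtime = ioctx.stat(name)
                print("%s: exists, size=%d" % (name, size))
            except rados.ObjectNotFound:
                print("%s: MISSING (the -2 / ENOENT the gateway logged)" % name)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

The same check works from the shell with 'rados -p .rgw.buckets stat <object>', and
'radosgw-admin object stat --bucket=... --object=...' should dump the manifest so you
can see which part prefixes it actually references.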