Re: OSD turned itself off

Woah, major thread necromancy! :)

On Feb 13, 2015, at 3:03 PM, Josef Johansson <josef@xxxxxxxxxxx> wrote:
> 
> Hi,
> 
> I skimmed the logs again, as we’ve had more of these kinds of errors.
> 
> I saw a lot of lossy connection errors:
> -2567> 2014-11-24 11:49:40.028755 7f6d49367700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.54:0/1011446 pipe(0x19321b80 sd=44 :6819 s=0 pgs=0 cs=0 l=1 c=0x110d2b00).accept replacing existing (lossy) channel (new one lossy=1)
> -2564> 2014-11-24 11:49:42.000463 7f6d51df1700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/1015676 pipe(0x22d6000 sd=204 :6819 s=0 pgs=0 cs=0 l=1 c=0x16e218c0).accept replacing existing (lossy) channel (new one lossy=1)
> -2563> 2014-11-24 11:49:47.704467 7f6d4d1a5700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.52:0/3029106 pipe(0x231f6780 sd=158 :6819 s=0 pgs=0 cs=0 l=1 c=0x136bd1e0).accept replacing existing (lossy) channel (new one lossy=1)
> -2562> 2014-11-24 11:49:48.180604 7f6d4cb9f700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.52:0/2027138 pipe(0x1657f180 sd=254 :6819 s=0 pgs=0 cs=0 l=1 c=0x13273340).accept replacing existing (lossy) channel (new one lossy=1)
> -2561> 2014-11-24 11:49:48.808604 7f6d4c498700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.52:0/2023529 pipe(0x12831900 sd=289 :6819 s=0 pgs=0 cs=0 l=1 c=0x12401600).accept replacing existing (lossy) channel (new one lossy=1)
> -2559> 2014-11-24 11:49:50.128379 7f6d4b88c700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1023180 pipe(0x11cb2280 sd=309 :6819 s=0 pgs=0 cs=0 l=1 c=0x1280a000).accept replacing existing (lossy) channel (new one lossy=1)
> -2558> 2014-11-24 11:49:52.472867 7f6d425eb700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.52:0/3019692 pipe(0x18eb4a00 sd=311 :6819 s=0 pgs=0 cs=0 l=1 c=0x10df6b00).accept replacing existing (lossy) channel (new one lossy=1)
> -2556> 2014-11-24 11:49:55.100208 7f6d49e72700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/3021273 pipe(0x1bacf680 sd=353 :6819 s=0 pgs=0 cs=0 l=1 c=0x164ae2c0).accept replacing existing (lossy) channel (new one lossy=1)
> -2555> 2014-11-24 11:49:55.776568 7f6d49468700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/3024351 pipe(0x1bacea00 sd=20 :6819 s=0 pgs=0 cs=0 l=1 c=0x1887ba20).accept replacing existing (lossy) channel (new one lossy=1)
> -2554> 2014-11-24 11:49:57.704437 7f6d49165700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.52:0/1023529 pipe(0x1a32ac80 sd=213 :6819 s=0 pgs=0 cs=0 l=1 c=0xfe93b80).accept replacing existing (lossy) channel (new one lossy=1)
> -2553> 2014-11-24 11:49:58.694246 7f6d47549700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/3017204 pipe(0x102e5b80 sd=370 :6819 s=0 pgs=0 cs=0 l=1 c=0xfb5a000).accept replacing existing (lossy) channel (new one lossy=1)
> -2551> 2014-11-24 11:50:00.412242 7f6d4673b700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.52:0/3027138 pipe(0x1b83b400 sd=250 :6819 s=0 pgs=0 cs=0 l=1 c=0x12922dc0).accept replacing existing (lossy) channel (new one lossy=1)
> -2387> 2014-11-24 11:50:22.761490 7f6d44fa4700  0 -- 10.168.7.23:6840/4010217 >> 10.168.7.25:0/27131 pipe(0xfc60c80 sd=300 :6840 s=0 pgs=0 cs=0 l=1 c=0x1241d080).accept replacing existing (lossy) channel (new one lossy=1)
> -2300> 2014-11-24 11:50:31.366214 7f6d517eb700  0 -- 10.168.7.23:6840/4010217 >> 10.168.7.22:0/15549 pipe(0x193b3180 sd=214 :6840 s=0 pgs=0 cs=0 l=1 c=0x10ebbe40).accept replacing existing (lossy) channel (new one lossy=1)
> -2247> 2014-11-24 11:50:37.372934 7f6d4a276700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/1013890 pipe(0x25d4780 sd=112 :6819 s=0 pgs=0 cs=0 l=1 c=0x10666580).accept replacing existing (lossy) channel (new one lossy=1)
> -2246> 2014-11-24 11:50:37.738539 7f6d4f6ca700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/3026502 pipe(0x1338ea00 sd=230 :6819 s=0 pgs=0 cs=0 l=1 c=0x123f11e0).accept replacing existing (lossy) channel (new one lossy=1)
> -2245> 2014-11-24 11:50:38.390093 7f6d48c60700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/2026502 pipe(0x16ba7400 sd=276 :6819 s=0 pgs=0 cs=0 l=1 c=0x7d4fb80).accept replacing existing (lossy) channel (new one lossy=1)
> -2242> 2014-11-24 11:50:40.505458 7f6d3e43a700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1012682 pipe(0x12a53180 sd=183 :6819 s=0 pgs=0 cs=0 l=1 c=0x10537080).accept replacing existing (lossy) channel (new one lossy=1)
> -2198> 2014-11-24 11:51:14.273025 7f6d44ea3700  0 -- 10.168.7.23:6865/5010217 >> 10.168.7.25:0/30755 pipe(0x162bb680 sd=327 :6865 s=0 pgs=0 cs=0 l=1 c=0x16e21600).accept replacing existing (lossy) channel (new one lossy=1)
> -1881> 2014-11-29 00:45:42.247394 7f6d5c155700  0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(949861 rbd_data.1c56a792eb141f2.0000000000006200 [stat,write 2228224~12288] ondisk = 0) v4 remote, 10.168.7.54:0/1025735, failed lossy con, dropping message 0x1bc00400
>  -976> 2015-01-05 07:10:01.763055 7f6d5c155700  0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(11034565 rbd_data.1cc69562eb141f2.00000000000003ce [stat,write 1925120~4096] ondisk = 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 0x12989400
>  -855> 2015-01-10 22:01:36.589036 7f6d5b954700  0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(727627 rbd_data.1cc69413d1b58ba.0000000000000055 [stat,write 2289664~4096] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 0x24f68800
>  -819> 2015-01-12 05:25:06.229753 7f6d3646c700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/2019809 pipe(0x1f0e9680 sd=460 :6819 s=0 pgs=0 cs=0 l=1 c=0x13090420).accept replacing existing (lossy) channel (new one lossy=1)
>  -818> 2015-01-12 05:25:06.581703 7f6d37534700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1025252 pipe(0x1b67a780 sd=71 :6819 s=0 pgs=0 cs=0 l=1 c=0x16311e40).accept replacing existing (lossy) channel (new one lossy=1)
>  -817> 2015-01-12 05:25:21.342998 7f6d41167700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1025579 pipe(0x114e8000 sd=502 :6819 s=0 pgs=0 cs=0 l=1 c=0x16310160).accept replacing existing (lossy) channel (new one lossy=1)
>  -808> 2015-01-12 16:01:35.783534 7f6d5b954700  0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(752034 rbd_data.1cc69413d1b58ba.0000000000000055 [stat,write 2387968~8192] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 0x1fde9a00
>  -515> 2015-01-25 18:44:23.303855 7f6d5b954700  0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(46402240 rbd_data.4b8e9b3d1b58ba.0000000000000471 [read 1310720~4096] ondisk = 0) v4 remote, 10.168.7.51:0/1017204, failed lossy con, dropping message 0x250bce00
>  -303> 2015-02-02 22:30:03.140599 7f6d5c155700  0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(17710313 rbd_data.1cc69562eb141f2.00000000000003ce [stat,write 4145152~4096] ondisk = 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 0x1c5d4200
>  -236> 2015-02-05 15:29:04.945660 7f6d3d357700  0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/1026961 pipe(0x1c63e780 sd=203 :6819 s=0 pgs=0 cs=0 l=1 c=0x11dc8dc0).accept replacing existing (lossy) channel (new one lossy=1)
>   -66> 2015-02-10 20:20:36.673969 7f6d5b954700  0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(11088 rbd_data.10b8c82eb141f2.0000000000004459 [stat,write 749568~8192] ondisk = 0) v4 remote, 10.168.7.55:0/1005630, failed lossy con, dropping message 0x138db200
> 
> Could this have led to the data being erroneous, or is the -5 return code just a sign of a broken hard drive?
> 

These are the OSDs creating new connections to each other because the previous ones failed. That's not necessarily a problem (although here it's probably a symptom of some kind of issue, given the frequency) and cannot introduce data corruption of any kind.
I’m not seeing any -5 return codes in that messenger debug output, so unless you’re referring to your EIO from last June, I’m not sure what you mean. (If you do mean EIOs, yes, they’re still a sign of a broken hard drive or local FS.)
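
For anyone curious what "replacing existing (lossy) channel" means in practice, here's a rough sketch of the semantics (a toy illustration with made-up names, not the real SimpleMessenger code): client sessions are lossy, so when a peer reconnects the OSD tears down the old pipe, drops whatever replies were still queued on it, and relies on the client to resend its requests on the new connection.

  #include <cstdio>
  #include <deque>
  #include <memory>
  #include <string>
  #include <utility>

  // Toy model of a lossy messenger channel; names are invented for
  // illustration and do not match the actual Ceph source.
  struct Message { std::string payload; };

  struct Channel {
      bool lossy = true;              // the "l=1" in the pipe() log lines
      std::deque<Message> out_queue;  // replies waiting to be sent
  };

  struct PeerSession {
      std::shared_ptr<Channel> chan;

      // ".accept replacing existing (lossy) channel (new one lossy=1)":
      // a new connection from the same peer simply replaces the old one.
      void accept_replacing(std::shared_ptr<Channel> fresh) {
          if (chan && chan->lossy) {
              // Lossy semantics: queued messages are discarded, never
              // resent. The client notices the reset and retries, so
              // nothing acknowledged is lost and no data is corrupted.
              std::printf("dropping %zu queued message(s)\n",
                          chan->out_queue.size());
          }
          chan = std::move(fresh);
      }

      // "failed lossy con, dropping message": sending on a dead lossy
      // connection silently drops the message instead of blocking.
      void submit_message(Message m, bool con_failed) {
          if (con_failed)
              return;                 // dropped; the client will retry
          chan->out_queue.push_back(std::move(m));
      }
  };

The important property is the one in those comments: a dropped reply on a lossy connection turns into a client-side retry, never into bad data on disk.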

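Since the EIO question keeps resurfacing, here's the shape of the check that killed your OSD last June (simplified from memory; the function signature and logic below are approximations of FileStore.cc, not a verbatim copy):

  #include <cassert>
  #include <cerrno>
  #include <sys/types.h>
  #include <unistd.h>

  // Approximation of the FileStore::read() check behind
  // "FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)".
  static const bool m_filestore_fail_eio = true;  // die on EIO by default

  ssize_t filestore_read(int fd, void *buf, size_t len, off_t off,
                         bool allow_eio = false) {
      ssize_t got = ::pread(fd, buf, len, off);
      if (got < 0)
          got = -errno;  // EIO is errno 5, hence the -5 in the assert
      // An EIO here means the local disk or filesystem could not return
      // the data. Rather than serve or replicate garbage, the OSD
      // asserts and dies; the cluster recovers from healthy replicas.
      assert(allow_eio || !m_filestore_fail_eio || got != -EIO);
      return got;
  }

So the -5 in that assert is the read's return code, not anything from the messenger layer: it comes straight from the drive or local FS, which is why the advice from June (replace or reformat the disk) still stands.
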
> Cheers,
> Josef
> 
>> On 14 Jun 2014, at 02:38, Josef Johansson <josef@xxxxxxxxxxx> wrote:
>> 
>> Thanks for the quick response.
>> 
>> Cheers,
>> Josef
>> 
>> Gregory Farnum skrev 2014-06-14 02:36:
>>> On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson <josef@xxxxxxxxxxx> wrote:
>>>> Hi Greg,
>>>> 
>>>> Thanks for the clarification. I believe the OSD was in the middle of a deep
>>>> scrub (sorry for not mentioning this straight away), so it could have been
>>>> a silent error that came to light during the scrub?
>>> Yeah.
>>> 
>>>> What's best practice when the store is corrupted like this?
>>> Remove the OSD from the cluster, and either reformat the disk or
>>> replace as you judge appropriate.
>>> -Greg
>>> 
>>>> Cheers,
>>>> Josef
>>>> 
>>>> Gregory Farnum skrev 2014-06-14 02:21:
>>>> 
>>>>> The OSD did a read off of the local filesystem and it got back the EIO
>>>>> error code. That means the store got corrupted or something, so it
>>>>> killed itself to avoid spreading bad data to the rest of the cluster.
>>>>> -Greg
>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>> 
>>>>> 
>>>>> On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson <josef@xxxxxxxxxxx>
>>>>> wrote:
>>>>>> Hey,
>>>>>> 
>>>>>> Just examining what happened to an OSD that just turned itself off. Data has
>>>>>> been moved away from it, so I'm hesitant to turn it back on.
>>>>>> 
>>>>>> Got the below in the logs; any clues as to what the assert is about?
>>>>>> 
>>>>>> Cheers,
>>>>>> Josef
>>>>>> 
>>>>>> -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88c700 time 2014-06-11 21:13:54.036982
>>>>>> os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
>>>>>> 
>>>>>>  ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
>>>>>>  1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
>>>>>>  2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x350) [0x708230]
>>>>>>  3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86) [0x713366]
>>>>>>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3095) [0x71acb5]
>>>>>>  5: (PG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f0) [0x812340]
>>>>>>  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2ea) [0x75c80a]
>>>>>>  7: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x198) [0x770da8]
>>>>>>  8: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x7a89ce]
>>>>>>  9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b5dea]
>>>>>>  10: (ThreadPool::WorkThread::entry()+0x10) [0x9b7040]
>>>>>>  11: (()+0x6b50) [0x7fdadffdfb50]
>>>>>>  12: (clone()+0x6d) [0x7fdade53b0ed]
>>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>> 
>>>>>> 
>>>> 
>> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
