Re: Part 2: ssd osd fails often with "FAILED assert(soid < scrubber.start || soid >= scrubber.end)"

Hi,

This is http://tracker.ceph.com/issues/8011; the fix is being backported.
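
For anyone reading the quoted log: the failed assertion appears to require that the object being updated (soid) lies outside the half-open range [scrubber.start, scrubber.end) that a scrub currently covers. In the backtrace below it fires in finish_ctx, reached from finish_flush. Here is a minimal Python sketch of that range check. It is an illustration only, not Ceph code: the names ScrubRange and outside_scrub_range are hypothetical, and plain strings stand in for Ceph's hobject_t, whose real ordering is different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrubRange:
    """Hypothetical stand-in for the scrubber's current chunk: [start, end)."""
    start: str
    end: str


def outside_scrub_range(soid: str, scrub: ScrubRange) -> bool:
    """Mirror the asserted condition: soid < start or soid >= end.

    Assumption: string comparison replaces the real hobject_t ordering.
    """
    return soid < scrub.start or soid >= scrub.end


def finish_update(soid: str, scrub: ScrubRange) -> None:
    """Refuse to complete an update on an object inside the scrubbed range."""
    assert outside_scrub_range(soid, scrub), (
        f"{soid!r} is inside scrub range [{scrub.start!r}, {scrub.end!r})"
    )


if __name__ == "__main__":
    chunk = ScrubRange(start="obj_b", end="obj_d")
    finish_update("obj_a", chunk)  # before the range: allowed
    finish_update("obj_d", chunk)  # end is exclusive: allowed
    try:
        finish_update("obj_c", chunk)  # inside the range: assertion fails
    except AssertionError as err:
        print("assert failed:", err)
```

The end bound is exclusive, so an object equal to scrubber.end passes the check while one equal to scrubber.start does not.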

Cheers

On 13/01/2015 22:00, Udo Lembke wrote:
> Hi again,
> sorry for not replying in the thread, but my last email didn't come back
> from the mailing list (I often miss some posts!).
> 
> Just after I sent the last mail, another SSD failed for the first time - in
> this case a cheap one, but with the same error:
> 
> root@ceph-04:/var/log/ceph# more ceph-osd.62.log
> 2015-01-13 16:40:55.712967 7fb29cfd3700  0 log [INF] : 17.2 scrub ok
> 2015-01-13 17:54:35.548361 7fb29dfd5700  0 log [INF] : 17.3 scrub ok
> 2015-01-13 17:54:38.007014 7fb29dfd5700  0 log [INF] : 17.5 scrub ok
> 2015-01-13 17:54:41.215558 7fb29d7d4700  0 log [INF] : 17.f scrub ok
> 2015-01-13 17:54:42.277585 7fb29dfd5700  0 log [INF] : 17.a scrub ok
> 2015-01-13 17:54:48.961582 7fb29d7d4700  0 log [INF] : 17.6 scrub ok
> 2015-01-13 20:15:08.749597 7fb292337700  0 -- 192.168.3.14:6824/9185 >> 192.168.3.15:6824/11735 pipe(0x107d9680 sd=307 :6824 s=2 pgs=2 cs=1 l=0 c=0x124a09a0).fault, initiating reconnect
> 2015-01-13 20:15:08.750803 7fb296dbe700  0 -- 192.168.3.14:0/9185 >> 192.168.3.15:6825/11735 pipe(0xd011180 sd=42 :0 s=1 pgs=0 cs=0 l=1 c=0x8d19760).fault
> 2015-01-13 20:15:08.750804 7fb292b3f700  0 -- 192.168.3.14:0/9185 >> 172.20.2.15:6837/11735 pipe(0x1210f900 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0xbeae840).fault
> 2015-01-13 20:15:08.751056 7fb291d31700  0 -- 192.168.3.14:6824/9185 >> 192.168.3.15:6824/11735 pipe(0x107d9680 sd=29 :6824 s=1 pgs=2 cs=2 l=0 c=0x124a09a0).fault
> 2015-01-13 20:15:27.035342 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:07.035339)
> 2015-01-13 20:15:28.036773 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.036769)
> 2015-01-13 20:15:28.945179 7fb29b7d0700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:08.945178)
> 2015-01-13 20:15:29.037016 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:09.037014)
> 2015-01-13 20:15:30.037204 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.037202)
> 2015-01-13 20:15:30.645491 7fb29b7d0700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:10.645483)
> 2015-01-13 20:15:31.037326 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:11.037323)
> 2015-01-13 20:15:32.037442 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:12.037439)
> 2015-01-13 20:15:33.037641 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:13.037637)
> 2015-01-13 20:15:34.037843 7fb2b3edd700 -1 osd.62 116422 heartbeat_check: no reply from osd.61 since back 2015-01-13 20:15:06.843259 front 2015-01-13 20:15:06.843259 (cutoff 2015-01-13 20:15:14.037839)
> 2015-01-13 21:39:35.241153 7fb29dfd5700  0 log [INF] : 17.d scrub ok
> 2015-01-13 21:39:39.293113 7fb29a7ce700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fb29a7ce700 time 2015-01-13 21:39:39.279799
> osd/ReplicatedPG.cc: 5306: FAILED assert(soid < scrubber.start || soid >= scrubber.end)
> 
>  ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>  1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x1320) [0x9296b0]
>  2: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x5f6) [0x92b076]
>  3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x296) [0x92b876]
>  4: (C_Flush::finish(int)+0x86) [0x986226]
>  5: (Context::complete(int)+0x9) [0x78f449]
>  6: (Finisher::finisher_thread_entry()+0x1c8) [0xad5a18]
>  7: (()+0x6b50) [0x7fb2b94ceb50]
>  8: (clone()+0x6d) [0x7fb2b80dc7bd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> 
> --- begin dump of recent events ---
>   -127> 2015-01-10 19:39:41.861724 7fb2b9faa780  5 asok(0x28e4230)
> register_command perfcounters_dump hook 0x28d4010
>   -126> 2015-01-10 19:39:41.861749 7fb2b9faa780  5 asok(0x28e4230)
> register_command 1 hook 0x28d4010
>   -125> 2015-01-10 19:39:41.861753 7fb2b9faa780  5 asok(0x28e4230)
> register_command perf dump hook 0x28d4010
>   -124> 2015-01-10 19:39:41.861756 7fb2b9faa780  5 asok(0x28e4230)
> register_command perfcounters_schema hook 0x28d4010
>   -123> 2015-01-10 19:39:41.861759 7fb2b9faa780  5 asok(0x28e4230)
> register_command 2 hook 0x28d4010
>   -122> 2015-01-10 19:39:41.861762 7fb2b9faa780  5 asok(0x28e4230)
> register_command perf schema hook 0x28d4010
>   -121> 2015-01-10 19:39:41.861764 7fb2b9faa780  5 asok(0x28e4230)
> register_command config show hook 0x28d4010
>   -120> 2015-01-10 19:39:41.861768 7fb2b9faa780  5 asok(0x28e4230)
> register_command config set hook 0x28d4010
>   -119> 2015-01-10 19:39:41.861773 7fb2b9faa780  5 asok(0x28e4230)
> register_command config get hook 0x28d4010
>   -118> 2015-01-10 19:39:41.861779 7fb2b9faa780  5 asok(0x28e4230)
> register_command log flush hook 0x28d4010
>   -117> 2015-01-10 19:39:41.861784 7fb2b9faa780  5 asok(0x28e4230)
> register_command log dump hook 0x28d4010
>   -116> 2015-01-10 19:39:41.861789 7fb2b9faa780  5 asok(0x28e4230)
> register_command log reopen hook 0x28d4010
>   -115> 2015-01-10 19:39:41.864385 7fb2b9faa780  0 ceph version 0.80.7
> (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-osd, pid 9185
>   -114> 2015-01-10 19:39:41.873624 7fb2b9faa780  1 finished
> global_init_daemonize
>   -113> 2015-01-10 19:39:41.892039 7fb2b9faa780  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-62) detect_features:
> FIEMAP ioctl is supported and appears to work
>   -112> 2015-01-10 19:39:41.892081 7fb2b9faa780  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-62) detect_features:
> FIEMAP ioctl is disabled via 'filestore fiemap' config option
>   -111> 2015-01-10 19:39:41.902334 7fb2b9faa780  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-62) detect_features:
> syscall(SYS_syncfs, fd) fully supported
>   -110> 2015-01-10 19:39:41.983875 7fb2b9faa780  0
> filestore(/var/lib/ceph/osd/ceph-62) limited size xattrs
>   -109> 2015-01-10 19:39:42.112708 7fb2b9faa780  0
> filestore(/var/lib/ceph/osd/ceph-62) mount: enabling WRITEAHEAD journal
> mode: checkpoint
> 
> Udo
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


