Re: Bug in rados bench with 0.94.6 (regression, not present in 0.94.5)

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Fri, 26 Feb 2016 11:54:24 +0100

I can reproduce this and have updated the ticket. (I only upgraded the
client, not the server.)

It seems to be related to the new --no-verify option, which is giving
strange results -- see the ticket.
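
To compare the two code paths side by side, something like the sketch
below might help. It only wraps the "seq" invocation from Christian's
report; the helper names are made up for illustration, and it assumes
--no-verify can simply be appended to the command line and that the
benchmark objects were already written with --no-cleanup.

```python
import shlex
import subprocess


def seq_bench_cmd(pool, seconds, threads, no_verify=False):
    """Build the argv for a `rados bench ... seq` run (hypothetical helper)."""
    cmd = ["rados", "-p", pool, "bench", str(seconds), "seq", "-t", str(threads)]
    if no_verify:
        # Assumption: the flag is accepted at the end of the command line.
        cmd.append("--no-verify")
    return cmd


def run_and_report(cmd):
    """Run one benchmark and print its exit status."""
    print("$ " + " ".join(shlex.quote(part) for part in cmd))
    result = subprocess.run(cmd)
    # On POSIX a negative return code means the process died from a signal,
    # e.g. -6 for the SIGABRT raised by a failed assertion.
    print("exit status:", result.returncode)
    return result.returncode


def compare_runs(pool="data", seconds=10, threads=32):
    """Run "seq" once without and once with --no-verify."""
    return {
        no_verify: run_and_report(seq_bench_cmd(pool, seconds, threads, no_verify))
        for no_verify in (False, True)
    }
```

Calling compare_runs() on a client node runs both variants against the
"data" pool and returns the two exit codes keyed by the no_verify flag.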

-- Dan

On Fri, Feb 26, 2016 at 11:48 AM, Alexey Sheplyakov
<asheplyakov@xxxxxxxxxxxx> wrote:
> Christian,
>
>> Note that "rand" works fine, as does "seq" on a 0.94.5 cluster.
>
> Could you please check if 0.94.5 ("old") *client* works with 0.94.6
> ("new") servers, and vice versa?
>
> Best regards,
>      Alexey
>
>
> On Fri, Feb 26, 2016 at 9:44 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>>
>> Hello,
>>
>> On my crappy test cluster (Debian Jessie, Hammer 0.94.6) I'm seeing rados
>> bench crashing doing "seq" runs.
>> As I'm testing cache tiers at the moment I also tried it with a normal,
>> replicated pool with the same result.
>>
>> After creating some benchmark objects with:
>> ---
>> rados -p data bench 20 write -t 32 --no-cleanup
>> ---
>>
>> A subsequent "seq" run against these objects ends in tears:
>> ---
>> # rados -p data bench 10 seq -t 32
>>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>      0       0         0         0         0         0         -         0
>> rados: ./common/Mutex.h:96: void Mutex::_pre_unlock(): Assertion `nlock > 0' failed.
>> *** Caught signal (Aborted) **
>>  in thread 7f1894100780
>>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>  1: rados() [0x4e5e23]
>>  2: (()+0xf8d0) [0x7f18915268d0]
>>  3: (gsignal()+0x37) [0x7f188fde6067]
>>  4: (abort()+0x148) [0x7f188fde7448]
>>  5: (()+0x2e266) [0x7f188fddf266]
>>  6: (()+0x2e312) [0x7f188fddf312]
>>  7: (Mutex::Unlock()+0xb3) [0x4fda93]
>>  8: (ObjBencher::seq_read_bench(int, int, int, int, bool)+0x127c) [0x4da37c]
>>  9: (ObjBencher::aio_bench(int, int, int, int, int, bool, char const*, bool)+0x2df) [0x4ded8f]
>>  10: (main()+0xa664) [0x4be834]
>>  11: (__libc_start_main()+0xf5) [0x7f188fdd2b45]
>>  12: rados() [0x4c2c97]
>> 2016-02-26 14:18:52.641052 7f1894100780 -1 *** Caught signal (Aborted) **
>>  in thread 7f1894100780
>> ---
>>
>> There's nothing particularly outstanding or malicious in the recent events;
>> here are the last 2:
>> ---
>>     -2> 2016-02-26 14:23:12.439214 7f18c113f780  1 -- 10.0.0.83:0/877189211 --> 10.0.0.85:6804/2921 -- osd_op(client.31691145.0:34 benchmark_data_engtest03_32406_object32 [read 0~4096] 0.def1bb6e ack+read+known_if_redirected e11724) v5 -- ?+0 0x39090d0 con 0x389bed0
>>     -1> 2016-02-26 14:23:12.439930 7f18b4549700  1 -- 10.0.0.83:0/877189211 <== osd.11 10.0.0.34:6802/2973 1 ==== osd_op_reply(9 benchmark_data_engtest03_32406_object7 [read 0~4096] v0'0 uv15 ondisk = 0) v6 ==== 205+0+4096 (2792458300 0 1108541644) 0x7f1864000ca0 con 0x38bbf80
>> ---
>>
>> Note that "rand" works fine, as does "seq" on a 0.94.5 cluster.
>>
>> While certainly not production related (or so one hopes!), this clinches it
>> for me: no upgrade to .6 tomorrow on the mission-critical cluster.
>>
>> Also created a tracker issue, despite resounding success (none, it
>> probably was silently fixed ^o^) of my previous one:
>> http://tracker.ceph.com/issues/14873
>>
>> Christian
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com