On Thu, 7 Jul 2016, Igor Fedotov wrote: > Varada, > > originally I was thinking about introduction additional container for dirty > Blobs for a transaction. By using smart pointers to Blob class instances one > can easily implement this. But it appeared that we don't use such pointers for > Blobs in other places hence it's not that easy to introduce. And probably > additional lock is good enough solution in the current code. Will think about > that a bit more though. Or let's wait for comments from Sage. Your diagnosis is right... I was thinking the cache->lock covered this but it doesn't protect us while we enumerate the Blobs. I think the best way to solve this is probably to fix // FIXME: use a per-txc dirty blob list? at the same time, and to make Blob refcounted. Take a reference to Blob when it is part of the BlobMap intrusive set, and take another when it is on the txc writing list(/vector). That we we don't need a new lock *and* we avoid traversing the entire blob_map looking for blob buffers to mark clean. > BTW probably I missed something but I can't find the location where Blob > instance is released now for out case (i.e. when it's erased from the blob > map). The only release location I found is at BlobMap::_clear but it doesn't > cover our case IMHO. Any ideas? Or we have some mem leak here? Yeah I think it's a leak. Probably need a delete b (or rather, b->put(), if we make Blob a RefCountedObject) in _wctx_finish. sage > > > > On 07.07.2016 16:30, Varada Kari wrote: > > Hi Igor, > > > > I also have the similar fix. Haven't tested yet completely. > > When i checked the blob id in the failed case, it was zero all the time. > > That verifies the same theory what you have. > > > > But i am wondering, how can we eliminate the lock here? we still need to > > have the > > lock when reading the blob_map entries and we have to maintain the order > > while making the > > changes to blob_map. please correct me i have understood wrongly. > > > > Varada > > > > On Thursday 07 July 2016 06:44 PM, Igor Fedotov wrote: > > > Below is the patch that seems to fix the issue. > > > > > > The root cause IMO is a conflict between blob map removal at > > > _wctx_finish and enumerating over the same blob_map performed at io > > > completion (_txt_state_proc). > > > > > > The case occurs when the first transaction did the async write to some > > > blob and subsequent transaction overwrite this blob. As a result > > > transaction 1 IO completion may interfere with write finish processing > > > for the second one - trans 1 updates buffer cache and enumerates dirty > > > blobs while trans 2 erases overwritten blob. > > > > > > > > > This patch isn't a final fix, it's rather a way to confirm that the > > > root cause is correct. > > > > > > Works for me but I'd like someone else to verify it fixes the issue too. > > > > > > Proper fix should probably refactor the code and introduce additional > > > container for dirty blobs as suggested by the FIXME at _txt_state_proc. > > > > > > > > > diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc > > > > > > index 823ec14..8571e17 100644 > > > --- a/src/os/bluestore/BlueStore.cc > > > +++ b/src/os/bluestore/BlueStore.cc > > > @@ -4618,6 +4618,7 @@ void BlueStore::_txc_state_proc(TransContext *txc) > > > txc->state = TransContext::STATE_KV_QUEUED; > > > // FIXME: use a per-txc dirty blob list? > > > for (auto& o : txc->onodes) { > > > + std::unique_lock<std::mutex> lck(o->blobmap_lock); > > > for (auto& p : o->blob_map.blob_map) { > > > p.bc.finish_write(txc->seq); > > > } > > > @@ -6137,6 +6138,7 @@ void BlueStore::_wctx_finish( > > > dout(20) << __func__ << " rm blob " << *b << dendl; > > > txc->statfs_delta.compressed() -= > > > b->blob.get_compressed_payload_length(); > > > if (l.blob >= 0) { > > > + std::unique_lock<std::mutex> lck(o->blobmap_lock); > > > o->blob_map.erase(b); > > > } else { > > > o->bnode->blob_map.erase(b); > > > diff --git a/src/os/bluestore/BlueStore.h b/src/os/bluestore/BlueStore.h > > > index f380cc9..7493f18 100644 > > > --- a/src/os/bluestore/BlueStore.h > > > +++ b/src/os/bluestore/BlueStore.h > > > @@ -482,6 +482,8 @@ public: > > > BlobMap blob_map; ///< local blobs (this onode onode) > > > > > > std::mutex flush_lock; ///< protect flush_txns > > > + std::mutex blobmap_lock; > > > + > > > std::condition_variable flush_cond; ///< wait here for unapplied > > > txns > > > set<TransContext*> flush_txns; ///< committing or wal txns > > > > > > > > > > > > On 07.07.2016 11:30, Varada Kari wrote: > > > > One more observation on the same lines as Ma Jianpeng, > > > > > > > > For the same onode i see 17 references and 16 flushing_txns. I think we > > > > are getting parallel txns running on the same object. > > > > (gdb) p *(o.px) > > > > $8 = { > > > > nref = { > > > > _M_i = 17 > > > > }, > > > > oid = { > > > > hobj = { > > > > oid = { > > > > name = "rbd_data.100f6b8b4567.", '0' <repeats 14 times>, "1d" > > > > ... > > > > flush_txns = std::set with 16 elements = { > > > > [0] = 0x7fdbbcb73980, > > > > [1] = 0x7fdbe7dd3400, > > > > [2] = 0x7fdbe7dd3900, > > > > [3] = 0x7fdbf7c3cd00, > > > > [4] = 0x7fdc00f46280, > > > > [5] = 0x7fdc00f46c80, > > > > [6] = 0x7fdc00f47900, > > > > [7] = 0x7fdc00f48080, > > > > [8] = 0x7fdc00f48300, > > > > [9] = 0x7fdc00f48800, > > > > [10] = 0x7fdc00f49200, > > > > [11] = 0x7fdc00f49700, > > > > [12] = 0x7fdc09a63680, > > > > [13] = 0x7fdc0ca11e00, > > > > [14] = 0x7fdc0ca12a80, > > > > [15] = 0x7fdc12c2f480 > > > > > > > > I have written my object with 8K before and overwriting with 64k now, > > > > which is showing this behavior. > > > > But does allowing parallel writes harm if they are writing to different > > > > offset and length and guarantee the order? we can check the overlap and > > > > not allow write if they have a overlap, else we are okay to run parallel > > > > writes to the object. > > > > In this case we are missing some locking while updating the blob_map. > > > > Still not able to figure of the exact location. > > > > > > > > Varada > > > > > > > > > > > > On Thursday 07 July 2016 01:23 PM, Ma, Jianpeng wrote: > > > > > I add some debug messges and the message as followings when bug > > > > > reproduce: > > > > > > > > > > 3305972 -102> 2016-07-07 23:48:12.541535 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) queue_transactions existing > > > > > 0xac78760 osr(0.4f 0xb6d2d80) > > > > > 3305973 -101> 2016-07-07 23:48:12.541548 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_create osr 0xac78760 = > > > > > 0x177b0000 seq 611 > > > > > 3305974 -100> 2016-07-07 23:48:12.541563 7f7f4449c700 15 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _write 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 0x1a0000~10000 > > > > > 3305975 -99> 2016-07-07 23:48:12.541574 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 0x1a0000~10000 - have 0x400000 (4194304) bytes in 64 lextents > > > > > 3305976 -98> 2016-07-07 23:48:12.541581 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write prefer csum_order 12 > > > > > comp_blob_size 0x40000 > > > > > 3305977 -97> 2016-07-07 23:48:12.541608 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write_big 0x1a0000~10000 > > > > > max_blob_len 0x10000 compress 0 > > > > > 3305978 -96> 2016-07-07 23:48:12.541617 7f7f4449c700 20 > > > > > bluestore.BufferSpace(0x12012ea8 in 0xa02f240) _discard 0x0~10000 > > > > > 3305979 -95> 2016-07-07 23:48:12.541626 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write_big lex 0x1a0000: > > > > > 0x0~10000->91 > > > > > 3305980 -94> 2016-07-07 23:48:12.541630 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write_big blob 91:blob([] > > > > > clen 0x0 ref_map(0x0~10000=1)) > > > > > 3305981 -93> 2016-07-07 23:48:12.541634 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_alloc_write txc 0x177b0000 1 > > > > > blobs > > > > > 3305982 -92> 2016-07-07 23:48:12.541638 7f7f4449c700 10 > > > > > bitmapalloc:reserve instance 174395776 num_used 1600847 total 18350080 > > > > > 3305983 -91> 2016-07-07 23:48:12.541641 7f7f4449c700 10 > > > > > bitmapalloc:allocate instance 174395776 want_size 0x10000 alloc_unit > > > > > 0x1000 hint 0x0 > > > > > 3305984 -90> 2016-07-07 23:48:12.541642 7f7f55cbf700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0xedfef00 > > > > > aio_wait > > > > > 3305985 -89> 2016-07-07 23:48:12.541651 7f7f55cbf700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finish_io 0xedfef00 > > > > > 3305986 -88> 2016-07-07 23:48:12.541653 7f7f55cbf700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0xedfef00 > > > > > io_done > > > > > 3305987 -87> 2016-07-07 23:48:12.541656 7f7f55cbf700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_procIO_DONE txc > > > > > 0xedfef00 > > > > > 3305988 -86> 2016-07-07 23:48:12.541659 7f7f55cbf700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc onode > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# 0 size=65 > > > > > 3305989 -85> 2016-07-07 23:48:12.541674 7f7f55cbf700 20 > > > > > bluestore.BufferSpace(0x12012548 in 0xa02f240) finish_write > > > > > buffer(0x177d7480 space 0x12012548 0x0~10000 writing nocache) > > > > > 3305990 -84> 2016-07-07 23:48:12.541680 7f7f55cbf700 20 > > > > > bluestore.BufferSpace(0x12012ea8 in 0xa02f240) finish_write > > > > > buffer(0x177d75c0 space 0x12012ea8 0x0~10000 writing nocache) > > > > > 3305991 -83> 2016-07-07 23:48:12.541684 7f7f55cbf700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc 2 > > > > > 3305992 -82> 2016-07-07 23:48:12.541715 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _kv_sync_thread wake > > > > > 3305993 -81> 2016-07-07 23:48:12.541742 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _kv_sync_thread committing 1 > > > > > cleaning 0 > > > > > 3305994 -80> 2016-07-07 23:48:12.541747 7f7f4ccad700 10 > > > > > bitmapalloc:commit_start instance 174395776 releasing 65536 in extents > > > > > 1 > > > > > 3305995 -79> 2016-07-07 23:48:12.541752 7f7f4449c700 20 > > > > > bitmapalloc:allocate instance 174395776 offset 0xc0520000 length > > > > > 0x10000 > > > > > 3305996 -78> 2016-07-07 23:48:12.541757 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_alloc_write blob > > > > > 91:blob([0xc0520000~10000] clen 0x0 mutable ref_map(0x0~10000=1)) > > > > > csum_order 12 csum_length 0x 10000 > > > > > 3305997 -77> 2016-07-07 23:48:12.541784 7f7f4449c700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish onode > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 3305998 -76> 2016-07-07 23:48:12.541795 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish lex_old > > > > > [0x0~10000->27] > > > > > 3305999 -75> 2016-07-07 23:48:12.541801 7f7f4449c700 20 > > > > > bluestore.BufferSpace(0xd2c7c98 in 0xa02f240) _discard 0x0~10000 > > > > > 3306000 -74> 2016-07-07 23:48:12.541804 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish release > > > > > 0xc06d0000~10000 > > > > > 3306001 -73> 2016-07-07 23:48:12.541807 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish rm blob > > > > > 27:blob([!~10000] clen 0x0 mutable+csum crc32c/0x1000) > > > > > 3306002 -72> 2016-07-07 23:48:12.541818 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _write 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 0x1a0000~10000 = 0 > > > > > 3306003 -71> 2016-07-07 23:48:12.541829 7f7f4449c700 15 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _setattrs 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# 2 keys > > > > > 3306004 -70> 2016-07-07 23:48:12.541836 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _setattrs 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# 2 keys = 0 > > > > > 3306005 -69> 2016-07-07 23:48:12.541844 7f7f4449c700 15 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _omap_setkeys 0.4f_head > > > > > #0:f2000000::::head# > > > > > 3306006 -68> 2016-07-07 23:48:12.541856 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _omap_setkeys 0.4f_head > > > > > #0:f2000000::::head# = 0 > > > > > 3306007 -67> 2016-07-07 23:48:12.541855 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finalize_kv txc 0xedfef00 > > > > > allocated 0x[c0510000~10000] released 0x[c06c0000~10000] > > > > > 3306008 -66> 2016-07-07 23:48:12.541863 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_write_nodes txc 0x177b0000 > > > > > onodes 0xa662000 bnodes > > > > > 3306009 -65> 2016-07-07 23:48:12.541864 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finalize_kv overlap 0x[], > > > > > new allocated 0x[c0510000~10000] released 0x[c06c0000~10000] > > > > > 3306010 -64> 2016-07-07 23:48:12.541868 7f7f4ccad700 10 freelist > > > > > allocate 0xc0510000~10000 > > > > > 3306011 -63> 2016-07-07 23:48:12.541870 7f7f4ccad700 20 freelist > > > > > _xor first_key 0xc0500000 last_key 0xc0500000 > > > > > 3306012 -62> 2016-07-07 23:48:12.541881 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finalize_kv release > > > > > 0xc06c0000~10000 > > > > > 3306013 -61> 2016-07-07 23:48:12.541884 7f7f4ccad700 10 freelist > > > > > release 0xc06c0000~10000 > > > > > 3306014 -60> 2016-07-07 23:48:12.541886 7f7f4ccad700 20 freelist > > > > > _xor first_key 0xc0680000 last_key 0xc0680000 > > > > > 3306015 -59> 2016-07-07 23:48:12.541890 7f7f4ccad700 10 > > > > > bitmapalloc:release 0xc06c0000~10000 > > > > > 3306016 -58> 2016-07-07 23:48:12.541954 7f7f4ccad700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _balance_bluefs_freespace bluefs > > > > > 2868 M free (0.999654) bluestore 65426 M free (0.91276), bluefs_ratio > > > > > 0.0419945 > > > > > 3306017 -57> 2016-07-07 23:48:12.541967 7f7f4449c700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_write_nodes onode > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# is 6551 > > > > > 3306018 -56> 2016-07-07 23:48:12.542026 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0x177b0000 > > > > > prepare > > > > > 3306019 -55> 2016-07-07 23:48:12.542031 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _kv_sync_thread committed 1 > > > > > cleaned 0 in 0.000284 > > > > > 3306020 -54> 2016-07-07 23:48:12.542037 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_aio_submit txc 0x177b0000 > > > > > 3306021 -53> 2016-07-07 23:48:12.542037 7f7f4ccad700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0xedfef00 > > > > > kv_queued > > > > > 3306022 -52> 2016-07-07 23:48:12.542044 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finish_kv txc 0xedfef00 > > > > > 3306023 -51> 2016-07-07 23:48:12.542054 7f7f4ccad700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0xedfef00 > > > > > finishing > > > > > 3306024 -50> 2016-07-07 23:48:12.542057 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finish 0xedfef00 onodes > > > > > 0xa662000 > > > > > 3306025 -49> 2016-07-07 23:48:12.542059 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finish onode 0xa662000 had > > > > > 0xedfef00,0x177b0000 > > > > > 3306026 -48> 2016-07-07 23:48:12.542066 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _osr_reap_done osr 0xac78760 > > > > > 3306027 -47> 2016-07-07 23:48:12.542068 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _osr_reap_done txc 0xedfef00 done > > > > > 3306028 -46> 2016-07-07 23:48:12.542074 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _osr_reap_done txc 0x177b0000 > > > > > aio_wait > > > > > 3306029 -45> 2016-07-07 23:48:12.542077 7f7f4ccad700 20 > > > > > bluestore.2QCache(0xa02f240) trim onodes 140 / 16384 buffers 0 / > > > > > 536870912 > > > > > 3306030 -44> 2016-07-07 23:48:12.542081 7f7f4ccad700 10 > > > > > bitmapalloc:commit_finish instance 174395776 released 65536 in extents > > > > > 1 > > > > > 3306031 -43> 2016-07-07 23:48:12.542084 7f7f4ccad700 20 > > > > > bitmapalloc:insert_free instance 174395776 off 0xc06b0000 len 0x10000 > > > > > 3306032 -42> 2016-07-07 23:48:12.542090 7f7f4ccad700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _reap_collections all reaped > > > > > 3306033 -41> 2016-07-07 23:48:12.542093 7f7f4ccad700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _kv_sync_thread sleep > > > > > 3306034 -40> 2016-07-07 23:48:12.542110 7f7f4449c700 5 -- op > > > > > tracker -- seq: 57709, time: 2016-07-07 23:48:12.542109, event: > > > > > reached_pg, op: osd_op(client.4112.0:176011 0.de0d304f (undecoded) > > > > > ack+ondisk+wri te+known_if_redirected e19) > > > > > 3306035 -39> 2016-07-07 23:48:12.542168 7f7f4449c700 5 -- op > > > > > tracker -- seq: 57709, time: 2016-07-07 23:48:12.542167, event: > > > > > started, op: osd_op(client.4112.0:176011 0.de0d304f > > > > > rbd_data.100d2ae8944a.0000000 000000095 [write 1769472~65536] > > > > > snapc 0=[] ack+ondisk+write+known_if_redirected e19) > > > > > 3306036 -38> 2016-07-07 23:48:12.542248 7f7f4449c700 5 write_log > > > > > with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, > > > > > dirty_divergent_priors: false, divergent_priors: 0, writeout_from: > > > > > 19'604, trimmed: > > > > > 3306037 -37> 2016-07-07 23:48:12.542276 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) queue_transactions existing > > > > > 0xac78760 osr(0.4f 0xb6d2d80) > > > > > 3306038 -36> 2016-07-07 23:48:12.542288 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_create osr 0xac78760 = > > > > > 0x1252f980 seq 612 > > > > > 3306039 -35> 2016-07-07 23:48:12.542303 7f7f4449c700 15 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _write 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 0x1b0000~10000 > > > > > 3306040 -34> 2016-07-07 23:48:12.542315 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 0x1b0000~10000 - have 0x400000 (4194304) bytes in 64 lextents > > > > > 3306041 -33> 2016-07-07 23:48:12.542321 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write prefer csum_order 12 > > > > > comp_blob_size 0x40000 > > > > > 3306042 -32> 2016-07-07 23:48:12.542324 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write_big 0x1b0000~10000 > > > > > max_blob_len 0x10000 compress 0 > > > > > 3306043 -31> 2016-07-07 23:48:12.542328 7f7f4449c700 20 > > > > > bluestore.BufferSpace(0x12013358 in 0xa02f240) _discard 0x0~10000 > > > > > 3306044 -30> 2016-07-07 23:48:12.542336 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write_big lex 0x1b0000: > > > > > 0x0~10000->92 > > > > > 3306045 -29> 2016-07-07 23:48:12.542339 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_write_big blob 92:blob([] > > > > > clen 0x0 ref_map(0x0~10000=1)) > > > > > 3306046 -28> 2016-07-07 23:48:12.542344 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_alloc_write txc 0x1252f980 1 > > > > > blobs > > > > > 3306047 -27> 2016-07-07 23:48:12.542346 7f7f4449c700 10 > > > > > bitmapalloc:reserve instance 174395776 num_used 1600847 total 18350080 > > > > > 3306048 -26> 2016-07-07 23:48:12.542350 7f7f4449c700 10 > > > > > bitmapalloc:allocate instance 174395776 want_size 0x10000 alloc_unit > > > > > 0x1000 hint 0x0 > > > > > 3306049 -25> 2016-07-07 23:48:12.542442 7f7f4449c700 20 > > > > > bitmapalloc:allocate instance 174395776 offset 0xc0530000 length > > > > > 0x10000 > > > > > 3306050 -24> 2016-07-07 23:48:12.542455 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _do_alloc_write blob > > > > > 92:blob([0xc0530000~10000] clen 0x0 mutable ref_map(0x0~10000=1)) > > > > > csum_order 12 csum_length 0x 10000 > > > > > 3306051 -23> 2016-07-07 23:48:12.542437 7f7f55cbf700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0x177b0000 > > > > > aio_wait > > > > > 3306052 -22> 2016-07-07 23:48:12.542482 7f7f4449c700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish onode > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 3306053 -21> 2016-07-07 23:48:12.542488 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish lex_old > > > > > [0x0~10000->28] > > > > > 3306054 -20> 2016-07-07 23:48:12.542485 7f7f55cbf700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_finish_io 0x177b0000 > > > > > 3306055 -19> 2016-07-07 23:48:12.542492 7f7f55cbf700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0x177b0000 > > > > > io_done > > > > > 3306056 -18> 2016-07-07 23:48:12.542493 7f7f4449c700 20 > > > > > bluestore.BufferSpace(0xd661088 in 0xa02f240) _discard 0x0~10000 > > > > > 3306057 -17> 2016-07-07 23:48:12.542495 7f7f55cbf700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_procIO_DONE txc > > > > > 0x177b0000 > > > > > 3306058 -16> 2016-07-07 23:48:12.542496 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish release > > > > > 0xc06e0000~10000 > > > > > 3306059 -15> 2016-07-07 23:48:12.542500 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _wctx_finish rm blob > > > > > 28:blob([!~10000] clen 0x0 mutable+csum crc32c/0x1000) > > > > > 3306060 -14> 2016-07-07 23:48:12.542499 7f7f55cbf700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc onode > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# 0 size=65 > > > > > 3306061 -13> 2016-07-07 23:48:12.542510 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _write 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# > > > > > 0x1b0000~10000 = 0 > > > > > 3306062 -12> 2016-07-07 23:48:12.542520 7f7f4449c700 15 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _setattrs 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# 2 keys > > > > > 3306063 -11> 2016-07-07 23:48:12.542527 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _setattrs 0.4f_head > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# 2 keys = 0 > > > > > 3306064 -10> 2016-07-07 23:48:12.542535 7f7f4449c700 15 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _omap_setkeys 0.4f_head > > > > > #0:f2000000::::head# > > > > > 3306065 -9> 2016-07-07 23:48:12.542547 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _omap_setkeys 0.4f_head > > > > > #0:f2000000::::head# = 0 > > > > > 3306066 -8> 2016-07-07 23:48:12.542553 7f7f4449c700 20 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_write_nodes txc 0x1252f980 > > > > > onodes 0xa662000 bnodes > > > > > 3306067 -7> 2016-07-07 23:48:12.542656 7f7f4449c700 0 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_write_nodes onode > > > > > #0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head# is 6552 > > > > > 3306068 -6> 2016-07-07 23:48:12.542703 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_state_proc txc 0x1252f980 > > > > > prepare > > > > > 3306069 -5> 2016-07-07 23:48:12.542712 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) _txc_aio_submit txc 0x1252f980 > > > > > 3306070 -4> 2016-07-07 23:48:12.542800 7f7f4449c700 5 -- op > > > > > tracker -- seq: 57710, time: 2016-07-07 23:48:12.542800, event: > > > > > reached_pg, op: osd_op(client.4112.0:176012 0.de0d304f (undecoded) > > > > > ack+ondisk+wri te+known_if_redirected e19) > > > > > 3306071 -3> 2016-07-07 23:48:12.542860 7f7f4449c700 5 -- op > > > > > tracker -- seq: 57710, time: 2016-07-07 23:48:12.542859, event: > > > > > started, op: osd_op(client.4112.0:176012 0.de0d304f > > > > > rbd_data.100d2ae8944a.0000000 000000095 [write 1835008~65536] > > > > > snapc 0=[] ack+ondisk+write+known_if_redirected e19) > > > > > 3306072 -2> 2016-07-07 23:48:12.542940 7f7f4449c700 5 write_log > > > > > with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, > > > > > dirty_divergent_priors: false, divergent_priors: 0, writeout_from: > > > > > 19'605, trimmed: > > > > > 3306073 -1> 2016-07-07 23:48:12.542979 7f7f4449c700 10 > > > > > bluestore(/mnt/ceph/build/dev/osd2) queue_transactions existing > > > > > 0xac78760 osr(0.4f 0xb6d2d80) > > > > > 3306074 0> 2016-07-07 23:48:12.547121 7f7f55cbf700 -1 *** Caught > > > > > signal (Segmentation fault) ** > > > > > 3306075 in thread 7f7f55cbf700 thread_name:bstore_aio > > > > > 3306076 > > > > > 3306077 ceph version v11.0.0-373-g7dedce8 > > > > > (7dedce8a6e0474d4d14f9aa509428bab84d075af) > > > > > 3306078 1: /mnt/ceph/build/bin/ceph-osd() [0xc329ae] > > > > > 3306079 2: (()+0x10a00) [0x7f7f62923a00] > > > > > 3306080 3: > > > > > (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x8cb) > > > > > [0xb4798b] > > > > > 3306081 4: > > > > > (BlueStore::_txc_finish_io(BlueStore::TransContext*)+0x1bf) [0xb4801f] > > > > > 3306082 5: > > > > > (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x27f) > > > > > [0xb4733f] > > > > > 3306083 6: (KernelDevice::_aio_thread()+0x912) [0xc1a872] > > > > > 3306084 7: (KernelDevice::AioCompletionThread::entry()+0xd) > > > > > [0xc1d81d] > > > > > 3306085 8: (()+0x761a) [0x7f7f6291a61a] > > > > > 3306086 9: (clone()+0x6d) [0x7f7f6059959d] > > > > > 3306087 NOTE: a copy of the executable, or `objdump -rdS > > > > > <executable>` is needed to interpret this. > > > > > > > > > > > > > > > the previous TransContext (3306060)in STATE_IO_DONE and current > > > > > TransaContext call _write(3306039 ). And they handle the same > > > > > object(0:f20cb07b:::rbd_data.100d2ae8944a.0000000000000095:head). So > > > > > the blob_map modify be current TransContext. I always thought: > > > > > objectstore don't allow the two or more write op for same object. > > > > > ObjectContext:: ondisk_write_lock will block the later write op until > > > > > objectstore call on_commit/on_readable Context. > > > > > But in fact, ondisk_write_lock don't prevent other write op. > > > > > > > > > > I think the simple method is add o->flush(). > > > > > Or am I missing something? > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx > > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy > > > > > Sent: Wednesday, July 6, 2016 10:48 PM > > > > > To: Mark Nelson <mnelson@xxxxxxxxxx>; Varada Kari > > > > > <Varada.Kari@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > > > > > Subject: RE: segfault in bluestore during random writes > > > > > > > > > > I got it ~6/23 , so we need to look for commits before that.. > > > > > > > > > > -----Original Message----- > > > > > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx] > > > > > Sent: Wednesday, July 06, 2016 7:45 AM > > > > > To: Somnath Roy; Varada Kari; ceph-devel > > > > > Subject: Re: segfault in bluestore during random writes > > > > > > > > > > Ah, sorry I missed it. Do you remember when you first encountered it? > > > > > Might help track down the commit where it first appeared. > > > > > > > > > > Mark > > > > > > > > > > On 07/06/2016 09:32 AM, Somnath Roy wrote: > > > > > > Mark, > > > > > > This is the same trace I got (and posted in community) while you > > > > > > were in vacation. I can still reproduce this in my setup easily. > > > > > > > > > > > > Thanks & Regards > > > > > > Somnath > > > > > > > > > > > > -----Original Message----- > > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx > > > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson > > > > > > Sent: Wednesday, July 06, 2016 7:02 AM > > > > > > To: Varada Kari; ceph-devel > > > > > > Subject: Re: segfault in bluestore during random writes > > > > > > > > > > > > Oh, sorry, this wasn't memdbstore, this was trim cache on reads > > > > > > which got applied earlier this week. I was testing 06178de9 with > > > > > > the trim cache on reads PR applied. > > > > > > > > > > > > Mark > > > > > > > > > > > > On 07/06/2016 07:00 AM, Mark Nelson wrote: > > > > > > > Hi Varada, > > > > > > > > > > > > > > Ah, this version happens to be with the memstoredb patch applied, > > > > > > > though it is unused in this test. I don't believe this should be > > > > > > > causing the problem, but I'll verify it is happening without that > > > > > > > patch as well. > > > > > > > > > > > > > > Mark > > > > > > > > > > > > > > On 07/06/2016 01:45 AM, Varada Kari wrote: > > > > > > > > Hi mark, > > > > > > > > > > > > > > > > I have tried reproducing the issue with latest master, so far no > > > > > > > > luck with 128k and 512k block sizes on a 1000GB rbd image. > > > > > > > > I couldn't find #5e07e5a3446ee39189c5b959be624411561175f5 in > > > > > > > > master, > > > > > > > > are you using any private changes? > > > > > > > > > > > > > > > > I will try some more time today to reproduce the issue. Seems > > > > > > > > blob > > > > > > > > map got corrupted in your case. > > > > > > > > > > > > > > > > Varada > > > > > > > > > > > > > > > > On Tuesday 05 July 2016 09:06 PM, Mark Nelson wrote: > > > > > > > > > Hi Guys, > > > > > > > > > > > > > > > > > > This is primarily for Varada but if anyone else wants to take > > > > > > > > > a > > > > > > > > > look feel free. We are hitting segfaults in tc_state_proc > > > > > > > > > during > > > > > > > > > random writes. I was able to reproduce with logging enabled > > > > > > > > > and > > > > > > > > > got a core dump. I've only briefly looked at the code/logs, > > > > > > > > > but I > > > > > > > > > haven't really looked through it in earnest yet. > > > > > > > > > Unfortunately the > > > > > > > > > logs are about 4GB and the core dump is 2GB, so I'll only > > > > > > > > > attach the last 10k lines. > > > > > > > > > Hopefully it should be fairly reproducible. I'll try to see > > > > > > > > > if I > > > > > > > > > can narrow down commits that have touched the write path over > > > > > > > > > the > > > > > > > > > last 4 weeks and see if anything looks relevant. > > > > > > > > > > > > > > > > > > Varada: let me know if you can reproduce. Thanks! > > > > > > > > > > > > > > > > > > gdb bt: > > > > > > > > > > > > > > > > > > > (gdb) bt > > > > > > > > > > #0 0x00007f9c59708fcb in raise (sig=11) at > > > > > > > > > > ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37 > > > > > > > > > > #1 0x0000000000bd7505 in reraise_fatal (signum=11) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/global/signal_handler.cc:69 > > > > > > > > > > #2 handle_fatal_signal (signum=11) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/global/signal_handler.cc:133 > > > > > > > > > > #3 <signal handler called> > > > > > > > > > > #4 next_node (node=<synthetic pointer>) at > > > > > > > > > > /usr/include/boost/intrusive/detail/tree_algorithms.hpp:446 > > > > > > > > > > #5 next_node (p=<synthetic pointer>) at > > > > > > > > > > /usr/include/boost/intrusive/rbtree_algorithms.hpp:352 > > > > > > > > > > #6 operator++ (this=<synthetic pointer>) at > > > > > > > > > > /usr/include/boost/intrusive/detail/tree_node.hpp:128 > > > > > > > > > > #7 BlueStore::_txc_state_proc (this=this@entry=0xaff6000, > > > > > > > > > > txc=txc@entry=0x192c8c80) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueStore.cc:4568 > > > > > > > > > > #8 0x0000000000aeddb7 in BlueStore::_txc_finish_io > > > > > > > > > > (this=this@entry=0xaff6000, txc=txc@entry=0x192c8c80) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueStore.cc:4669 > > > > > > > > > > #9 0x0000000000aed23f in BlueStore::_txc_state_proc > > > > > > > > > > (this=0xaff6000, txc=0x192c8c80) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueStore.cc:4559 > > > > > > > > > > #10 0x0000000000bc134a in KernelDevice::_aio_thread > > > > > > > > > > (this=0xaf20960) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/os/bluestore/KernelDevice.cc:270 > > > > > > > > > > #11 0x0000000000bc588d in > > > > > > > > > > KernelDevice::AioCompletionThread::entry > > > > > > > > > > (this=<optimized out>) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/os/bluestore/KernelDevice.h:49 > > > > > > > > > > #12 0x00007f9c59701dc5 in start_thread (arg=0x7f9c4da2e700) > > > > > > > > > > at > > > > > > > > > > pthread_create.c:308 > > > > > > > > > > #13 0x00007f9c575fc28d in clone () at > > > > > > > > > > ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 > > > > > > > > > > (gdb) frame 7 > > > > > > > > > > #7 BlueStore::_txc_state_proc (this=this@entry=0xaff6000, > > > > > > > > > > txc=txc@entry=0x192c8c80) at > > > > > > > > > > /home/ubuntu/src/markhpc/ceph/src/os/bluestore/BlueStore.cc:4568 > > > > > > > > > > 4568for (auto& p : o->blob_map.blob_map) { > > > > > > > > > Mark > > > > > > > > PLEASE NOTE: The information contained in this electronic mail > > > > > > > > message is intended only for the use of the designated > > > > > > > > recipient(s) > > > > > > > > named above. If the reader of this message is not the intended > > > > > > > > recipient, you are hereby notified that you have received this > > > > > > > > message in error and that any review, dissemination, > > > > > > > > distribution, > > > > > > > > or copying of this message is strictly prohibited. If you have > > > > > > > > received this communication in error, please notify the sender > > > > > > > > by > > > > > > > > telephone or e-mail (as shown above) immediately and destroy any > > > > > > > > and > > > > > > > > all copies of this message in your possession (whether hard > > > > > > > > copies > > > > > > > > or electronically stored copies). > > > > > > > > > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > > ceph-devel" > > > > > > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More > > > > > > > majordomo > > > > > > > info at http://vger.kernel.org/majordomo-info.html > > > > > > -- > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > ceph-devel" > > > > > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > > > > > > info at http://vger.kernel.org/majordomo-info.html > > > > > > PLEASE NOTE: The information contained in this electronic mail > > > > > > message is intended only for the use of the designated recipient(s) > > > > > > named above. If the reader of this message is not the intended > > > > > > recipient, you are hereby notified that you have received this > > > > > > message in error and that any review, dissemination, distribution, > > > > > > or copying of this message is strictly prohibited. If you have > > > > > > received this communication in error, please notify the sender by > > > > > > telephone or e-mail (as shown above) immediately and destroy any and > > > > > > all copies of this message in your possession (whether hard copies > > > > > > or electronically stored copies). > > > > > > > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > > > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > > > > > info at http://vger.kernel.org/majordomo-info.html > > > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html