On 11/26/2012 10:19 AM, Yan, Zheng wrote: > On 11/26/2012 06:32 AM, Sage Weil wrote: >> I pushed an alternative approach to wip-tmap. >> >> This sorting is an artifact of tmap's crummy implementation, and the mds >> workaround will need to get reverted when we switch to omap. Instead, fix >> tmap so that it will tolerate unsorted keys. (Also, drop the ENOENT on rm >> on missing key.) >> >> Eventually we can deprecate and remove tmap entirely... >> >> What do you think? > > This approach is cleaner than mine. But I think your fix isn't enough because > MDS may provide tmap contains misordered items to the TMAPPUT method. Misordered > items will confuse future TMAPUP. This fix is either sorting items when handling > TMAPPUT or searching forward for any potential misordered items when TMAP_SET > wants to add a new item or TMAP_RM fails to find an item. > how about the patch attached below ----- >From e3c4fb68dc6c7592b7f53ab7a98b561167b567df Mon Sep 17 00:00:00 2001 From: "Yan, Zheng" <zheng.z.yan@xxxxxxxxx> Date: Mon, 26 Nov 2012 12:28:30 +0800 Subject: [PATCH] osd: check misordered items in TMAP There is a bug in the MDS that causes misordered items in TMAPs that store dir fragments. Misordered items confuse TMAP updates. Fix this by adding code to do_tmapup() to check if there are misordered items that may affect the correctness of TMAP update. Fall back to use do_tmapup_slow if misordered items are found. Signed-off-by: Yan, Zheng <zheng.z.yan@xxxxxxxxx> --- src/osd/ReplicatedPG.cc | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index 48cba10..71f5363 100644 --- a/src/osd/ReplicatedPG.cc +++ b/src/osd/ReplicatedPG.cc @@ -1803,8 +1803,17 @@ int ReplicatedPG::do_tmapup(OpContext *ctx, bufferlist::iterator& bp, OSDOp& osd nkeys--; } if (!ip.end()) { + string last_key = nextkey; + ::decode(nextkey, ip); ::decode(nextval, ip); + + if (nextkey <= last_key) { + dout(0) << "tmapup warning: key '" << key << "' < previous key '" << last_key + << "', falling back to an inefficient (unsorted) update" << dendl; + bp = orig_bp; + return do_tmapup_slow(ctx, bp, osd_op, newop.outdata); + } } else { have_next = false; } @@ -1848,6 +1857,35 @@ int ReplicatedPG::do_tmapup(OpContext *ctx, bufferlist::iterator& bp, OSDOp& osd ::encode(nextkey, newkeydata); ::encode(nextval, newkeydata); dout(20) << " keep " << nextkey << " " << nextval.length() << dendl; + + /* + * TMAPs for storing dir fragments may contain misordered items. + * Only items corresponding to dentries that have the same name + * prefix can be out of order. + */ + size_t lastlen = nextkey.find_last_of('_'); + if (lastlen > 0 && lastlen != string::npos) { + string last_key = nextkey; + bufferlist::iterator tmp_ip = ip; + while (!tmp_ip.end()) { + ::decode(nextkey, tmp_ip); + ::decode(nextval, tmp_ip); + + if (nextkey <= last_key) { + dout(0) << "tmapup warning: key '" << nextkey << "' < previous key '" << last_key + << "', falling back to an inefficient (unsorted) update" << dendl; + bp = orig_bp; + return do_tmapup_slow(ctx, bp, osd_op, newop.outdata); + } + + size_t len = nextkey.find_last_of('_'); + if (len == 0 || len == string::npos) + break; + len = min(len, lastlen); + if (last_key.compare(0, len, nextkey, 0, len) < 0) + break; + } + } } if (!ip.end()) { bufferlist rest; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html