hi,
I think this is the RCA for the issue:
Basically, with distributed ec as the cold tier and distributed replicate
as the hot tier: tier sends a lookup which fails on ec (by this time the
dict already contains ec xattrs). After this, the lookup_everywhere code
path is hit in tier, which triggers a lookup on each of distribute's
hashed subvolumes; these fail too, which leads to the cold and hot dht's
lookup_everywhere running in two parallel epoll threads. When ec's thread
tries to set trusted.ec.version/dirty/size in the dictionary, the older
values against the same keys get erased. While this erasing is going on,
if the thread doing the lookup on afr's subvolume accesses these members,
either in dict_copy_with_ref or in the client xlator while serializing,
that can lead to either a crash or a hang, depending on when the
spin/mutex lock is called on invalid memory.
For the moment I sent http://review.gluster.org/13680 (I am pressed for
time because I need to provide a build for our customer with a fix),
which avoids the parallel accesses that step on each other.
Raghavendra G and I discussed this problem, and the right way to fix it
is for dict_foreach to take a copy of the dictionary (without using
dict_foreach itself) inside the lock and then loop over that local copy.
I am worried about the performance implications of this, so I am
wondering if anyone has a better idea.
I have also included Xavi, who earlier said we need to change dict.c but
that it is a bigger change. Maybe the time has come? I would love to
gather all your inputs and implement a better version of dict if we need
one.
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel