question about futex calls in OSD

Hi,

This is Xing, a graduate student at the University of Utah. I am playing with Ceph and have a few questions about the futex calls made by the OSD process. I am using Ceph v0.79. The data node machine has 7 disks: one for the OS and the other 6 for running 6 OSDs. I set replica size = 1 for all three pools to improve sequential read bandwidth. Since Ceph does replication synchronously, with replica size = 2 and two OSDs it lays out data in a fashion similar to RAID 10 with the "near" format: each OSD stores an identical copy of the whole data set, but on read-back only half of the blocks are read from each OSD, i.e. it reads one block, seeks over the next, and then reads the third. Ideally, to get full disk bandwidth, data should be laid out like the "far" format in RAID 10. I do not know how to do that in Ceph, so I simply tried replica size = 1. I created an rbd block device, initialized it with an ext4 fs, stored a 10 GB file into it, and then read it back sequentially. I set read_ahead to 128 MB or 1 GB to get the maximum bandwidth for a single rbd block device (I know this is not realistic).

While these read requests were being served, I used strace to capture all system calls issued by an OSD thread that actually performs the reads. I observed a large number of futex system calls. With -T, strace reports the time spent in each system call, and I summed these to get the total time per syscall (a small parsing sketch follows the table below). I also broke down the two futex variants that contribute most of the futex overhead.
----------
syscall      runtime (s)   call count   average (s)
pread        11.1028        420         0.026435
fgetxattr     0.178381     1680         0.000106
futex         8.83125      5217
total runtime: 21 s

                             runtime (s)   call count   average (s)
futex(WAIT_PRIVATE)          4.97166       1415         0.003513
futex(WAIT_BITSET_PRIVATE)   3.79            51         0.0743
----------
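
For reference, here is roughly how I produced the totals above. This is only a minimal parsing sketch, assuming strace was run with -T so each completed syscall line ends with the elapsed time in angle brackets; it is not part of Ceph, and the heuristics (skipping unfinished/resumed fragments) are deliberately simple.

// sum_strace.cc: aggregate per-syscall time from "strace -T" output.
// Build: g++ -std=c++11 -O2 -o sum_strace sum_strace.cc
// Usage: ./sum_strace < osd-read-thread.strace
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>
#include <utility>

int main() {
  // syscall name -> (call count, total seconds)
  std::map<std::string, std::pair<long, double> > stats;
  std::string line;
  while (std::getline(std::cin, line)) {
    // A complete record looks like:
    //   pread(23, "..."..., 524288, 1048576) = 524288 <0.026435>
    std::size_t paren = line.find('(');
    std::size_t lt = line.rfind('<');
    std::size_t gt = line.rfind('>');
    if (paren == std::string::npos || paren == 0 ||
        lt == std::string::npos || gt == std::string::npos || lt > gt)
      continue;
    std::string name = line.substr(0, paren);
    if (name.find(' ') != std::string::npos)
      continue;  // e.g. "<... futex resumed>" fragments
    std::string t = line.substr(lt + 1, gt - lt - 1);
    if (t.empty() || !std::isdigit(static_cast<unsigned char>(t[0])))
      continue;  // e.g. "<unfinished ...>" fragments
    stats[name].first += 1;
    stats[name].second += std::atof(t.c_str());
  }
  for (const auto &s : stats)
    std::cout << s.first << "\t" << s.second.second << " s\t"
              << s.second.first << " calls\t"
              << s.second.second / s.second.first << " s/call\n";
  return 0;
}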

The locking overhead seems quite high, and I imagine it could get worse as I increase the number of workloads. I was wondering why there are so many futex calls and took some time to look into it. It appears to me that three locks are used on the read path in the OSD. They are Mutexes, and I believe the futex calls I observed with strace are the result of operations on these Mutexes (a small standalone example of this Mutex-to-futex mapping follows the list below).

a. snapset_contexts_lock, used in functions such as get_snapset_context() and put_snapset_context()

b. fdcache_lock, used in lfn_open() and such

c. ondisk_read_lock, used in execute_ctx(). 
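
To convince myself that contended Mutexes are what show up as futex calls, I straced a toy program like the one below. It has nothing to do with Ceph; it only shows that on Linux a pthread-based mutex (Ceph's Mutex wraps pthread_mutex_t, and so does std::mutex here) stays in user space while uncontended and falls back to futex(FUTEX_WAIT_PRIVATE) when threads actually collide.

// futex_demo.cc: two threads hammering one mutex.
// Build: g++ -std=c++11 -O2 -pthread -o futex_demo futex_demo.cc
// Run:   strace -f -c ./futex_demo      (futex should dominate the summary)
#include <mutex>
#include <thread>

std::mutex m;        // thin wrapper around pthread_mutex_t on Linux
long counter = 0;

static void worker() {
  for (int i = 0; i < 1000000; ++i) {
    std::lock_guard<std::mutex> g(m);  // contended acquire -> futex(FUTEX_WAIT_PRIVATE)
    ++counter;
  }
}

int main() {
  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  return 0;
}

Running the same program with a single thread makes the futex calls all but disappear, so I read the counts in my OSD trace as contention rather than the mere presence of the locks. (I believe the WAIT_BITSET_PRIVATE variant typically comes from timed condition-variable waits rather than plain Mutex locking, but I have not tracked down which wait site in the OSD produces it.)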

The one that matters most is snapset_contexts_lock: it looks like a single big lock controlling access to all object files in a snapset, or to all object files within the same placement group that belong to the same snapset (I am not sure exactly what a 'snapset' is; it sounds equivalent to a 'snapshot'). To read a block from an object file, the OSD first has to read two extended attributes of that file, OI_ATTR and SS_ATTR, and each of these reads seems to involve snapset_contexts_lock: SS_ATTR is read inside get_snapset_context(), and OI_ATTR is read in get_object_context(), which can be called from find_object_context(). find_object_context() itself also calls get_snapset_context() in a few places. Releasing a snapset context takes snapset_contexts_lock as well (in put_snapset_context()).
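
To make question 1 below concrete, here is the pattern I think I am looking at, boiled down to a few lines. This is only my reading of ReplicatedPG in v0.79, heavily simplified; the types, key type, and member names are placeholders rather than the real declarations. The point is just that every get/put of a snapset context, and therefore every object read, serializes on one shared Mutex.

// NOT the real Ceph code: a stripped-down sketch of the pattern as I read it.
#include <map>
#include <mutex>
#include <string>

struct SnapSetContext {
  int ref;                 // how many in-flight ops hold this context
  std::string snapset;     // the decoded SS_ATTR would live here
  SnapSetContext() : ref(0) {}
};

class PGSketch {
  std::mutex snapset_contexts_lock;                          // the lock in question
  std::map<std::string, SnapSetContext*> snapset_contexts;   // keyed by object id

 public:
  // Called (directly or via find_object_context) on every read.
  SnapSetContext *get_snapset_context(const std::string &oid) {
    std::lock_guard<std::mutex> l(snapset_contexts_lock);
    auto it = snapset_contexts.find(oid);
    if (it != snapset_contexts.end()) {
      ++it->second->ref;
      return it->second;
    }
    // Cache miss: this is roughly where the SS_ATTR xattr gets read and decoded.
    SnapSetContext *ssc = new SnapSetContext;
    ssc->ref = 1;
    snapset_contexts[oid] = ssc;
    return ssc;
  }

  // Called when the read finishes.
  void put_snapset_context(const std::string &oid, SnapSetContext *ssc) {
    std::lock_guard<std::mutex> l(snapset_contexts_lock);
    if (--ssc->ref == 0) {
      snapset_contexts.erase(oid);
      delete ssc;
    }
  }
};

If that reading is right, every read pays at least two trips through this single lock (one get and one put), which seems consistent with the futex(WAIT_PRIVATE) counts above once several OSD threads serve reads concurrently.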

Here are a few questions.
1. What is snapset_contexts_lock used for? Is it meant to control access to all files in a snapset, or to all files in the same placement group that belong to the same snapset? Such a big-lock design does not seem to scale. Any comments?
2. Has anyone else noticed this locking overhead? I tried commenting out the lock in get_snapset_context() and put_snapset_context() (by commenting out the Mutex instantiation statements), but that breaks the system: I could no longer list or mount rbd block devices.

Thanks very much for reading this. Any comment is welcome. 
Thanks,
Xing



