+1 for a "no-rewinddir-support" option in DHT.

We are seeing very slow directory listings, especially on a volume with
1500+ bricks: 'ls' takes 20+ seconds with 1000+ files.

On Wed, Nov 2, 2016 at 7:08 AM, Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:
>
> ----- Original Message -----
>> From: "Keiviw" <keiviw@xxxxxxx>
>> To: gluster-devel@xxxxxxxxxxx
>> Sent: Tuesday, November 1, 2016 12:41:02 PM
>> Subject: [Gluster-devel] A question of GlusterFS dentries!
>>
>> Hi,
>> In GlusterFS distributed volumes, listing a non-empty directory is slow.
>> I read the DHT code and found the reason, but I am confused about why
>> DHT traverses all the bricks in the volume sequentially. Why not use
>> multiple threads to read dentries from several bricks simultaneously?
>> This question has always puzzled me. Could you please tell me
>> something about it?
>
> readdir across subvols is sequential mostly because we have to support
> rewinddir(3). We need to maintain the mapping between offset and dentry
> across multiple invocations of readdir. In other words, if someone did a
> rewinddir to an offset corresponding to an earlier dentry, subsequent
> readdirs should return the same set of dentries that the earlier
> invocation of readdir returned. For example, in a hypothetical scenario,
> readdir returned the following dentries:
>
> 1. a, off=10
> 2. b, off=2
> 3. c, off=5
> 4. d, off=15
> 5. e, off=17
> 6. f, off=13
>
> Now if we did a rewinddir to off 5 and issued readdir again, we should
> get the following dentries:
> (c, off=5), (d, off=15), (e, off=17), (f, off=13)
>
> Within a subvol, the backend filesystem provides the rewinddir guarantee
> for the dentries present on that subvol. However, across subvols it is
> the responsibility of DHT to provide the above guarantee. This means we
> need some well-defined order in which we send readdir calls (note that
> the order is not well defined if we do a parallel readdir across all
> subvols).
> So, DHT has sequential readdir, which is a well-defined order of reading
> dentries.
>
> To give an example, suppose we have another subvol, subvol2 (in addition
> to the subvol above, say subvol1), with the following listing:
>
> 1. g, off=16
> 2. h, off=20
> 3. i, off=3
> 4. j, off=19
>
> With parallel readdir we can have many orderings, like
> (a, b, g, h, i, c, d, e, f, j), (g, h, a, b, c, i, j, d, e, f) etc.
> Now if we do (with readdir done in parallel):
>
> 1. A complete listing of the directory (which can be any one of many
>    interleavings: 10C4 = 210, if each subvol's own order is preserved).
> 2. Do rewinddir (20)
>
> We cannot predict which set of dentries comes _after_ offset 20.
> However, if we do a readdir sequentially across subvols there is only
> one directory listing, i.e., (a, b, c, d, e, f, g, h, i, j). So, it's
> easier to support rewinddir.
>
> If there were no POSIX requirement for rewinddir support, I think a
> parallel readdir could easily be implemented (which improves performance
> too). But unfortunately rewinddir is still a POSIX requirement. This
> also opens up another possibility: a "no-rewinddir-support" option in
> DHT which, if enabled, results in parallel readdirs across subvols. What
> I am not sure about is how many users still use rewinddir. If there is a
> critical mass that wants performance with a tradeoff of no rewinddir
> support, this can be a good feature.
>
> +gluster-users to get an opinion on this.
>
> regards,
> Raghavendra
>
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxxx
>> http://www.gluster.org/mailman/listinfo/gluster-devel
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-users
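
To make the ordering argument in the quoted mail concrete, here is a small toy model in Python. It is only a sketch and not DHT code: the helper names (sequential_listing, interleavings, names_from) are made up for illustration, the dentries and offsets are the ones from the example above, and "resuming at an offset" follows the thread's convention of returning the entry that carries that offset and everything after it. In POSIX terms, positioning to a saved offset is what telldir(3) and seekdir(3) do; the thread uses "rewinddir" for this.

```python
from itertools import combinations

# Dentries and offsets copied from the example in the thread.
SUBVOL1 = [("a", 10), ("b", 2), ("c", 5), ("d", 15), ("e", 17), ("f", 13)]
SUBVOL2 = [("g", 16), ("h", 20), ("i", 3), ("j", 19)]


def sequential_listing(subvols):
    """Read each subvol to the end before moving to the next one."""
    return [entry for subvol in subvols for entry in subvol]


def interleavings(first, second):
    """Yield every merge of two lists that keeps each list's own order."""
    total = len(first) + len(second)
    for slots in combinations(range(total), len(second)):
        chosen = set(slots)  # positions taken by entries of `second`
        it_first, it_second = iter(first), iter(second)
        yield [next(it_second) if i in chosen else next(it_first)
               for i in range(total)]


def names_from(listing, offset):
    """Names from the entry carrying `offset` onward (thread's convention)."""
    index = next(i for i, (_, off) in enumerate(listing) if off == offset)
    return [name for name, _ in listing[index:]]


sequential = sequential_listing([SUBVOL1, SUBVOL2])
parallel = list(interleavings(SUBVOL1, SUBVOL2))
# Every distinct answer a parallel readdir could give after "offset 20".
resume_sets = {tuple(names_from(listing, 20)) for listing in parallel}

print(names_from(sequential, 5))   # ['c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
print(names_from(sequential, 20))  # ['h', 'i', 'j']
print(len(parallel))               # 210
print(len(resume_sets) > 1)        # True
```

With the sequential order there is exactly one listing, so the entries after any offset are fixed. With order-preserving parallel merges there are 210 possible listings, and the entries that follow offset 20 differ from one listing to another, which is the unpredictability described above.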