Comments are inline.

On 04/22/2016 09:41 AM, Vijay Bellur wrote:
> On Mon, Apr 18, 2016 at 3:28 AM, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:
>> Hi All,
>>
>> Currently we are experiencing some issues with the implementation of
>> readdirp in data tiering.
>>
>> Problem statement:
>>
>> When we do a readdirp, tiering reads entries only from the cold tier.
>> Since the hashed subvol for all files is set to the cold tier by
>> default, every file has an entry in the cold tier. Some of these are
>> data files and the rest are pointer files (T files), which point to the
>> original files in the hot tier. The motivation behind this
>> implementation was to improve readdir performance by looking up entries
>> in only one tier. We also ran into an issue where some files were not
>> listed while using the default dht_readdirp. This is because
>> dht_readdirp reads entries from each subvol sequentially. Since tiering
>> migrates files frequently, a file would not show up in the listing if
>> it was migrated off a subvol before the readdir got to it, but after
>> the readdir had already processed the target subvol [1].
>>
>> So for the files residing in the hot tier we fall back to readdir,
>> i.e., we don't give stat information for such entries to the
>> application. This is because the corresponding pointer file in the cold
>> tier won't have a proper stat. So we force fuse clients to do an
>> explicit lookup/stat for such entries by setting the nodeid to null.
>> Similarly, in the case of native NFS, we mark such entries as having a
>> stale stat by setting attributes_follow = FALSE.
>>
> Is the explicit lookup done by the kernel fuse module or is it done in
> our bridge layer?

It is an explicit lookup done by the kernel.

> Also does md-cache handle the case where nodeid is NULL in a readdirp
> response?

If entry->inode is set to null, readdirp won't cache that entry.

>> But the problem comes when we use gfapi, where we don't have any
>> control over client behavior.
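To make the behaviour above concrete, here is a minimal, hypothetical model (plain Python, not the actual tier/md-cache xlator code, which is C): entries backed by a T file get their stat and inode stripped before the readdirp reply, and an md-cache-like layer then skips caching any entry whose inode is null. All names here (`DirEntry`, `tier_readdirp`, `md_cache_update`) are illustrative, not real GlusterFS symbols.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DirEntry:
    name: str
    is_linkfile: bool            # True for a T (pointer) file on the cold tier
    stat: Optional[dict] = None
    inode: Optional[int] = None  # None plays the role of a NULL nodeid

def tier_readdirp(cold_tier_entries):
    """Return entries from the cold tier only; strip stat/inode from
    linkfile entries so the client must issue an explicit lookup."""
    for e in cold_tier_entries:
        if e.is_linkfile:
            e.stat = None
            e.inode = None  # client sees a NULL nodeid -> explicit lookup
    return cold_tier_entries

def md_cache_update(cache, entries):
    """Mimic md-cache behaviour: entries with a null inode are not cached."""
    for e in entries:
        if e.inode is not None:
            cache[e.name] = e.stat
    return cache

entries = [
    DirEntry("data.txt", is_linkfile=False, stat={"size": 42}, inode=101),
    DirEntry("hot.txt", is_linkfile=True, stat={"size": 0}, inode=102),
]
cache = md_cache_update({}, tier_readdirp(entries))
print(sorted(cache))  # ['data.txt'] -- the T-file entry is never cached
```

A fuse client handles the null nodeid by issuing its own lookup; the gfapi problem discussed below is precisely that not every consumer does this.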
>> So to fix this issue we have to give stat information for all the
>> entries.
>>
> Apart from Samba, what other consumers of gfapi have this problem?

In nfs-ganesha, what I understand is that they are not sending readdirp,
so we are good there. But any other application that always expects a
valid stat in a readdirp response will fail.

>> Possible solutions:
>> 1. Revert tier_readdirp to something similar to dht_readdirp, then fix
>>    the problem in [1].
>> 2. Have tier readdirp do a lookup for every linkfile entry it finds and
>>    populate the data (which would cause a performance drop). This would
>>    mean that other translators do not need to be aware of the tier
>>    behaviour.
>> 3. Do some sort of batched lookup in the tier readdirp layer to improve
>>    the performance.
>>
>> Neither 2 nor 3 gives any performance benefit, but both solve the
>> problem in [1]. In fact, even this is not complete, because by the time
>> we do the lookup (batched or single), the file could have moved off the
>> hot tier, or vice versa, which will again result in stale data.
>>
> Isn't this problem common with any of the solutions? Since tiering
> keeps moving data without any of the clients being aware, any
> attribute cache in the client stack can quickly go stale.

That is right.

>> 4. Revert to dht_readdirp and then, instead of taking all entries from
>>    the hot tier, take only the entries which have a T file in the cold
>>    tier. (We can delay deleting the data file after demotion, so that
>>    we will still get the stat from the hot tier.)
>>
> Going by the architectural model of xlators, tier should provide the
> right entries with attributes to the upper layers (xlators/vfs etc.).
> Relying on a specific behavior from layers above us to mask a problem
> in our layer does not seem ideal. I would go with something like 2 or
> 3. If we want to retain the current behavior, we should make it
> conditional as I am not certain that this behavior is foolproof too.
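For discussion, option 3 (the batched lookup) could look roughly like the sketch below. This is a hypothetical Python model, not the C xlator code; `hot_tier_lookup_batch` is an assumed helper standing in for a single batched RPC to the hot tier. Note the race it leaves open: a file can migrate between the readdir and the batched lookup, which is the staleness problem raised above.

```python
def hot_tier_lookup_batch(names, hot_tier):
    """Assumed batched stat call: one round trip resolves many names.
    hot_tier is modelled here as a simple name -> stat dict."""
    return {n: hot_tier.get(n) for n in names}

def tier_readdirp_batched(cold_entries, hot_tier):
    """Fill in stats for T-file entries with one batched lookup instead
    of one lookup per entry (option 2). A stat can still come back None
    if the file migrated in the meantime -- the stale-data window."""
    linkfiles = [e["name"] for e in cold_entries if e["is_linkfile"]]
    stats = hot_tier_lookup_batch(linkfiles, hot_tier)
    for e in cold_entries:
        if e["is_linkfile"]:
            e["stat"] = stats.get(e["name"])
    return cold_entries

cold = [
    {"name": "a", "is_linkfile": False, "stat": {"size": 1}},
    {"name": "b", "is_linkfile": True, "stat": None},
]
hot = {"b": {"size": 7}}
result = tier_readdirp_batched(cold, hot)
print(result[1]["stat"])  # {'size': 7}
```

The trade-off versus option 2 is round trips: one batched call per readdirp reply rather than one lookup per linkfile, at the cost of a slightly wider race window.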
If we make the changes in tier_readdirp, then it affects the performance
of plain readdir (if md-cache is on); we may need to turn off the volume
option "performance.force-readdirp". What do you think here?

Rafi

> Thanks,
> Vijay

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel