Robert,

Thanks again for the help. I'll keep looking around. However, as you said, it
may be a matter of trying to reduce OSD latency rather than trying to find
tuning options on the client side. I've already increased the readahead values
and played with the scheduler and mount options, so I'm running out of options
on that front. I may try the rbd cache at some point.

I'll let you know if I uncover anything worthwhile.

Shain

Sent from my iPhone

> On Jan 6, 2015, at 2:59 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> I think your free memory is just fine. If you have lots of data change
> (read/write) then I think it is just aging out your directory cache.
> If fast directory listing is important to you, you can always write a
> script to periodically read the directory listing so it stays in cache,
> or use http://lime-technology.com/forum/index.php?topic=4500.0.
> Otherwise you are limited to trying to reduce the latency in your Ceph
> environment for small block sizes. We have tweaked the RBD cache and
> added an SSD caching layer (on Giant) and it has helped some, but
> nothing spectacular. There have been references that increasing the
> readahead on RBD to 4M helps, but it didn't do anything for us.
>
>> On Tue, Jan 6, 2015 at 12:18 PM, Shain Miley <SMiley@xxxxxxx> wrote:
>> It does seem like the entries get cached for a certain period of time.
>>
>> Here is the memory listing for the rbd client server:
>>
>> root@cephmount1:~# free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         11965      11816        149          3        139      10823
>> -/+ buffers/cache:        853      11112
>> Swap:         4047          0       4047
>>
>> I can add more memory to the server if I need to; I have 2 or 4 16GB
>> DIMMs lying around here someplace.
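[Editor's note: the cache-warming script Robert suggests above could look
something like the sketch below. The function walks a directory tree so its
dentries and inodes stay hot in the page cache; the archive path in the usage
comment is taken from this thread and may need adjusting.]

```shell
#!/bin/sh
# Warm the directory cache for a tree so a later `ls -l` is fast.
# `ls -lR` forces both the directory reads (getdents) and a stat() of
# every entry -- the same work a cold `ls -l` would otherwise pay for
# with many small reads against the RBD.
warm_dir_cache() {
    ls -lR "$1" > /dev/null
}

# Example usage (path assumed from this thread):
#   warm_dir_cache /mnt/ceph-block-device-archive/library
#
# Example crontab entry to re-warm every 15 minutes:
#   */15 * * * * /usr/local/bin/warm-dir-cache.sh
```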
>>
>> Here are some of the pagecache sysctl settings:
>>
>> vm.dirty_background_bytes = 0
>> vm.dirty_background_ratio = 10
>> vm.dirty_bytes = 0
>> vm.dirty_expire_centisecs = 3000
>> vm.dirty_ratio = 10
>> vm.dirty_writeback_centisecs = 500
>>
>> In terms of the number of files:
>>
>> root@cephmount1:/mnt/ceph-block-device-archive/library/E# time ls
>> real    0m8.073s
>> user    0m0.000s
>> sys     0m0.012s
>>
>> root@cephmount1:/mnt/ceph-block-device-archive/library/E# ls | wc
>>     228     510    3413
>>
>> However, looking at some other directories, I see numbers in the range of
>> 500 and 600, etc., so they will vary based on the name of the artist.
>> If I had to guess, we would not have any more than 800-1000 files in the
>> very heavy directories at this point.
>>
>> Also, one thing I just noticed is that 'ls | wc' returns right away, even
>> in cases where an 'ls -l' run right afterward takes a while.
>>
>> Thanks,
>>
>> Shain
>>
>> Shain Miley | Manager of Systems and Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649
>>
>> ________________________________________
>> From: Robert LeBlanc [robert@xxxxxxxxxxxxx]
>> Sent: Tuesday, January 06, 2015 1:57 PM
>> To: Shain Miley
>> Cc: ceph-users@xxxxxxxx
>> Subject: Re: rbd directory listing performance issues
>>
>> I would think that the RBD mounter would cache the directory listing,
>> which should always make it fast, unless there is so much memory
>> pressure that it is being dropped frequently.
>>
>> How many entries are in your directory and in total on the RBD?
>> ls | wc -l
>> find . | wc -l
>>
>> What does your memory look like?
>> free -h
>>
>> I'm not sure how much help I can be, but if memory pressure is causing
>> buffers to be freed, then it can force the system to go back to disk
>> to get the directory listing. I'm guessing that if the directory is
>> large enough it could cause the system to have to go back to the RBD
>> many times.
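[Editor's note: one knob not mentioned in the thread that bears on exactly
this symptom (dentries and inodes being reclaimed while plenty of page cache
remains) is vm.vfs_cache_pressure. Lowering it below the default of 100
biases reclaim toward keeping the dentry/inode caches. A sketch of an
/etc/sysctl.conf fragment to experiment with; the value is illustrative,
not from this thread:]

```
# Bias memory reclaim toward keeping dentry and inode caches
# (kernel default is 100; lower values retain them longer).
# Illustrative value only -- test before deploying.
vm.vfs_cache_pressure = 50
```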
>> Very small I/O on RBD is very expensive compared to big
>> sequential access.
>>
>>> On Tue, Jan 6, 2015 at 11:33 AM, Shain Miley <SMiley@xxxxxxx> wrote:
>>> Robert,
>>>
>>> xfs on the rbd image as well:
>>>
>>> /dev/rbd0 on /mnt/ceph-block-device-archive type xfs (rw)
>>>
>>> However, looking at the mount options, it does not look like I've
>>> enabled anything special there.
>>>
>>> Thanks,
>>>
>>> Shain
>>>
>>> Shain Miley | Manager of Systems and Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649
>>>
>>> ________________________________________
>>> From: Robert LeBlanc [robert@xxxxxxxxxxxxx]
>>> Sent: Tuesday, January 06, 2015 1:27 PM
>>> To: Shain Miley
>>> Cc: ceph-users@xxxxxxxx
>>> Subject: Re: rbd directory listing performance issues
>>>
>>> What fs are you running inside the RBD?
>>>
>>>> On Tue, Jan 6, 2015 at 8:29 AM, Shain Miley <SMiley@xxxxxxx> wrote:
>>>> Hello,
>>>>
>>>> We currently have a 12 node (3 monitor + 9 OSD) Ceph cluster, made up
>>>> of 107 x 4TB drives formatted with xfs. The cluster is running Ceph
>>>> version 0.80.7.
>>>>
>>>> Cluster health:
>>>>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>>>>      health HEALTH_WARN crush map has legacy tunables
>>>>      monmap e1: 3 mons at
>>>> {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
>>>> election epoch 156, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>>>>      osdmap e19568: 107 osds: 107 up, 107 in
>>>>       pgmap v10117422: 2952 pgs, 15 pools, 77202 GB data, 19532 kobjects
>>>>             226 TB used, 161 TB / 388 TB avail
>>>>
>>>> Relevant ceph.conf entries:
>>>>
>>>> osd_journal_size = 10240
>>>> filestore_xattr_use_omap = true
>>>> osd_mount_options_xfs = "rw,noatime,nodiratime,logbsize=256k,logbufs=8,inode64"
>>>> osd_mkfs_options_xfs = "-f -i size=2048"
>>>>
>>>> A while back I created an 80 TB rbd image to be used as an archive
>>>> repository for some of our audio and video files.
>>>> We are still seeing good
>>>> rados and rbd read and write throughput performance; however, we seem
>>>> to be having quite a long delay in response times when we try to list
>>>> out the files in directories with a large number of folders, files, etc.
>>>>
>>>> Subsequent directory listings seem to run a lot faster (although I am
>>>> not sure how long that remains the case before we see another instance
>>>> of slowness); the initial directory listings can take 20 to 45 seconds.
>>>>
>>>> The rbd kernel client is running on Ubuntu 14.04 using kernel version
>>>> '3.18.0-031800-generic'.
>>>>
>>>> Benchmarks:
>>>>
>>>> root@rbdmount1:/mnt/rbd/music_library/D# time ls (file names removed)
>>>> real    0m18.045s
>>>> user    0m0.000s
>>>> sys     0m0.011s
>>>>
>>>> root@rbdmount1:/mnt/rbd# dd bs=1M count=1024 if=/dev/zero of=test conv=fdatasync
>>>> 1024+0 records in
>>>> 1024+0 records out
>>>> 1073741824 bytes (1.1 GB) copied, 9.94287 s, 108 MB/s
>>>>
>>>> My questions are:
>>>>
>>>> 1) Is there anything inherent in our setup/configuration that would
>>>> prevent us from having fast directory listings on these larger
>>>> directories (using an rbd image of that size, for example)?
>>>>
>>>> 2) Have there been any changes made in Giant that would warrant
>>>> upgrading the cluster as a fix to resolve this issue?
>>>>
>>>> Any suggestions would be greatly appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Shain
>>>>
>>>> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
>>>> smiley@xxxxxxx | 202.513.3649
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
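[Editor's note: the thread mentions both the rbd cache and raising RBD
readahead to 4M. A sketch of what those look like, with values illustrative
rather than taken from the thread. Note that the `rbd cache` options apply
only to librbd consumers (e.g. QEMU); the kernel rbd client used here does
not honor them, and gets its readahead from the block device instead.]

```ini
[client]
# librbd-only cache settings (Firefly/Giant era option names);
# the kernel rbd client ignores these.
rbd cache = true
rbd cache size = 67108864                 ; 64 MB, illustrative
rbd cache max dirty = 50331648            ; 48 MB, illustrative
rbd cache writethrough until flush = true

; For the kernel client, set readahead on the mapped device instead,
; e.g. the 4 MB value referenced in the thread:
;   blockdev --setra 8192 /dev/rbd0       (8192 x 512-byte sectors = 4 MB)
;   echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
```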