Re: High apply latency

Hi Jakub,

On 05/02/2018 at 12:26, Jakub Jaszewski wrote:
Hi Frederic, 

Many thanks for your contribution to the topic!

I've just set logging level 20 for filestore via

ceph tell osd.0 config set debug_filestore 20

but so far I have found nothing matching the keyword 'split' in /var/log/ceph/ceph-osd.0.log.


So, if you're running Ceph > 12.2.1, that means splitting is not happening. Did you check during writes? Did you check the other OSDs' logs?
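For a quick check across all OSDs on a node (assuming the default log locations), something like this would do:

grep -i split /var/log/ceph/ceph-osd.*.log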

Actually, splitting should not happen now that you've increased the filestore_merge_threshold and filestore_split_multiple values.


I've also run your script across the cluster nodes; results as follows:

id=3, pool=volumes, objects=10454548, avg=160.28
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.2344
id=3, pool=volumes, objects=10454548, avg=159.22
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.9994
id=3, pool=volumes, objects=10454548, avg=159.843
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7435
id=3, pool=volumes, objects=10454548, avg=159.695
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.0579
id=3, pool=volumes, objects=10454548, avg=160.594
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7757
id=3, pool=volumes, objects=10454548, avg=160.099
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=33.8517
id=3, pool=volumes, objects=10454548, avg=159.912
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=37.5698
id=3, pool=volumes, objects=10454548, avg=159.407
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.4991
id=3, pool=volumes, objects=10454548, avg=160.075
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.481

Looks like there is nothing to be handled by split, am I right? But what about merging? The avg is less than 40, so should the directory structure be reduced now?

It should, I guess. But then you'd see blocked requests on every object deletion. If you do, you might want to set filestore_merge_threshold to -40 (a negative value) so that merging no longer happens.
Splitting would still happen above 5120 files per subdirectory.
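If you go that way, something along these lines should apply it at runtime (a sketch; you'd also want the value in ceph.conf so it survives restarts):

ceph tell osd.* injectargs '--filestore_merge_threshold=-40'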

    "filestore_merge_threshold": "40",
    "filestore_split_multiple": "8",
    "filestore_split_rand_factor": "20",

May I ask for a link to documentation where I can read more about the OSDs' underlying directory structure?

I'm not aware of any related documentation.

Do you still observe slow or blocked requests now that you've increased filestore_merge_threshold and filestore_split_multiple?

Regards,

Frédéric.


And I just noticed these log entries in /var/log/ceph/ceph-osd.0.log:

2018-02-05 11:22:03.346400 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 27 vs existing csq=27 existing_state=STATE_STANDBY
2018-02-05 11:22:03.346583 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 28 vs existing csq=27 existing_state=STATE_STANDBY


Many thanks!
Jakub

On Mon, Feb 5, 2018 at 9:56 AM, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:

Hi,

In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops should be logged with the default debug settings: https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320
There's also an RFE for merging to be logged, as splitting now is: https://bugzilla.redhat.com/show_bug.cgi?id=1523532

Regards,

Frédéric.


On 02/02/2018 at 17:00, Frédéric Nass wrote:

Hi,

Split and merge operations happen during writes only, splitting on file creation and merging on file deletion.

As you don't see any blocked requests during reads, I would guess your issue happens during splitting. Now that you've increased filestore_merge_threshold and filestore_split_multiple, you shouldn't expect any splitting to happen anytime soon, nor any merging, unless your workload consists of writing a huge number of files and then removing them.

You should check how many files there are in each of the lower directories of pool 20's PGs. This would help confirm that the blocked requests come with the splitting.
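For a quick look at pool 20 on one node, a rough one-liner in the same spirit as the script below:

for dir in $(find /var/lib/ceph/osd/ceph-*/current/20.*_head -type d) ; do echo "$(find $dir -maxdepth 1 -type f | wc -l) $dir" ; done | sort -rn | head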

We now use the script below (on one of the OSD nodes) to get an average number of files in some PGs of each pool, and we run it every 5 minutes with Munin to graph the values.
This way, we can anticipate the splitting, and even provoke it during off hours with a rados bench (using a prefix):

#!/bin/bash

# Report, for each pool, the average number of files per subdirectory
# across a sample of this node's PG directories.

for pool in $(ceph osd pool ls detail | grep -v -E 'snap|^$' | cut -d " " -f2-3 | sed -e 's/ /|/g') ; do

    pool_id=$(echo $pool | cut -d "|" -f1)
    pool_name=$(echo $pool | cut -d "|" -f2 | sed -e "s/'//g")
    nbr_objects=$(rados df | grep "$pool_name " | awk '{print $3}')

    # Skip pools that have no PG directory on this node
    if ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head > /dev/null 2>&1 ; then
        # For each of the last 5 PG directories: count the files in every
        # non-empty subdirectory, average these counts per PG, then average
        # the per-PG values.
        avg=$(for pg_dir in $(ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head | tail -5) ; do
            find $pg_dir -type d -print0 | while read -d '' -r dir ; do
                files=$(find "$dir" -maxdepth 1 -type f | wc -l)
                printf "$files in $dir\n"
            done | grep -v '^0 ' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }'
        done | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }')
    else
        avg=0
    fi

    echo "id=$pool_id, pool=$pool_name, objects=$nbr_objects, avg=$avg"

done
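If you don't use Munin, a plain cron entry keeps the history just as well (a sketch; the script path and log file are only examples):

*/5 * * * * root /usr/local/bin/pg_files_avg.sh >> /var/log/pg_files_avg.log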

With 40/8 values, subdirectory splitting will happen when the number of files in a subdirectory goes over 5120, and merging when it goes below 40. My assumption is that merging can also happen to intermediate directories when they go below 40 files, but I'm not sure about that.
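For reference, here is where the 5120 comes from, as I understand the filestore splitting rule:

# a subdirectory splits when its file count exceeds
# filestore_split_multiple * abs(filestore_merge_threshold) * 16
echo $(( 8 * 40 * 16 ))    # prints 5120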

If you ever need to provoke splitting on a pool, you can do this with a rados bench (don't forget the prefix so you can easily remove the objects afterwards if it turns out that some were left in the pool).
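A sketch of such a run, assuming your rados version has --no-cleanup and the cleanup command's --prefix option:

# write objects without automatic cleanup so the new directory layout persists
rados bench -p ${pool_name} 600 write --no-cleanup
# later, remove the benchmark objects by their default prefix
rados -p ${pool_name} cleanup --prefix benchmark_data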
You can also do the splitting offline if you need it:

systemctl stop ceph-osd@${osd_num}
ceph-osd -i ${osd_num} --flush-journal
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log --op apply-layout-settings --pool ${pool_name}
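Once the tool completes, you would presumably start the OSD again:

systemctl start ceph-osd@${osd_num}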

Regards,

Frederic.


On 02/02/2018 at 09:55, Jakub Jaszewski wrote:
Hi,

So I have changed the merge & split settings to

filestore_merge_threshold = 40
filestore_split_multiple = 8

and restarted all OSDs, host by host.
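I set them in the [osd] section of ceph.conf, which as I understand it is needed for the values to persist across restarts:

[osd]
filestore_merge_threshold = 40
filestore_split_multiple = 8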

Let me ask a question: although the pool default.rgw.buckets.data that was affected prior to the above change now has higher write bandwidth, it is very erratic. Writes to the other pools (both EC and replicated types) are erratic too; before the change, writes to replicated pools were much more stable.
Reads from the pools look fine and stable.

Is this the result of the mentioned change? Is the PG directory structure updating, or ...?

[...]

Regards
Jakub

On Thu, Feb 1, 2018 at 3:33 PM, Jakub Jaszewski <jaszewski.jakub@xxxxxxxxx> wrote:
Regarding split & merge, I have the default values
filestore_merge_threshold = 10
filestore_split_multiple = 2

According to https://bugzilla.redhat.com/show_bug.cgi?id=1219974, the recommended values are

filestore_merge_threshold = 40
filestore_split_multiple = 8

Is this something I can easily change back to the defaults, or to lower values than proposed, in case of further performance degradation?

I did tests of 4 pools: 2 replicated pools (x3) and 2 EC pools (k=6, m=3).

The pool with the lowest bandwidth has a PG directory structure like:
├── 20.115s1_head
│   └── DIR_5
│       └── DIR_1
│           ├── DIR_1
│           │   ├── DIR_0
│           │   ├── DIR_1
│           │   ├── DIR_2
│           │   │   ├── DIR_0
│           │   │   ├── DIR_1
│           │   │   ├── DIR_2
│           │   │   ├── DIR_3
│           │   │   ├── DIR_4
│           │   │   ├── DIR_5
│           │   │   ├── DIR_6
│           │   │   ├── DIR_7
│           │   │   ├── DIR_8
│           │   │   ├── DIR_9
│           │   │   ├── DIR_A
│           │   │   ├── DIR_B
│           │   │   ├── DIR_C
│           │   │   ├── DIR_D
│           │   │   ├── DIR_E
│           │   │   └── DIR_F 


Test results

# rados bench -p default.rgw.buckets.data 10 write 
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size 4194432 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180679
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       129       113   451.975   452.014   0.0376714    0.128027
    2      16       209       193   385.964    320.01    0.119609    0.138517
    3      16       235       219   291.974   104.003   0.0337624     0.13731
    4      16       235       219   218.981         0           -     0.13731
    5      16       266       250   199.983   62.0019    0.111673    0.238424
    6      16       317       301   200.649   204.006   0.0340569    0.298489
    7      16       396       380   217.124    316.01   0.0379956    0.283458
    8      16       444       428   213.981   192.006   0.0304383    0.274193
    9      16       485       469   208.426   164.005    0.391956    0.283421
   10      16       496       480   191.983   44.0013    0.104497    0.292074
   11      16       497       481   174.894   4.00012    0.999985    0.293545
   12      16       497       481    160.32         0           -    0.293545
   13      16       497       481   147.987         0           -    0.293545
   14      16       497       481   137.417         0           -    0.293545
Total time run:         14.493353
Total writes made:      497
Write size:             4194432
Object size:            4194432
Bandwidth (MB/sec):     137.171
Stddev Bandwidth:       147.001
Max bandwidth (MB/sec): 452.014
Min bandwidth (MB/sec): 0
Average IOPS:           34
Stddev IOPS:            36
Max IOPS:               113
Min IOPS:               0
Average Latency(s):     0.464281
Stddev Latency(s):      1.09388
Max latency(s):         6.3723
Min latency(s):         0.023835
Cleaning up (deleting benchmark objects)
Removed 497 objects
Clean up completed and total clean up time :10.622382
#


# rados bench -p benchmark_erasure_coded 10 write 
hints = 1
Maintaining 16 concurrent writes of 4202496 bytes to objects of size 4202496 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180807
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       424       408   1635.11   1635.19   0.0490434   0.0379616
    2      16       828       812   1627.03   1619.16   0.0616501   0.0388467
    3      16      1258      1242   1659.06   1723.36   0.0304412   0.0384537
    4      16      1659      1643   1646.03   1607.13   0.0155402   0.0387351
    5      16      2053      2037   1632.61   1579.08   0.0453354   0.0390236
    6      16      2455      2439      1629   1611.14   0.0485313   0.0392376
    7      16      2649      2633   1507.34   777.516   0.0148972   0.0393161
    8      16      2858      2842   1423.61   837.633   0.0157639   0.0449088
    9      16      3245      3229   1437.75   1551.02   0.0200845   0.0444847
   10      16      3629      3613   1447.85      1539   0.0654451   0.0441569
Total time run:         10.229591
Total writes made:      3630
Write size:             4202496
Object size:            4202496
Bandwidth (MB/sec):     1422.18
Stddev Bandwidth:       341.609
Max bandwidth (MB/sec): 1723.36
Min bandwidth (MB/sec): 777.516
Average IOPS:           354
Stddev IOPS:            85
Max IOPS:               430
Min IOPS:               194
Average Latency(s):     0.0448612
Stddev Latency(s):      0.0712224
Max latency(s):         1.08353
Min latency(s):         0.0134629
Cleaning up (deleting benchmark objects)
Removed 3630 objects
Clean up completed and total clean up time :2.321669



# rados bench -p volumes 10 write 
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180651
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       336       320   1279.89      1280   0.0309006   0.0472524
    2      16       653       637   1273.84      1268   0.0465151   0.0495701
    3      16       956       940   1253.17      1212   0.0337327   0.0504146
    4      16      1256      1240   1239.85      1200   0.0177263   0.0509145
    5      16      1555      1539   1231.05      1196   0.0364991   0.0516724
    6      16      1868      1852   1234.51      1252   0.0260964   0.0510236
    7      16      2211      2195   1254.13      1372    0.040738    0.050847
    8      16      2493      2477   1238.35      1128   0.0228582   0.0514979
    9      16      2838      2822   1254.07      1380   0.0265224   0.0508641
   10      16      3116      3100   1239.85      1112   0.0160151   0.0513104
Total time run:         10.192091
Total writes made:      3117
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1223.3
Stddev Bandwidth:       89.9383
Max bandwidth (MB/sec): 1380
Min bandwidth (MB/sec): 1112
Average IOPS:           305
Stddev IOPS:            22
Max IOPS:               345
Min IOPS:               278
Average Latency(s):     0.0518144
Stddev Latency(s):      0.0529575
Max latency(s):         0.663523
Min latency(s):         0.0122169
Cleaning up (deleting benchmark objects)
Removed 3117 objects
Clean up completed and total clean up time :0.212296



# rados bench -p benchmark_replicated  10 write 
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180779
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       309       293   1171.94      1172   0.0233267   0.0508877
    2      16       632       616   1231.87      1292   0.0258237    0.049612
    3      16       959       943   1257.19      1308   0.0335615   0.0483464
    4      16      1276      1260   1259.85      1268    0.031461   0.0504689
    5      16      1643      1627   1301.44      1468   0.0274032   0.0489651
    6      16      1991      1975   1316.51      1392   0.0408116   0.0483596
    7      16      2328      2312   1320.98      1348   0.0242298    0.048175
    8      16      2677      2661   1330.33      1396    0.097513    0.047962
    9      16      3042      3026   1344.72      1460   0.0196724   0.0474078
   10      16      3384      3368   1347.03      1368   0.0426199   0.0472573
Total time run:         10.482871
Total writes made:      3384
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1291.25
Stddev Bandwidth:       90.4861
Max bandwidth (MB/sec): 1468
Min bandwidth (MB/sec): 1172
Average IOPS:           322
Stddev IOPS:            22
Max IOPS:               367
Min IOPS:               293
Average Latency(s):     0.048763
Stddev Latency(s):      0.0547666
Max latency(s):         0.938211
Min latency(s):         0.0121556
Cleaning up (deleting benchmark objects)
Removed 3384 objects
Clean up completed and total clean up time :0.239684
#



Luis, why did you advise against increasing pg_num/pgp_num? I'm wondering which option is better: increasing pg_num, or increasing filestore_merge_threshold and filestore_split_multiple?

Thanks
Jakub


On Thu, Feb 1, 2018 at 9:38 AM, Jaroslaw Owsiewski <jaroslaw.owsiewski@xxxxxxxxxx> wrote:
Hi,

maybe "split is on the floor"?

Regards
-- 
Jarek










_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
