Thanks for the hint. I ran some short tests and all was fine, so I am
not sure it's a drive issue.
Some more digging: the file with bad performance has these segments:
[root@afsvos01 vicepa]# hdparm --fibmap $PWD/0
/vicepa/0:
filesystem blocksize 4096, begins at LBA 2048; assuming 512 byte sectors.
byte_offset begin_LBA end_LBA sectors
0 743232 2815039 2071808
1060765696 3733064 5838279 2105216
2138636288 70841232 87586575 16745344
10712252416 87586576 87635727 49152
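The sector counts translate to roughly 1 GiB + 1 GiB + 8 GiB + 24 MiB;
a quick awk check on the numbers above:
# echo 2071808 2105216 16745344 49152 | awk '{for (i=1; i<=NF; i++) printf "%d MiB\n", $i*512/2^20}'
(prints 1011, 1028, 8176 and 24 MiB)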
Reading by segments:
# dd if=0 of=/tmp/0 bs=4M status=progress count=252
1052770304 bytes (1.1 GB, 1004 MiB) copied, 45 s, 23.3 MB/s
252+0 records in
252+0 records out
# dd if=0 of=/tmp/0 bs=4M status=progress skip=252 count=256
935329792 bytes (935 MB, 892 MiB) copied, 4 s, 234 MB/s
256+0 records in
256+0 records out
# dd if=0 of=/tmp/0 bs=4M status=progress skip=510
7885291520 bytes (7.9 GB, 7.3 GiB) copied, 12 s, 657 MB/s
2050+0 records in
2050+0 records out
So, the 1st 1G is very slow, the second segment is faster, and the rest
is quite fast; it's reproducible (I dropped caches before each dd).
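(the cache drop between runs was along the lines of
# sync; echo 3 > /proc/sys/vm/drop_caches
)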
Now, the rbd is 3TB with 256 pgs (EC 8+3). I checked with rados that
the objects are randomly distributed across pgs, e.g.
# rados --pgid 23.82 ls|grep rbd_data.20.2723bd3292f6f8
rbd_data.20.2723bd3292f6f8.0000000000000008
rbd_data.20.2723bd3292f6f8.000000000000000d
rbd_data.20.2723bd3292f6f8.00000000000001cb
rbd_data.20.2723bd3292f6f8.00000000000601b2
rbd_data.20.2723bd3292f6f8.000000000009001b
rbd_data.20.2723bd3292f6f8.000000000000005b
rbd_data.20.2723bd3292f6f8.00000000000900e8
where object ...05b, for example, corresponds to the 1st block of the
file I am testing, assuming my understanding of rbd is correct and
consecutive LBA regions are mapped to consecutive rbd objects.
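Rough arithmetic, assuming the begin_LBA above is absolute on the rbd
device and 4 MiB objects:
# printf '0x%x\n' $(( 743232 * 512 / 4194304 ))
0x5a
so the first extent should indeed start around object ...005a/...005b,
depending on where exactly the partition starts.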
So now I am completely confused, since the slow chunk of the file is
still mapped to ~256 objects on different pgs....
Maybe I misunderstood the whole thing.
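One thing I could still do is dump the PG/OSD mapping for the ~256
objects behind the slow first GiB, roughly (the pool name below is just
a placeholder):
POOL=ecpool    # placeholder, substitute the actual EC data pool name
for i in $(seq 90 345); do    # ~256 object indices starting around ...005a
    ceph osd map "$POOL" "$(printf 'rbd_data.20.2723bd3292f6f8.%016x' "$i")"
done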
Any other hints? We will still run hdd tests on all the drives....
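Probably smartctl long self-tests as Paul suggested, e.g.
# smartctl -t long /dev/sdX
and later smartctl -a /dev/sdX to read back the results.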
Cheers,
Andrej
On 3/6/23 20:25, Paul Mezzanini wrote:
When I have seen behavior like this, it was a dying drive. It only became obvious when I did a SMART long test and got failed reads. It still reported SMART OK though, so that was a lie.
--
Paul Mezzanini
Platform Engineer III
Research Computing
Rochester Institute of Technology
________________________________________
From: Andrej Filipcic <andrej.filipcic@xxxxxx>
Sent: Monday, March 6, 2023 8:51 AM
To: ceph-users
Subject: rbd on EC pool with fast and extremely slow writes/reads
Hi,
I have a problem on one of my ceph clusters that I do not understand.
ceph 17.2.5 on 17 servers, 400 HDD OSDs, 10 and 25Gb/s NICs.
A 3TB rbd image is on an erasure-coded 8+3 pool with 128 pgs, xfs
filesystem, 4MB objects in the rbd image, mostly empty.
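(For context, an rbd image backed by an EC pool is normally created with
a replicated pool for the metadata and the EC pool passed as the data
pool, roughly:
# rbd create --size 3T --data-pool <ec 8+3 pool> <replicated pool>/<image>
pool and image names omitted here.)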
I have created a bunch of 10G files; most of them were written at
1.5GB/s, but a few of them were really slow, ~10MB/s, a factor of 100.
When reading these files back, the fast-written ones read fast,
~2-2.5GB/s, while the slowly-written ones are also extremely slow to
read; iotop shows between 1 and 30 MB/s read speed.
This does not happen at all on replicated images. There are some OSDs
with higher apply/commit latency, e.g. 200ms, but there are no slow ops.
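(the apply/commit latency here is the per-OSD figure from something like
# ceph osd perf | sort -nk2 | tail
which lists the worst commit latencies last)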
The tests were actually done on a proxmox vm with librbd, but the same
happens with krbd, and on bare metal with mounted krbd as well.
I have tried to check all OSDs for laggy drives, but they all look about
the same.
I have also copied the entire image with "rados get ...", object by
object; the strange thing here is that most of the objects were copied
within 0.1-0.2s, but quite a few took more than 1s.
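The per-object copy was essentially a loop over the image's rbd_data
objects, timing each get, something like (pool name and object prefix
are placeholders here):
POOL=ecpool            # placeholder, the EC data pool
PREFIX=rbd_data.XXXX   # the image's block_name_prefix from 'rbd info'
rados -p "$POOL" ls | grep "$PREFIX" | while read -r obj; do
    /usr/bin/time -f "%e s  $obj" rados -p "$POOL" get "$obj" /dev/null
done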
The cluster is quite busy with base traffic of ~1-2GB/s, so the speeds
can vary due to that. But I would not expect a factor of 100 slowdown
for some writes/reads with rbds.
Any clues on what might be wrong or what else to check? I have another
similar ceph cluster where everything looks fine.
Best,
Andrej
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic@xxxxxx
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx