On Sun, 28 Jan 2024 at 23:02, Adrian Sevcenco <Adrian.Sevcenco@xxxxxxx> wrote:
>
> >> is it wrong to think of PGs like a kind of object bucket (S3 like)?
> >
> > Mostly, yes.
> so .. in a PG there are no "file data" but pieces of "file data"?
> so 100 GB file with 2x replication will be placed in more than 2 PGs?
> Is there some kind of strips or chunks that a given data conglomerate is split into
> and then _those_ are put in multiple PGs with the rule that they have to be in a Replicated=X ?

PGs have no fixed size; they accept objects until the OSD(s) they sit on cannot accept more data. For repl=3, that means three full copies of each object, held by three PG copies that are chosen to be on three different OSDs, by default also on three different OSD hosts.

That said, if you use RadosGW or RBD (for VM disk images, for example), those two will split your large "files" into many smaller objects, at sizes like 2M or 4M each. So if you upload a 10G S3 object to RadosGW, or have your virtualization platform assign a 10G "disk" to a VM, that 10G of data will not sit in one PG: it will be cut into 2-4M pieces with different names like 10G_image.1001, 10G_image.1002, and those pieces will spread out over many different PGs.

This has the benefit of making it possible to store really huge images on many small drives, and it also spreads load, since the VM may itself ask for IO at several places on its disk, and those requests will be served by different drives in your ceph cluster. If the VM image ended up as a single object in a single PG, all IO for it would always hit that one drive and you would get really bad performance, worse than if the hypervisor had the drive locally attached.

So for most usage, as a normal user, the libraries underneath will do the object splitting for you and it is totally transparent. Depending on how you use the cluster and at which level you are looking, those final objects in ceph do not get split again by the cluster itself, unless you are using Erasure Coding. With raw rados calls one can make 100G objects, but most usage goes via librbd or radosgw or something else on top of rados, and hence the splitting is done for you.

> >> if so, is the size of PG the limit of a file size? (for replicated case)?

Mostly no. The corner case here is if you have OSDs with very different amounts of free space. This is mostly visible in reports like "ceph df", because it will say a pool (made up of PGs) has a max-free based on the OSD with the least free space, multiplied by the number of OSDs in the pool. This is because PG placement is semi-random, and depending on what you write, it could happen that all data goes to the OSD with the least free space every time, so the pool goes full long before the largest OSD is filled. The fewer OSDs you have, and the larger the difference in size between them, the worse the situation becomes.

As an example, let's say you have a pool that is told by crush to end up on these three OSD hosts with one drive each:

HostA: 10TB free
HostB: 10TB free
HostC: 10G free

In this case "ceph df" will say "you should consider this pool to have (raw) 30G free at most", and if you have repl=3 on this pool it can only take 10G of data before either the primary data or one of the two copies makes HostC go full. This situation makes "ceph df" tell you weird numbers, but also the most truthful ones, since they tell you what your clients will experience.
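To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the simplification are mine, and the real "ceph df" MAX AVAIL calculation also takes full ratios and CRUSH weights into account, so treat this as an illustration only:

    def pool_max_avail(osd_free_bytes, replicas):
        # The pool can only grow until the OSD with the least free space
        # is full, so the raw limit is min(free) times the number of OSDs...
        raw_limit = min(osd_free_bytes) * len(osd_free_bytes)
        # ...and with replication, clients only get 1/replicas of that.
        return raw_limit / replicas

    GB = 1000 ** 3
    TB = 1000 ** 4
    # HostA: 10TB free, HostB: 10TB free, HostC: 10G free
    print(pool_max_avail([10 * TB, 10 * TB, 10 * GB], replicas=3) / GB)
    # -> 10.0, i.e. roughly 10G of client data before HostC goes full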
If you end up in a situation like the above, you will also notice that if any OSD breaks, HostC has no chance of hosting extra copies during repairs, since it will go full almost immediately. When the cluster is unbalanced like this, your redundancy suffers too.

If your cluster is holding important data, you would make sure there are more OSD hosts than the replication factor, so the cluster can repair copies onto another host when one host dies. And since there are soft limits at 85, 90 and 95% OSD capacity that stop certain operations from driving the OSDs completely full, you would want to keep a small cluster at 50-60% used at most, so that if one host dies, all of its data can fit onto the others without passing the first 85% limit while all PGs still hold three copies in total. If you have many OSD hosts, each one represents a small percentage of the total, and it is easier for the cluster to handle a host failure using space from all the others to rebuild; but at some point, with 100+ OSD hosts, the chances grow that some host is in planned or unplanned maintenance at any given time.

So while PGs don't have fixed size limits, they do have a current size, and when moving or balancing data, ceph moves whole PGs around, not files or objects. That is why it is preferred to have something like 100+ PGs per OSD, so that ceph can balance data in pieces of roughly 1% of the drive's capacity at a time, instead of shuffling an entire OSD's worth of data back and forth.
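The "keep a small cluster at 50-60% used" rule of thumb above can be sanity-checked with the same kind of back-of-the-envelope math. Again a simplified sketch with my own names; it assumes equal-sized hosts and that a failed host's data is redistributed evenly over the survivors, which is not exactly how ceph computes anything:

    def usage_after_host_failure(num_hosts, used_fraction):
        # When one host dies, its data gets re-replicated onto the
        # remaining hosts, so the survivors end up holding
        # num_hosts / (num_hosts - 1) times their previous usage.
        return used_fraction * num_hosts / (num_hosts - 1)

    # 4 hosts at 60% used: survivors end up around 80%, still under
    # the first 85% soft limit.
    print(usage_after_host_failure(4, 0.60))   # 0.8
    # 4 hosts at 75% used: survivors would hit 100%, so the repair
    # cannot finish without driving the OSDs full.
    print(usage_after_host_failure(4, 0.75))   # 1.0
    # 10 hosts at 70% used: survivors end up around 78%, so a larger
    # cluster absorbs a host failure much more easily.
    print(usage_after_host_failure(10, 0.70))  # ~0.778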