Hi,

On Thu, 10 Aug 2023, Junio C Hamano wrote:

> "Mark Ruvald Pedersen via GitGitGadget" <gitgitgadget@xxxxxxxxx>
> writes:
>
> > +/*
> > + * To accommodate common filesystem limitations, where the loose refs' file
> > + * names must not exceed `NAME_MAX`, the labels generated by `git rebase
> > + * --rebase-merges` need to be truncated if the corresponding commit subjects
> > + * are too long.
> > + * Add some margin to stay clear from reaching `NAME_MAX`.
> > + */
> > +#define GIT_MAX_LABEL_LENGTH ((NAME_MAX) - (LOCK_SUFFIX_LEN) - 16)
>
> OK.  Hopefully no systems define NAME_MAX shorter than 20 bytes ;-).

If there are, we already have problems with the following paths:

	#CHARS git_path
	---------------------------------
	    20 BISECT_ANCESTORS_OK
	    20 BISECT_EXPECTED_REV
	    20 BISECT_FIRST_PARENT
	    22 fsmonitor--daemon.ipc
	    23 drop_redundant_commits
	    23 git-rebase-todo.backup
	    23 keep_redundant_commits
	    23 reschedule-failed-exec
	    24 allow_rerere_autoupdate
	    26 no-reschedule-failed-exec

So I think we're good ;-)

> We may suffix "-%d" to make it unique after this truncation, so
> there definitely is a need for some slop, and 16 bytes should be
> sufficiently long.
>
>
> > @@ -5404,14 +5415,34 @@ static const char *label_oid(struct object_id *oid, const char *label,
> >  	 *
> >  	 * Note that we retain non-ASCII UTF-8 characters (identified
> >  	 * via the most significant bit). They should be all acceptable
> > -	 * in file names. We do not validate the UTF-8 here, that's not
> > -	 * the job of this function.
> > +	 * in file names.
> > +	 *
> > +	 * As we will use the labels as names of (loose) refs, it is
> > +	 * vital that the name not be longer than the maximum component
> > +	 * size of the file system (`NAME_MAX`). We are careful to
> > +	 * truncate the label accordingly, allowing for the `.lock`
> > +	 * suffix and for the label to be UTF-8 encoded (i.e. we avoid
> > +	 * truncating in the middle of a character).
> >  	 */
> > -	for (; *label; label++)
> > -		if ((*label & 0x80) || isalnum(*label))
> > +	for (; *label && buf->len + 1 < max_len; label++)
> > +		if (isalnum(*label) ||
> > +		    (!label_is_utf8 && (*label & 0x80)))
> >  			strbuf_addch(buf, *label);
> > +		else if (*label & 0x80) {
> > +			const char *p = label;
> > +
> > +			utf8_width(&p, NULL);
> > +			if (p) {
> > +				if (buf->len + (p - label) > max_len)
> > +					break;
> > +				strbuf_add(buf, label, p - label);
> > +				label = p - 1;
> > +			} else {
> > +				label_is_utf8 = 0;
> > +				strbuf_addch(buf, *label);
> > +			}
>
> utf8_width() does let you advance one Unicode character at a time as
> its side effect, but it may be a bit overkill, as its primary
> function is to compute the display width of that character.
>
> We could take advantage of the fact that the first byte of a UTF-8
> character has the two high bits set (i.e. 11xxxxxx) while the second
> and subsequent bytes have only the top bit set and the second-highest
> bit clear (i.e. 10xxxxxx) to simplify/optimize it.  If this were in
> a performance-sensitive codepath, that is.

It is not a performance-critical code path, so I erred on the side of
simplicity (although I have to admit that the post-image of the diff is
not exactly for the faint of heart).

Could we maybe keep in the back of our heads that we already have
UTF-8-truncating functionality in the sequencer, and, in case another
user should turn up, implement that optimized function in `utf8.[ch]`?
(See the rough sketch at the end of this mail.)

> I'll queue it as-is for now, as we are in "regression fix only"
> phase of the cycle, and have enough time to polish it.

Thanks,
Johannes
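
P.S.: To illustrate the lead-byte idea, here is a rough, completely
untested sketch of what such a helper could look like if we ever moved
it into `utf8.[ch]`. The name `truncate_at_utf8_boundary()` and its
signature are made up for illustration and are not existing Git API; it
merely backs up over 10xxxxxx continuation bytes so that a byte-length
cut never splits a character:

-- snip --
#include <stdio.h>
#include <string.h>

/*
 * Return the largest length <= max_len at which `s` (assumed to be
 * UTF-8) can be cut without splitting a multi-byte character.
 *
 * Continuation bytes look like 10xxxxxx; lead bytes are either ASCII
 * (0xxxxxxx) or start with 11xxxxxx, so it suffices to back up while
 * the first byte that would be dropped is a continuation byte.
 */
static size_t truncate_at_utf8_boundary(const char *s, size_t len,
					size_t max_len)
{
	size_t cut;

	if (len <= max_len)
		return len;

	cut = max_len;
	while (cut > 0 && ((unsigned char)s[cut] & 0xc0) == 0x80)
		cut--;
	return cut;
}

int main(void)
{
	/* "bugfix-für-ärger"; the 'ü' occupies bytes 8 and 9 */
	const char *label = "bugfix-f\xc3\xbcr-\xc3\xa4rger";
	size_t cut = truncate_at_utf8_boundary(label, strlen(label), 9);

	/* Cutting at 9 bytes would split the 'ü', so we get 8 instead */
	printf("%zu: %.*s\n", cut, (int)cut, label);
	return 0;
}
-- snap --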