Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

James Keener <jim@xxxxxxxxxxxxx> · Tue, 12 Dec 2017 10:11:10 -0500

The default C locale on Linux (I don't know Windows) will sort "digits",
 then alphabetic with the lower then upper case of each letter in order 
like: "aAbB...zZ"

That's no true at all! The C locales are 0-9A-Za-z

#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int myCompare (const void * a, const void * b)
{
  return strcmp (*(const char **) a, *(const char **) b);
}

void with_other_locale (char *new_locale,
    const char **arr,
    const int n
    )
{
  char *old_locale, *saved_locale;

  /* Get the name of the current locale.  */
  old_locale = setlocale (LC_ALL, NULL);

  /* Copy the name so it won’t be clobbered by setlocale. */
  saved_locale = strdup (old_locale);

  if (strlen(new_locale) == 0)
  {
    new_locale = saved_locale;
  }

  /* Now change the locale and do some stuff with it. */
  setlocale (LC_ALL, new_locale);
  qsort (arr, n, sizeof (const char *), myCompare);

  printf("\nSorted array in locale %s is\n", new_locale);
  for (int i = 0; i < n; i++)
    printf("%d: %s \n", i, arr[i]);

  /* Restore the original locale. */
  setlocale (LC_ALL, saved_locale);
  free (saved_locale);
}

int main ()
{
  const char *arr[] = {"Jim", "job", "Anne", "aardvark", "Isaac", "island", "12 Days of Christmas", "12 drummers"};
  int n = sizeof(arr)/sizeof(arr[0]);
  int i;

  printf("Given array is\n");
  for (i = 0; i < n; i++)
    printf("%d: %s \n", i, arr[i]);

  with_other_locale("", arr, n);
  with_other_locale("C", arr, n);
  with_other_locale("en_US.UTF-8", arr, n);
  with_other_locale("UTF-8", arr, n);
  return 0;
}

Gives

Given array is
0: Jim 
1: job 
2: Anne 
3: aardvark 
4: Isaac 
5: island 
6: 12 Days of Christmas 
7: 12 drummers 

Sorted array in locale C is
0: 12 Days of Christmas 
1: 12 drummers 
2: Anne 
3: Isaac 
4: Jim 
5: aardvark 
6: island 
7: job 

Sorted array in locale C is
0: 12 Days of Christmas 
1: 12 drummers 
2: Anne 
3: Isaac 
4: Jim 
5: aardvark 
6: island 
7: job 

Sorted array in locale en_US.UTF-8 is
0: 12 Days of Christmas 
1: 12 drummers 
2: Anne 
3: Isaac 
4: Jim 
5: aardvark 
6: island 
7: job 

Sorted array in locale UTF-8 is
0: 12 Days of Christmas 
1: 12 drummers 
2: Anne 
3: Isaac 
4: Jim 
5: aardvark 
6: island 
7: job 

I actually don't think there is a case-insensitive locale (bydefault?) on a unix machine.

Jim

On Tue, Dec 12, 2017 at 8:18 AM, John McKown <john.archie.mckown@xxxxxxxxx> wrote:
On Tue, Dec 12, 2017 at 2:17 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
Hi Laurenz, Tom, Peter,

Thanks for your suggestions.  The practical solution seems to be to override comparison operators of char, varchar and text data types with UDFs that behave as Tom mentioned.

From: Peter Geoghegan [mailto:pg@xxxxxxx]

> That said, the idea of an "EBCDIC collation" seems limiting. Why

> should a system like DB2 for the mainframe (that happens to use EBCDIC

> as its encoding) not have a more natural, human-orientated collation

> even while using EBCDIC? ISTM that the point of using the "C" locale

> (with EBDIC or with UTF-8 or with any other encoding) is to get a

> performance benefit where the actual collation's behavior doesn't

> matter much to users. Are you sure it's really important to be

> *exactly* compatible with EBCDIC order? As long as you're paying for a

> custom collation, why not just use a collation that is helpful to

> humans?

You are right.  I'd like to ask the customer whether and why they need EBCDIC ordering.

This is a guess on my part, based on many years on an EBCDIC system. But I'll bet that they are doing a conversion off of the EBCDIC system (maybe Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They want to be able to compare the output from the existing system to the output on the new system. EBCDIC orders "lower case", "upper case", then "digits". The default C locale on Linux (I don't know Windows) will sort "digits", then alphabetic with the lower then upper case of each letter in order like: "aAbB...zZ". Comparing identical data which is not presented in exactly the same order would be very difficult. 

Regards

Takayuki Tsunakawa

-- 
I have a theory that it's impossible to prove anything, but I can't prove it.

Maranatha! <><
John McKown