Hi.

At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <andreas@xxxxxxxxxx> wrote in <VisenaEmail.3.8750116fce15432e.16d1c0b2b28@tc7-visena>
> On Tuesday, 10 September 2019 at 18:21:45, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> > Jimmy Huang <jimmy_huang@xxxxxxxx> writes:
> > > I tried pg_trgm and my own customized token parser
> > > https://github.com/huangjimmy/pg_cjk_parser
> >
> > pg_trgm is going to be fairly useless for indexing text that's mostly
> > multibyte characters, since its unit of indexable data is just 3 bytes
> > (not characters).  I don't know of any comparable issue in the core
> > tsvector logic, though.  The numbers you're quoting do sound quite awful,
> > but I share Cory's suspicion that it's something about your setup rather
> > than an inherent Postgres issue.
> >
> > regards, tom lane
>
> We experienced quite awful performance when we hosted the DB on virtual
> servers (~5 years ago), and it turned out we had hit the write-cache limit
> (then 8GB), which resulted in ~1MB/s IO throughput. Running iozone might
> help track down IO problems.
> --
> Andreas Joseph Krogh

Multibyte text also quickly bloats a trigram index: it produces a huge number of small buckets, one for every 3-character combination drawn from thousands of distinct characters, which makes the index useless. pg_bigm, which is based on bigrams (2-grams), works better for multibyte characters.

https://pgbigm.osdn.jp/index_en.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
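
P.S. As a rough sketch of what the pg_bigm approach looks like (the table
and column names below are made up purely for illustration, and it assumes
the extension is installed on the server):

    -- enable the extension once per database
    CREATE EXTENSION pg_bigm;

    -- build a 2-gram GIN index on the text column to be searched
    CREATE INDEX docs_body_bigm_idx ON docs USING gin (body gin_bigm_ops);

    -- LIKE pattern searches on multibyte text can then use the index
    SELECT id FROM docs WHERE body LIKE '%東京都%';

The equivalent pg_trgm index would use gin_trgm_ops instead, but, as noted
above, trigrams degrade badly on mostly-multibyte text.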