Prefixes

Kobo dictionaries are sharded by a prefix derived from the headword.

The information in this document is based on reverse engineering DictionaryParser::htmlForWord.

Note: Kobo only looks in the file matching the prefix of the word being looked up. If a variant has a different prefix from its headword, the entry must be duplicated into the file for each matching prefix (duplicate words aren’t an issue).
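The duplication rule above can be sketched as follows. This is only an illustration: prefixFiles and naivePrefix are hypothetical names that do not come from dictutil, and naivePrefix is a deliberately simplified stand-in (lowercase, first two characters) for the real algorithm described under “Prefix algorithm”. The prefix function is passed in as a parameter so that the real one can be substituted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixFiles returns the sorted, de-duplicated set of prefixes under which
// an entry would need to be written: one for the headword and one for each
// variant. The prefix function is supplied by the caller.
func prefixFiles(headword string, variants []string, prefix func(string) string) []string {
	seen := map[string]bool{prefix(headword): true}
	for _, v := range variants {
		seen[prefix(v)] = true
	}
	files := make([]string, 0, len(seen))
	for p := range seen {
		files = append(files, p)
	}
	sort.Strings(files)
	return files
}

// naivePrefix is a HYPOTHETICAL stand-in for the real prefix algorithm.
// It only lowercases the word and keeps the first two characters; it does
// not handle padding, whitespace, Cyrillic, or the "11" fallback.
func naivePrefix(word string) string {
	r := []rune(strings.ToLower(word))
	if len(r) > 2 {
		r = r[:2]
	}
	return string(r)
}

func main() {
	// "mouse" and "mouses" share the prefix "mo"; "mice" needs "mi" as well.
	fmt.Println(prefixFiles("mouse", []string{"mice", "mouses"}, naivePrefix)) // prints [mi mo]
}
```

In this sketch the entry for “mouse” would be written twice, once per returned prefix, so that a lookup of “mice” finds it in its own file.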

Note: This document only covers the algorithm used for dictionaries other than the Japanese (Kanji) ones.

Prefix algorithm

Prefixes are calculated using the following steps. Note that “character” refers to a single Unicode code point, not a byte.

1. Trim the word at the first null byte, if any (i.e. treat it as a C string).
2. Discard everything but the first two characters.
3. Convert the characters to lowercase using the Unicode case mapping rules.
4. Trim all whitespace characters on the left and right sides.
5. If the string is empty, return “11”.
6. If the first of the remaining characters is in the Unicode Cyrillic character class, return them as-is.
7. Right-pad the remaining characters to 2 characters long using “a”s.
8. If either of the first two characters is not in the Unicode Letter character class, return “11”.
9. Return the characters as-is.

Examples

Word	Prefix	Notes
“`test`”	“`te`”
“`a`”	“`aa`”
“`Èe`”	“`èe`”	The word is lowercased using Unicode rules (i.e. accented characters are included).
“`multiple words`”	“`mu`”
“`àççèñts`”	“`àç`”
“`à`”	“`àa`”
“`ç`”	“`ça`”
“”	“`11`”
“ ”	“`11`”	Space trimming is done after taking the first 2 characters.
“ `x`”	“`xa`”
“ `123`”	“`11`”
“`x 23`”	“`xa`”
“`д` ”	“`д`”	“д” is a Cyrillic character, and it’s the first character of the word (after trimming spaces), so it isn’t padded with “a”s.
“`дaд`”	“`дa`”
“`未未`”	“`未未`”
“`未`”	“`未a`”	Even though “未” is a multi-byte character, it is a single Unicode code point (characters are counted, not bytes).
“  `未`”	“`11`”	The word has two leading spaces. Space trimming is done after taking the first 2 characters, so nothing remains.
“ `未`”	“`未a`”	The word has one leading space. The multi-byte “未” character isn’t split up when taking the first 2 characters.
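The examples above can be checked mechanically. The following is a minimal sketch, not the dictutil code: prefixOf is a hypothetical name for a condensed restatement of the steps listed under “Prefix algorithm”, and the table repeats a subset of the rows above. The accented example is written with escapes so that the precomposed code points are unambiguous.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// prefixOf is a condensed restatement of the documented steps
// (hypothetical name, written for this example only).
func prefixOf(word string) string {
	// Cut at the first null byte, as for a C string.
	if i := strings.IndexByte(word, 0); i >= 0 {
		word = word[:i]
	}
	// Keep at most the first two code points, then lowercase them.
	r := []rune(word)
	if len(r) > 2 {
		r = r[:2]
	}
	for i, c := range r {
		r[i] = unicode.ToLower(c)
	}
	// Trim whitespace on both sides (after truncating, not before).
	r = []rune(strings.TrimFunc(string(r), unicode.IsSpace))
	if len(r) == 0 {
		return "11"
	}
	// A leading Cyrillic character is returned without padding or checks.
	if unicode.Is(unicode.Cyrillic, r[0]) {
		return string(r)
	}
	// Pad to two characters, then require both to be letters.
	for len(r) < 2 {
		r = append(r, 'a')
	}
	if !unicode.IsLetter(r[0]) || !unicode.IsLetter(r[1]) {
		return "11"
	}
	return string(r)
}

// examples repeats a subset of the rows from the table above.
var examples = []struct{ word, want string }{
	{"test", "te"},
	{"a", "aa"},
	{"\u00c8e", "\u00e8e"}, // "Èe" -> "èe"
	{"multiple words", "mu"},
	{"", "11"},
	{" x", "xa"},
	{" 123", "11"},
	{"x 23", "xa"},
	{"д ", "д"},
	{"дaд", "дa"},
	{"未", "未a"},
	{"  未", "11"}, // two leading spaces
	{" 未", "未a"}, // one leading space
}

func main() {
	for _, ex := range examples {
		got := prefixOf(ex.word)
		status := "ok"
		if got != ex.want {
			status = "MISMATCH"
		}
		fmt.Printf("%q -> %q (%s)\n", ex.word, got, status)
	}
}
```

Keeping the rows in a table makes it easy to add further words and compare the results against dictword-test on a device.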

Testing

You can test Kobo’s prefix algorithm directly using dictword-test.

If you just want an easy way to generate prefixes for words, use the dictutil prefix command.

Sample implementation

Here is the Go implementation used in dictutil:

import "unicode"

func WordPrefix(word string) string {
	pfx := []rune(word)

	for i, c := range pfx {
		if i >= 2 || c == '\x00' { // limit to 2 chars, also cut at null
			pfx = pfx[:i] // trim up to current char
			break
		}
		pfx[i] = unicode.ToLower(c) // this includes accented chars
	}

	for len(pfx) != 0 {
		if unicode.IsSpace(pfx[0]) {
			pfx = pfx[1:] // trim left space
		} else {
			break
		}
	}

	for len(pfx) != 0 {
		if unicode.IsSpace(pfx[len(pfx)-1]) {
			pfx = pfx[:len(pfx)-1] // trim right space
		} else {
			break
		}
	}

	if len(pfx) == 0 {
		return "11" // if empty, return "11"
	}

	if !unicode.Is(unicode.Cyrillic, pfx[0]) {
		for len(pfx) < 2 {
			pfx = append(pfx, 'a') // pad right with 'a's to 2 chars
		}
		if !unicode.IsLetter(pfx[0]) || !unicode.IsLetter(pfx[1]) {
			return "11" // if either of the first 2 chars is not a letter, return "11"
		}
	}

	return string(pfx)
}