Class BOCSU

java.lang.Object
com.ibm.icu.impl.coll.BOCSU

public class BOCSU extends Object

Binary Ordered Compression Scheme for Unicode

Users are strongly encouraged to read the ICU paper on BOCU before attempting to use this class.

BOCU is used to compress unicode text into a stream of unsigned bytes. For many kinds of text the compression compares favorably to UTF-8, and for some kinds of text (such as CJK) it does better. The resulting bytes will compare in the same order as the original code points. The byte stream does not contain the values 0, 1, or 2.

One example of a use of BOCU is in com.ibm.icu.text.Collator#getCollationKey(String) for a RuleBasedCollator object with collation strength IDENTICAL. The result CollationKey will consist of the collation order of the source string followed by the BOCU result of the source string.

Unlike a UTF encoding, BOCU-compressed text is not suitable for random access.

Method: Slope Detection
Remember the previous code point (initial 0). For each code point in the string, encode the difference with the previous one. Similar to a UTF, the length of the byte sequence is encoded in the lead bytes. Unlike a UTF, the trail byte values may overlap with lead/single byte values. The signedness of the difference must be encoded as the most significant part.

We encode differences with few bytes if their absolute values are small. For correct ordering, we must treat the entire value range -10ffff..+10ffff in ascending order, which forbids encoding the sign and the absolute value separately. Instead, we split the lead byte range in the middle and encode non-negative values going up and negative values going down.

For very small absolute values, the difference is added to a middle byte value for single-byte encoded differences. For somewhat larger absolute values, the difference is divided by the number of byte values available, the modulo is used for one trail byte, and the remainder is added to a lead byte avoiding the single-byte range. For large absolute values, the difference is similarly encoded in three bytes. (Syn Wee, I need examples here.)

BOCU does not use byte values 0, 1, or 2, but uses all other byte values for lead and single bytes, so that the middle range of single bytes is as large as possible.

Note that the lead byte ranges overlap some, but that the sequences as a whole are well ordered. I.e., even if the lead byte is the same for sequences of different lengths, the trail bytes establish correct order. It would be possible to encode slightly larger ranges for each length (>1) by subtracting the lower bound of the range. However, that would also slow down the calculation. (Syn Wee, need an example).

For the actual string encoding, an optimization moves the previous code point value to the middle of its Unicode script block to minimize the differences in same-script text runs. (Syn Wee, need an example.)

Since:
release 2.2, May 3rd 2002
  • Field Details

    • SLOPE_MIN_

      private static final int SLOPE_MIN_
      Do not use byte values 0, 1, 2 because they are separators in sort keys.
      See Also:
    • SLOPE_MAX_

      private static final int SLOPE_MAX_
      See Also:
    • SLOPE_MIDDLE_

      private static final int SLOPE_MIDDLE_
      See Also:
    • SLOPE_TAIL_COUNT_

      private static final int SLOPE_TAIL_COUNT_
      See Also:
    • SLOPE_MAX_BYTES_

      private static final int SLOPE_MAX_BYTES_
      See Also:
    • SLOPE_SINGLE_

      private static final int SLOPE_SINGLE_
      Number of lead bytes: 1 middle byte for 0 2*80=160 single bytes for !=0 2*42=84 for double-byte values 2*3=6 for 3-byte values 2*1=2 for 4-byte values The sum must be invalid input: '<'=SLOPE_TAIL_COUNT. Why these numbers? - There should be >=128 single-byte values to cover 128-blocks with small scripts. - There should be >=20902 single/double-byte values to cover Unihan. - It helps CJK Extension B some if there are 3-byte values that cover the distance between them and Unihan. This also helps to jump among distant places in the BMP. - Four-byte values are necessary to cover the rest of Unicode. Symmetrical lead byte counts are for convenience. With an equal distribution of even and odd differences there is also no advantage to asymmetrical lead byte counts.
      See Also:
    • SLOPE_LEAD_2_

      private static final int SLOPE_LEAD_2_
      See Also:
    • SLOPE_LEAD_3_

      private static final int SLOPE_LEAD_3_
      See Also:
    • SLOPE_REACH_POS_1_

      private static final int SLOPE_REACH_POS_1_
      The difference value range for single-byters.
      See Also:
    • SLOPE_REACH_NEG_1_

      private static final int SLOPE_REACH_NEG_1_
      See Also:
    • SLOPE_REACH_POS_2_

      private static final int SLOPE_REACH_POS_2_
      The difference value range for double-byters.
      See Also:
    • SLOPE_REACH_NEG_2_

      private static final int SLOPE_REACH_NEG_2_
      See Also:
    • SLOPE_REACH_POS_3_

      private static final int SLOPE_REACH_POS_3_
      The difference value range for 3-byters.
      See Also:
    • SLOPE_REACH_NEG_3_

      private static final int SLOPE_REACH_NEG_3_
      See Also:
    • SLOPE_START_POS_2_

      private static final int SLOPE_START_POS_2_
      The lead byte start values.
      See Also:
    • SLOPE_START_POS_3_

      private static final int SLOPE_START_POS_3_
      See Also:
    • SLOPE_START_NEG_2_

      private static final int SLOPE_START_NEG_2_
      See Also:
    • SLOPE_START_NEG_3_

      private static final int SLOPE_START_NEG_3_
      See Also:
  • Constructor Details

    • BOCSU

      private BOCSU()
      Constructor private to prevent initialization
  • Method Details

    • writeIdenticalLevelRun

      public static int writeIdenticalLevelRun(int prev, CharSequence s, int i, int length, ByteArrayWrapper sink)
      Encode the code points of a string as a sequence of byte-encoded differences (slope detection), preserving lexical order.

      Optimize the difference-taking for runs of Unicode text within small scripts:

      Most small scripts are allocated within aligned 128-blocks of Unicode code points. Lexical order is preserved if "prev" is always moved into the middle of such a block.

      Additionally, "prev" is moved from anywhere in the Unihan area into the middle of that area. Note that the identical-level run in a sort key is generated from NFD text - there are never Hangul characters included.

    • ensureAppendCapacity

      private static void ensureAppendCapacity(ByteArrayWrapper sink, int minCapacity, int desiredCapacity)
    • getNegDivMod

      private static final long getNegDivMod(int number, int factor)
      Integer division and modulo with negative numerators yields negative modulo results and quotients that are one more than what we need here.
      Parameters:
      number - which operations are to be performed on
      factor - the factor to use for division
      Returns:
      (result of division) invalid input: '<'invalid input: '<' 32 | modulo
    • writeDiff

      private static final int writeDiff(int diff, byte[] buffer, int offset)
      Encode one difference value -0x10ffff..+0x10ffff in 1..4 bytes, preserving lexical order
      Parameters:
      diff -
      buffer - byte buffer to append to
      offset - to the byte buffer to start appending
      Returns:
      end offset where the appending stops