Class Utf8
java.lang.Object
com.google.common.base.Utf8
Low-level, high-performance utility methods related to the UTF-8
character encoding. UTF-8 is defined in section D92 of The Unicode Standard Core
Specification, Chapter 3.
The variant of UTF-8 implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1. One implication of this is that it rejects "non-shortest form" byte sequences, even though the JDK decoder may accept them.
- Since: 16.0
- Author: Martin Buchholz, Clément Roux
Method Summary
- static int encodedLength(CharSequence sequence): Returns the number of bytes in the UTF-8-encoded form of sequence.
- static boolean isWellFormed(byte[] bytes): Returns true if bytes is a well-formed UTF-8 byte sequence according to Unicode 6.0.
- static boolean isWellFormed(byte[] bytes, int off, int len): Returns whether the given byte array slice is a well-formed UTF-8 byte sequence, as defined by isWellFormed(byte[]).
Method Details
encodedLength
static int encodedLength(CharSequence sequence)
Returns the number of bytes in the UTF-8-encoded form of sequence. For a string, this method is equivalent to string.getBytes(UTF_8).length, but is more efficient in both time and space.
- Throws: IllegalArgumentException - if sequence contains ill-formed UTF-16 (unpaired surrogates)
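The byte count described above can be derived from the UTF-16 code units alone, without materializing a byte array. The following is a minimal JDK-only sketch of that idea, not Guava's actual implementation; the class name Utf8LengthSketch is hypothetical, and overflow of the int result for very long inputs is not handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; this is not part of Guava.
final class Utf8LengthSketch {
    /** Counts the UTF-8 bytes needed for sequence without allocating a byte array. */
    static int encodedLength(CharSequence sequence) {
        int bytes = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (c < 0x80) {
                bytes += 1; // ASCII
            } else if (c < 0x800) {
                bytes += 2; // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 < sequence.length()
                        && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                    bytes += 4; // supplementary code point
                    i++;        // skip the low surrogate
                } else {
                    throw new IllegalArgumentException("Unpaired surrogate at index " + i);
                }
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                bytes += 3; // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo \u20AC";
        // For well-formed input the two values printed here should agree.
        System.out.println(encodedLength(s));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Note that the equivalence with getBytes(UTF_8).length only applies to well-formed input: the JDK encoder substitutes a replacement byte for an unpaired surrogate, whereas this sketch, like the documented method, throws IllegalArgumentException.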
isWellFormed
static boolean isWellFormed(byte[] bytes)
Returns true if bytes is a well-formed UTF-8 byte sequence according to Unicode 6.0. Note that this is a stronger criterion than simply whether the bytes can be decoded. For example, some versions of the JDK decoder will accept "non-shortest form" byte sequences, but encoding never reproduces these. Such byte sequences are not considered well-formed.
This method returns true if and only if Arrays.equals(bytes, new String(bytes, UTF_8).getBytes(UTF_8)) does, but is more efficient in both time and space.
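The round-trip equivalence stated above can be written out directly with JDK classes. This is a reference sketch of the slow, allocation-heavy check that the documented method is said to be equivalent to, not the library's implementation; the class name Utf8RoundTripSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; this is not part of Guava.
final class Utf8RoundTripSketch {
    /** Decodes and re-encodes; well-formed input should survive unchanged. */
    static boolean isWellFormedByRoundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] good = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC0 0x80 is a "non-shortest form" encoding of U+0000.
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};
        System.out.println(isWellFormedByRoundTrip(good));
        System.out.println(isWellFormedByRoundTrip(overlong));
    }
}
```

The overlong pair fails the check whether the decoder replaces it with U+FFFD or accepts it as U+0000, because in both cases re-encoding produces different bytes from the input.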
isWellFormed
static boolean isWellFormed(byte[] bytes, int off, int len)
Returns whether the given byte array slice is a well-formed UTF-8 byte sequence, as defined by isWellFormed(byte[]). Note that this can be false even when isWellFormed(bytes) is true.
- Parameters:
  bytes - the input buffer
  off - the offset in the buffer of the first byte to read
  len - the number of bytes to read from the buffer
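A slice can be ill-formed even when the whole array is well-formed, because the slice boundaries may cut through a multi-byte sequence. The sketch below illustrates this with the same JDK round-trip check applied to a sub-range; it is an assumption-laden illustration rather than the library's code, and the class name Utf8SliceSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; this is not part of Guava.
final class Utf8SliceSketch {
    /** Round-trip check restricted to bytes[off .. off + len). */
    static boolean isWellFormedSlice(byte[] bytes, int off, int len) {
        // This String constructor throws IndexOutOfBoundsException for a bad range.
        String decoded = new String(bytes, off, len, StandardCharsets.UTF_8);
        byte[] slice = Arrays.copyOfRange(bytes, off, off + len);
        return Arrays.equals(slice, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // U+20AC (euro sign) encodes to the three bytes E2 82 AC.
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);
        System.out.println(isWellFormedSlice(euro, 0, 3)); // whole sequence
        System.out.println(isWellFormedSlice(euro, 0, 2)); // truncated sequence
        System.out.println(isWellFormedSlice(euro, 1, 2)); // starts mid-sequence
    }
}
```

When scanning a buffer that holds a partial read, this is the case to watch for: a slice ending mid-character reports ill-formed even though the remaining bytes may arrive later.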