Class UnicodeEscaper
- Direct Known Subclasses:
ArrayBasedUnicodeEscaper, PercentEscaper
An Escaper that converts literal text into a format safe for inclusion in a particular context (such as an XML document). Typically (but not always), the inverse process of "unescaping" the text is performed automatically by the relevant parser.
For example, an XML escaper would convert the literal string "Foo<Bar>" into "Foo&lt;Bar&gt;" to prevent "<Bar>" from being confused with an XML tag. When the resulting XML document is parsed, the parser API will return this text as the original literal string "Foo<Bar>".
Note: This class is similar to CharEscaper but with one very important difference. A CharEscaper can only process Java UTF-16 characters in isolation and may not cope when it encounters surrogate pairs. This class facilitates the correct escaping of all Unicode characters.
There are important reasons, including potential security issues, to handle Unicode correctly. If you are considering implementing a new escaper, you should favor using UnicodeEscaper wherever possible.
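To make the surrogate-pair issue concrete, here is a minimal sketch that uses only `java.lang` and no Guava classes. The class name `SurrogateDemo` and its methods are made up for illustration. It shows that a supplementary character occupies two UTF-16 chars, so char-by-char processing sees two isolated surrogates where code-point processing sees one character.

```java
/** Hypothetical demo class; uses only java.lang, not Guava. */
final class SurrogateDemo {
  /** Counts the surrogate code units seen when the text is walked char by char. */
  static int surrogateCharCount(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); i++) {
      if (Character.isSurrogate(s.charAt(i))) {
        count++;
      }
    }
    return count;
  }

  /** Counts supplementary code points seen when the text is walked per code point. */
  static int supplementaryCodePointCount(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      if (Character.isSupplementaryCodePoint(cp)) {
        count++;
      }
      i += Character.charCount(cp); // advance by 1 or 2 chars
    }
    return count;
  }

  public static void main(String[] args) {
    // U+1F600 lies outside the Basic Multilingual Plane, so it needs a surrogate pair.
    String text = "a" + new String(Character.toChars(0x1F600)) + "b";
    System.out.println("chars: " + text.length());
    System.out.println("surrogate chars: " + surrogateCharCount(text));
    System.out.println("supplementary code points: " + supplementaryCodePointCount(text));
  }
}
```

An escaper that inspects each char in isolation would have to decide what to do with each half of the pair separately, which is the situation the note above warns about.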
A UnicodeEscaper instance is required to be stateless, and safe when used concurrently by multiple threads.
Popular escapers are defined as constants in classes like HtmlEscapers and XmlEscapers. To create your own escapers, extend this class and implement the escape(int) method.
- Since:
- 15.0
- Author:
- David Beaumont
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
protected static int | codePointAt(CharSequence seq, int index, int end) | Returns the Unicode code point of the character at the given index.
protected abstract char @Nullable [] | escape(int cp) | Returns the escaped form of the given Unicode code point, or null if this code point does not need to be escaped.
String | escape(String string) | Returns the escaped form of a given literal string.
protected final String | escapeSlow(String s, int index) | Returns the escaped form of a given literal string, starting at the given index.
protected int | nextEscapeIndex(CharSequence csq, int start, int end) | Scans a sub-sequence of characters from a given CharSequence, returning the index of the next character that requires escaping.
Methods inherited from class com.google.common.escape.Escaper
asFunction
-
Constructor Details
-
UnicodeEscaper
protected UnicodeEscaper()
Constructor for use by subclasses.
-
-
Method Details
-
escape
Returns the escaped form of the given Unicode code point, or null if this code point does not need to be escaped. When called as part of an escaping operation, the given code point is guaranteed to be in the range 0 <= cp <= Character.MAX_CODE_POINT.
If an empty array is returned, this effectively strips the input character from the resulting text.
If the character does not need to be escaped, this method should return null, rather than an array containing the character representation of the code point. This enables the escaping algorithm to perform more efficiently.
If the implementation of this method cannot correctly handle a particular code point, then it should either throw an appropriate runtime exception or return a suitable replacement character. It must never silently discard invalid input, as this may constitute a security risk.
- Parameters:
cp - the Unicode code point to escape if necessary
- Returns:
- the replacement characters, or null if no escaping was needed
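The three possible return values (null, an empty array, or replacement characters) can be illustrated with a small stand-in. This is a sketch written against the contract described above, not Guava's implementation: the class `EscapeContractSketch`, the driver method `escapeAll`, and the choice of which characters to escape or strip are all assumptions made for the example, and the driver assumes well-formed UTF-16 input.

```java
/** Hypothetical stand-in for the escape(int) contract; not Guava code. */
final class EscapeContractSketch {
  /**
   * Mirrors the documented contract: null means "leave unchanged",
   * an empty array means "strip", anything else is the replacement.
   */
  static char[] escape(int cp) {
    switch (cp) {
      case ' ':
        return "%20".toCharArray();
      case '<':
        return "%3C".toCharArray();
      case '>':
        return "%3E".toCharArray();
      case 0x200B:
        // Zero-width space: stripped. An illustrative choice, not a recommendation.
        return new char[0];
      default:
        return null; // no escaping needed
    }
  }

  /** Simplified driver loop; assumes the input is well-formed UTF-16. */
  static String escapeAll(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      char[] replacement = escape(cp);
      if (replacement == null) {
        out.appendCodePoint(cp); // copy the original code point through
      } else {
        out.append(replacement); // empty array appends nothing
      }
      i += Character.charCount(cp);
    }
    return out.toString();
  }
}
```

Returning null for the common case lets the caller skip allocation and copying for text that needs no changes, which is the efficiency point made in the description.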
-
escape
Returns the escaped form of a given literal string.
If you are escaping input in arbitrary successive chunks, then it is not generally safe to use this method. If an input string ends with an unmatched high surrogate character, then this method will throw IllegalArgumentException. You should ensure your input is valid UTF-16 before calling this method.
Note: When implementing an escaper, it is a good idea to override this method for efficiency by inlining the implementation of nextEscapeIndex(CharSequence, int, int) directly. Doing this for PercentEscaper more than doubled the performance for unescaped strings (as measured by CharEscapersBenchmark).
- Specified by:
escape in class Escaper
- Parameters:
string - the literal string to be escaped
- Returns:
- the escaped form of string
- Throws:
NullPointerException - if string is null
IllegalArgumentException - if invalid surrogate characters are encountered
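One way to avoid handing this method a chunk that ends in an unmatched high surrogate is to adjust the chunk boundary before splitting. The helper below is a sketch using only `java.lang`. The class `ChunkBoundarySketch` and the method `safeChunkEnd` are hypothetical names and not part of any library.

```java
/** Hypothetical helper for splitting text into chunks without cutting a surrogate pair. */
final class ChunkBoundarySketch {
  /**
   * Returns proposedEnd, or proposedEnd - 1 if cutting at proposedEnd would
   * separate a high surrogate from the low surrogate that follows it.
   * The caller must handle the case where the adjusted chunk is empty.
   */
  static int safeChunkEnd(CharSequence s, int proposedEnd) {
    if (proposedEnd > 0
        && proposedEnd < s.length()
        && Character.isHighSurrogate(s.charAt(proposedEnd - 1))
        && Character.isLowSurrogate(s.charAt(proposedEnd))) {
      return proposedEnd - 1; // move the pair into the next chunk
    }
    return proposedEnd;
  }
}
```

This only keeps well-formed pairs together. It does not repair input that already contains isolated surrogates, which would still need to be validated separately.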
-
nextEscapeIndex
Scans a sub-sequence of characters from a given CharSequence, returning the index of the next character that requires escaping.
Note: When implementing an escaper, it is a good idea to override this method for efficiency. The base class implementation determines successive Unicode code points and invokes escape(int) for each of them. If the semantics of your escaper are such that code points in the supplementary range are either all escaped or all unescaped, this method can be implemented more efficiently using CharSequence.charAt(int).
Note, however, that if your escaper does not escape characters in the supplementary range, you should either continue to validate the correctness of any surrogate characters encountered or provide a clear warning to users that your escaper does not validate its input.
See PercentEscaper for an example.
- Parameters:
csq - a sequence of characters
start - the index of the first character to be scanned
end - the index immediately after the last character to be scanned
- Throws:
IllegalArgumentException - if the scanned sub-sequence of csq contains invalid surrogate pairs
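The scanning idea can be sketched as a loop over code points that stops at the first one needing escaping. This is an illustration and not the base class implementation. The class `NextEscapeIndexSketch` is hypothetical, the `needsEscaping` predicate stands in for the `escape(int) != null` test, the sketch assumes well-formed UTF-16 with no pair straddling `end`, and returning `end` when nothing needs escaping is a choice made for this example.

```java
import java.util.function.IntPredicate;

/** Hypothetical scanning sketch; assumes well-formed UTF-16 input. */
final class NextEscapeIndexSketch {
  /**
   * Returns the index of the first code point in [start, end) for which
   * needsEscaping is true, or end if there is none.
   */
  static int nextEscapeIndex(
      CharSequence csq, int start, int end, IntPredicate needsEscaping) {
    int index = start;
    while (index < end) {
      // Unlike the method documented here, Character.codePointAt does not
      // reject invalid surrogates; it returns them as-is.
      int cp = Character.codePointAt(csq, index);
      if (needsEscaping.test(cp)) {
        return index;
      }
      index += Character.charCount(cp); // skip 1 or 2 chars
    }
    return end;
  }
}
```

For example, `NextEscapeIndexSketch.nextEscapeIndex("abc<d", 0, 5, cp -> cp == '<')` stops at the `<` character.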
-
escapeSlow
Returns the escaped form of a given literal string, starting at the given index. This method is called by the escape(String) method when it discovers that escaping is required. It is protected to allow subclasses to override the fastpath escaping function to inline their escaping test. See CharEscaperBuilder for an example usage.
This method is not reentrant and may only be invoked by the top level escape(String) method.
- Parameters:
s - the literal string to be escaped
index - the index to start escaping from
- Returns:
- the escaped form of s
- Throws:
NullPointerException - if s is null
IllegalArgumentException - if invalid surrogate characters are encountered
-
codePointAt
Returns the Unicode code point of the character at the given index.
Unlike Character.codePointAt(CharSequence, int) or String.codePointAt(int), this method will never fail silently when encountering an invalid surrogate pair.
The behaviour of this method is as follows:
- If index >= end, IndexOutOfBoundsException is thrown.
- If the character at the specified index is not a surrogate, it is returned.
- If the first character was a high surrogate value, then an attempt is made to read the next character.
  - If the end of the sequence was reached, the negated value of the trailing high surrogate is returned.
  - If the next character was a valid low surrogate, the code point value of the high/low surrogate pair is returned.
  - If the next character was not a low surrogate value, then IllegalArgumentException is thrown.
- If the first character was a low surrogate value, IllegalArgumentException is thrown.
- Parameters:
seq - the sequence of characters from which to decode the code point
index - the index of the first character to decode
end - the index beyond the last valid character to decode
- Returns:
- the Unicode code point for the given index or the negated value of the trailing high surrogate character at the end of the sequence
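The behaviour listed for codePointAt can be restated as code. The following is a re-implementation sketch derived from that list using only `java.lang`. It is not copied from the library source, and the class name `CodePointAtSketch` and the exception messages are assumptions.

```java
/** Hypothetical re-implementation of the documented codePointAt rules. */
final class CodePointAtSketch {
  static int codePointAt(CharSequence seq, int index, int end) {
    if (index >= end) {
      throw new IndexOutOfBoundsException("index " + index + " is not below end " + end);
    }
    char c1 = seq.charAt(index);
    if (!Character.isSurrogate(c1)) {
      return c1; // ordinary BMP character
    }
    if (Character.isLowSurrogate(c1)) {
      throw new IllegalArgumentException("unexpected low surrogate at index " + index);
    }
    // c1 is a high surrogate: try to read its partner.
    int next = index + 1;
    if (next == end) {
      return -c1; // trailing high surrogate, reported as a negative value
    }
    char c2 = seq.charAt(next);
    if (Character.isLowSurrogate(c2)) {
      return Character.toCodePoint(c1, c2);
    }
    throw new IllegalArgumentException("expected low surrogate at index " + next);
  }
}
```

A negative result lets a caller that processes input in chunks detect an incomplete pair at the end of a chunk without catching an exception.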