Interface HashFunction
Definition
Unpacking this definition:
- block of data: the input for a hash function is always, in concept, an ordered byte
array. This hashing API accepts an arbitrary sequence of byte and multibyte values (via
Hasher
), but this is merely a convenience; these are always translated into raw byte sequences under the covers. - hash code: each hash function always yields hash codes of the same fixed bit length
(given by
bits()
). For example,Hashing.sha1()
produces a 160-bit number, whileHashing.murmur3_32()
yields only 32 bits. Because along
value is clearly insufficient to hold all hash code values, this API represents a hash code as an instance ofHashCode
. - pure function: the value produced must depend only on the input bytes, in the order
they appear. Input data is never modified.
HashFunction
instances should always be stateless, and therefore thread-safe. - collision-averse: while it can't be helped that a hash function will sometimes produce the same hash code for distinct inputs (a "collision"), every hash function strives to some degree to make this unlikely. (Without this condition, a function that always returns zero could be called a hash function. It is not.)
Summarizing the last two points: "equal yield equal always; unequal yield unequal often." This is the most important characteristic of all hash functions.
Desirable properties
A high-quality hash function strives for some subset of the following virtues:
- collision-resistant: while the definition above requires making at least some token attempt, one measure of the quality of a hash function is how well it succeeds at this goal. Important note: it may be easy to achieve the theoretical minimum collision rate when using completely random sample input. The true test of a hash function is how it performs on representative real-world data, which tends to contain many hidden patterns and clumps. The goal of a good hash function is to stamp these patterns out as thoroughly as possible.
- bit-dispersing: masking out any single bit from a hash code should yield only the expected twofold increase to all collision rates. Informally, the "information" in the hash code should be as evenly "spread out" through the hash code's bits as possible. The result is that, for example, when choosing a bucket in a hash table of size 2^8, any eight bits could be consistently used.
- cryptographic: certain hash functions such as
Hashing.sha512()
are designed to make it as infeasible as possible to reverse-engineer the input that produced a given hash code, or even to discover any two distinct inputs that yield the same result. These are called cryptographic hash functions. But, whenever it is learned that either of these feats has become computationally feasible, the function is deemed "broken" and should no longer be used for secure purposes. (This is the likely eventual fate of all cryptographic hashes.) - fast: perhaps self-explanatory, but often the most important consideration.
Providing input to a hash function
The primary way to provide the data that your hash function should act on is via a Hasher
. Obtain a new hasher from the hash function using newHasher()
, "push" the relevant
data into it using methods like Hasher.putBytes(byte[])
, and finally ask for the
HashCode
when finished using Hasher.hash()
. (See an example of
this.)
If all you want to hash is a single byte array, string or long
value, there are
convenient shortcut methods defined directly on HashFunction
to make this easier.
Hasher accepts primitive data types, but can also accept any Object of type T
provided
that you implement a Funnel
<T>
to specify how to "feed" data from that object
into the function. (See an example of this.)
Compatibility note: Throughout this API, multibyte values are always interpreted in
little-endian order. That is, hashing the byte array {0x01, 0x02, 0x03, 0x04}
is
equivalent to hashing the int
value 0x04030201
. If this isn't what you need,
methods such as Integer.reverseBytes(int)
and Ints.toByteArray(int)
will help.
Relationship to Object.hashCode()
Java's baked-in concept of hash codes is constrained to 32 bits, and provides no separation
between hash algorithms and the data they act on, so alternate hash algorithms can't be easily
substituted. Also, implementations of hashCode
tend to be poor-quality, in part because
they end up depending on other existing poor-quality hashCode
implementations,
including those in many JDK classes.
Object.hashCode
implementations tend to be very fast, but have weak collision
prevention and no expectation of bit dispersion. This leaves them perfectly suitable for
use in hash tables, because extra collisions cause only a slight performance hit, while poor bit
dispersion is easily corrected using a secondary hash function (which all reasonable hash table
implementations in Java use). For the many uses of hash functions beyond data structures,
however, Object.hashCode
almost always falls short -- hence this library.
- Since:
- 11.0
- Author:
- Kevin Bourrillion
-
Method Summary
Modifier and TypeMethodDescriptionint
bits()
Returns the number of bits (a multiple of 32) that each hash code produced by this hash function has.hashBytes
(byte[] input) Shortcut fornewHasher().putBytes(input).hash()
.hashBytes
(byte[] input, int off, int len) Shortcut fornewHasher().putBytes(input, off, len).hash()
.hashBytes
(ByteBuffer input) Shortcut fornewHasher().putBytes(input).hash()
.hashInt
(int input) Shortcut fornewHasher().putInt(input).hash()
; returns the hash code for the givenint
value, interpreted in little-endian byte order.hashLong
(long input) Shortcut fornewHasher().putLong(input).hash()
; returns the hash code for the givenlong
value, interpreted in little-endian byte order.hashObject
(T instance, Funnel<? super T> funnel) Shortcut fornewHasher().putObject(instance, funnel).hash()
.hashString
(CharSequence input, Charset charset) Shortcut fornewHasher().putString(input, charset).hash()
.hashUnencodedChars
(CharSequence input) Shortcut fornewHasher().putUnencodedChars(input).hash()
.Begins a new hash code computation by returning an initialized, statefulHasher
instance that is ready to receive data.newHasher
(int expectedInputSize) Begins a new hash code computation asnewHasher()
, but provides a hint of the expected size of the input (in bytes).
-
Method Details
-
newHasher
-
newHasher
Begins a new hash code computation asnewHasher()
, but provides a hint of the expected size of the input (in bytes). This is only important for non-streaming hash functions (hash functions that need to buffer their whole input before processing any of it). -
hashInt
-
hashLong
-
hashBytes
-
hashBytes
Shortcut fornewHasher().putBytes(input, off, len).hash()
. The implementation might perform better than its longhand equivalent, but should not perform worse.- Throws:
IndexOutOfBoundsException
- ifoff < 0
oroff + len > bytes.length
orlen < 0
-
hashBytes
Shortcut fornewHasher().putBytes(input).hash()
. The implementation might perform better than its longhand equivalent, but should not perform worse.- Since:
- 23.0
-
hashUnencodedChars
Shortcut fornewHasher().putUnencodedChars(input).hash()
. The implementation might perform better than its longhand equivalent, but should not perform worse. Note that no character encoding is performed; the low byte and high byte of eachchar
are hashed directly (in that order).Warning: This method will produce different output than most other languages do when running the same hash function on the equivalent input. For cross-language compatibility, use
hashString(java.lang.CharSequence, java.nio.charset.Charset)
, usually with a charset of UTF-8. For other use cases, usehashUnencodedChars
.- Since:
- 15.0 (since 11.0 as hashString(CharSequence)).
-
hashString
Shortcut fornewHasher().putString(input, charset).hash()
. Characters are encoded using the givenCharset
. The implementation might perform better than its longhand equivalent, but should not perform worse.Warning: This method, which reencodes the input before hashing it, is useful only for cross-language compatibility. For other use cases, prefer
hashUnencodedChars(java.lang.CharSequence)
, which is faster, produces the same output across Java releases, and hashes everychar
in the input, even if some are invalid. -
hashObject
Shortcut fornewHasher().putObject(instance, funnel).hash()
. The implementation might perform better than its longhand equivalent, but should not perform worse.- Since:
- 14.0
-
bits
int bits()Returns the number of bits (a multiple of 32) that each hash code produced by this hash function has.
-