Guidelines and Rules for Overriding hashCode()

Preface

  "The code is more what you’d call guidelines than actual rules" – truer words were never spoken. It’s important when writing code to understand what are vague "guidelines" that should be followed but can be broken or fudged, and what are crisp "rules" that have serious negative consequences for correctness and robustness. I often get questions about the rules and guidelines for hashCode, so I thought I might summarize them here.

Rule: equal items have equal hashes

  If two objects are equal then they must have the same hash code; or, equivalently, if two objects have different hash codes then they must be unequal.

  The reasoning here is straightforward. Suppose two objects were equal but had different hash codes. If you put the first object into a Java Set or Map, it is placed into a bucket chosen from its hash code. If you then ask the collection whether the second object is a member, it may search a different bucket and therefore not find it, even though the two objects are equal.
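  To see that failure concretely, here is a minimal sketch (the BadPoint class and demo are hypothetical) of a class that overrides equals() but not hashCode(); the HashSet lookup goes to the wrong bucket and misses an object that is, by equals(), already present:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical class: overrides equals() but NOT hashCode(),
    // violating the rule that equal objects must have equal hash codes.
    final class BadPoint {
        final int x, y;

        BadPoint(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof BadPoint)) return false;
            BadPoint p = (BadPoint) o;
            return x == p.x && y == p.y;
        }
        // hashCode() is inherited from Object (identity-based), so two
        // equal points will almost certainly land in different buckets.
    }

    public class BrokenLookupDemo {
        public static void main(String[] args) {
            Set<BadPoint> set = new HashSet<>();
            set.add(new BadPoint(1, 2));
            // Very likely prints false: the probe object hashes to a different
            // bucket, so the set never even calls equals() on the stored match.
            System.out.println(set.contains(new BadPoint(1, 2)));
        }
    }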

Guideline: the integer returned by hashCode() should never change

  Ideally, the hash code of a mutable object should be computed from only fields which cannot mutate, and therefore the hash value of an object is the same for its entire lifetime.

  However, this is only an ideal-situation guideline; the actual rule is:

Rule: the integer returned by hashCode() must never change while the object is contained in a data structure that depends on the hash code remaining stable

  It is permissible, though dangerous, to make an object whose hash code value can mutate as the fields of the object mutate. If you have such an object and you put it in a hash table then the code which mutates the object and the code which maintains the hash table are required to have some agreed-upon protocol that ensures that the object is not mutated while it is in the hash table. What that protocol looks like is up to you.
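  A minimal sketch of that hazard, assuming a hypothetical MutableKey class whose hashCode() is derived from a mutable field:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical mutable key: its hash code changes when its field changes.
    final class MutableKey {
        int value;

        MutableKey(int value) { this.value = value; }

        @Override
        public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).value == value;
        }

        @Override
        public int hashCode() { return Integer.hashCode(value); }
    }

    public class MutationDemo {
        public static void main(String[] args) {
            Set<MutableKey> set = new HashSet<>();
            MutableKey key = new MutableKey(1);
            set.add(key);            // stored in the bucket for hash(1)

            key.value = 2;           // mutated while still inside the set

            // false: the lookup now uses hash(2) and searches the wrong bucket
            System.out.println(set.contains(key));
            // false: right bucket, but the stored object no longer equals the probe
            System.out.println(set.contains(new MutableKey(1)));
        }
    }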

Guideline: the implementation of hashCode() must be extremely fast

  The whole point of hashCode() is to optimize a lookup operation; if computing the hash code is slower than simply comparing the items one at a time, then you haven’t made a performance gain.

  I classify this as a “guideline” and not a “rule” because it is so vague. How slow is too slow? That’s up to you to decide.
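  One common way to keep hashCode() cheap for an immutable object is to compute the value lazily and cache it, much as java.lang.String does. The class below is only a sketch under that assumption, not a prescription:

    // Hypothetical immutable class that caches its hash code after the first call.
    final class CachedHashPoint {
        private final double latitude;
        private final double longitude;
        private int hash; // 0 means "not computed yet" (fine as long as 0 is a rare result)

        CachedHashPoint(double latitude, double longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CachedHashPoint)) return false;
            CachedHashPoint p = (CachedHashPoint) o;
            return Double.compare(latitude, p.latitude) == 0
                && Double.compare(longitude, p.longitude) == 0;
        }

        @Override
        public int hashCode() {
            int h = hash;
            if (h == 0) {
                h = 31 * Double.hashCode(latitude) + Double.hashCode(longitude);
                hash = h;   // benign data race: any recomputation yields the same value
            }
            return h;
        }
    }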

Guideline: the distribution of hash codes must be “random”

  By a “random distribution” I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be “clustered”; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values, then only a few buckets end up being used, and performance suffers as those buckets get really big.

  Again, I list this as a guideline rather than a rule because it is somewhat vague, not because it is unimportant. It’s very important. But since good distribution and good speed can work against each other, it’s important to find a good balance between the two.
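  To make the clustering concern concrete, here is a hedged sketch (the Location class is hypothetical) contrasting a hash that collapses nearby coordinates into a handful of values with one that mixes the full bit patterns of both fields:

    import java.util.Objects;

    // Hypothetical location class illustrating hash distribution quality.
    final class Location {
        final double latitude;
        final double longitude;

        Location(double latitude, double longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }

        // Poorly distributed: every location in the same city truncates to the
        // same pair of integers, so clustered data produces clustered hashes.
        int clusteredHashCode() {
            return (int) latitude + (int) longitude;
        }

        // Better distributed: mixes the full bit patterns of both coordinates,
        // so nearby-but-distinct locations usually get very different hashes.
        @Override
        public int hashCode() {
            return Objects.hash(latitude, longitude);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Location)) return false;
            Location other = (Location) o;
            return Double.compare(latitude, other.latitude) == 0
                && Double.compare(longitude, other.longitude) == 0;
        }
    }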

  

  hashCode() is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:

  • It does not provide a unique key for an object; probability of collision is extremely high.
  • It is not of cryptographic strength, so do not use it as part of a digital signature or as a password equivalent.
  • It does not necessarily have the error-detection properties needed for checksums.

and so on.

Original article: https://www.cnblogs.com/junyuhuang/p/4033275.html