The other day a colleague of mine was trying to get distinct values out of a list of doubles. He wanted two doubles to be considered equal if the difference between them was less than 0.1.
C# does not offer a
Distinct overload that takes a
Func<T, T, bool> to handle the comparison. It does, however, offer one which takes an
IEqualityComparer. He decided to make use of this overload and implemented something that looked roughly like the following.
However instead of returning the expected output.
His implementation returned the following.
The answer to why this happened is hidden in a two-line comment in the Examples section of the
Enumerable.Distinct MSDN page.
If Equals() returns true for a pair of objects then GetHashCode() must return the same value for these objects.
A quick check of the Reference Source entry for the method confirms that in order for two elements to be considered equal,
Equals must return true and
GetHashCode must return the same value for both elements.
I found this behaviour to be pretty surprising. Although it is a well-known convention that equal objects should have the same hash code, when one is supplying a custom equality comparer, it is probably safe to say that the objects would not typically be considered to be equal to one another.
Perhaps the most disturbing facet of this implementation, however, is that it is notoriously easy to do the following in order to get the desired result.
While this implementation will work, it relies on the fact that
Distinct will always use
GetHashCode to make the comparison. As we cannot guarantee that this will remain the case in future versions of the .NET Framework, I feel that this type of implementation is a bit too hacky for my liking, and should be avoided.
So, how can we get the result that we’re after without compromising good design principles? Two simple solutions come to mind. The first thing that we could do is simply brute-force the problem.
The obvious issue with this method is that it will execute in O(n²) time.
An alternative solution is to sort the list first.
This will execute in the same O as our sort, which we can generally take to mean O(n*log(n)).
The one important difference between these two solutions, aside from the execution time, is that
DistinctBruteForce will return distinct elements in the order that they appear in the source collection, whereas
DistinctSorted will add them in order from smallest to largest. We can see this difference if we modify our method to take a different collection, and to use all three comparison methods (hacky
The console output obtained by running this is as follows.
As you can see, the
Sorted implementation includes
1.1, whereas the other two only include the middle value,
1.01. While in many cases I would expect the
Sorted result to be the more desirable of the two, knowing that these implementations give different results could no doubt prove useful when having to deal with a problem such as this in the future.