Why Enumerable.Except() Might Not Work the Way You Might Expect
Enumerable.Except
is one of the useful extension methods within the System.Linq
namespace that shipped with .NET 3.5. According to the documentation, Enumerable.Except
"produces the set difference of two sequences".
The static System.Linq.Enumerable
class contains two overloads of the Except
method:
Enumerable.Except<TSource>(IEnumerable<TSource>, IEnumerable<TSource>)
Enumerable.Except<TSource>(IEnumerable<TSource>, IEnumerable<TSource>, IEqualityComparer<TSource>)
#Overload #1 — Using the Default Equality Comparer
The first overload uses the default equality comparer to compare values. Take a minute and think about what the following code snippet will output:
string[] fruits = { "apple", "apricot", "banana", "strawberry" };
string[] fruitsWithLongNames = { "strawberry" };
IEnumerable<string> fruitsWithShortNames = fruits.Except(fruitsWithLongNames);
Console.WriteLine("Using the default equality comparer:");
foreach (string fruit in fruitsWithShortNames)
{
Console.WriteLine(" - {0}", fruit);
}
Most likely, the output will match your expectations:
#Overload #2 — Using a Custom Equality Comparer
Now let's look at the overload accepting an IEqualityComparer<T>
. We pass in an instance of StringLengthEqualityComparer
, a custom IEqualityComparer<string>
that considers two strings equal if their character count equals. Again — take some time and reflect about what you expect the output to be:
string[] fruits = { "apple", "banana", "cherry", "strawberry" };
string[] fruitsWithLongNames = { "strawberry" };
var stringLengthComparer = new StringLengthEqualityComparer();
IEnumerable<string> fruitsWithShortNames = fruits
.Except(fruitsWithLongNames, stringLengthComparer);
Console.WriteLine("Using our custom equality comparer:");
foreach (string fruit in fruitsWithShortNames)
{
Console.WriteLine(" - {0}", fruit);
}
And here's the StringLengthEqualityComparer
:
class StringLengthEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return x.Length == y.Length;
}
public int GetHashCode(string obj)
{
return obj.Length;
}
}
Since our custom StringLengthEqualityComparer
compares the length of two strings, I would intuitively consider fruitsWithShortNames
to contain all fruits but those with the same string length as strawberry. Because fruits solely contains one element with a matching string length of 10 characters, namely strawberry itsself, I expected the snippet above to output apple, banana and cherry. I ran the program — and learned that I was wrong:
Besides strawberry, the element cherry got removed as well although its string length does not equal 10 (but 6). Why is that? To answer this question, we need to have a look at how the Except
extension method is implemented.
#Analyzing the Implementation of Enumerable.Except
Decompiling the framework code using .NET Reflector 7 shows the following implementation:
public static IEnumerable<TSource> Except<TSource>(this IEnumerable<TSource> first,
IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
if (first == null)
{
throw Error.ArgumentNull("first");
}
if (second == null)
{
throw Error.ArgumentNull("second");
}
return ExceptIterator<TSource>(first, second, comparer);
}
Here's the private ExceptIterator<TSource>
method:
private static IEnumerable<TSource> ExceptIterator<TSource>(IEnumerable<TSource> first,
IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
Set<TSource> iteratorVariable0 = new Set<TSource>(comparer);
foreach (TSource local in second)
{
iteratorVariable0.Add(local);
}
foreach (TSource iteratorVariable1 in first)
{
if (!iteratorVariable0.Add(iteratorVariable1))
{
continue;
}
yield return iteratorVariable1;
}
}
Update (May 6, 2014): Now that the .NET Framework is open source, we can take a look at the actual implementation of ExceptIterator
.
The ExceptIterator<TSource>
method makes use of the internal Set<TSource>
class representing a set, a collection of distinct objects. It is comparable to the HashSet<T>
class living in the System.Collections.Generic
namespace. The Set<TSource>.Add<TSource>
method returns true if the passed item was successfully added to the set and returns false if the item was already present; in that case, the item is not added. To determine if two items are considered equal, the Set<TSource>
class uses an IEqualityComparer<TSource>
. This is where our custom StringLengthEqualityComparer
comes into operation.
#Tracking Down the Cherry Issue
As we can see in the first 4 lines of ExceptIterator<TSource>
, the items of second
are added one by one to the set using the Set<TSource>.Add<TSource>
method that makes sure the set only contains distinct items. After that, each item of first
is added the same way.
Let's have a look at our example and find out why cherry is no part of the resulting collection:
second
only contains one item, strawberry, which gets added to the set.- The first element of
first
is apple. The set does not contain any item considered equal to apple using our customStringLengthEqualityComparer
. From this it follows that apple gets added to the set and is returned byyield return
. - The same goes for the next element, banana. Neither strawberry nor apple equals banana; thus, banana gets added to the set and is being returned. The set now contains the elements strawberry, apple and banana, the resulting collection contains apple and banana.
- The next element, cherry, is neither equal to strawberry nor apple; however, it equals banana in that its string length is 6, too. Since
iteratorVariable0.Add(iteratorVariable1)
returnsfalse
, the condition istrue
andcontinue
passes control to the next iteration of the enclosingforeach
loop.yield return
is not being called; hence, banana is not returned and therefore no part of the resulting collection. - The last element of
first
, strawberry, is already present in the set and is, because of that, no part of the resulting collection. Theforeach
loop terminates and results in apple and banana being the only elements of the resulting collection.
#Conclusion
ExceptIterator<TSource>
compares each element of first
to each element of second
and to each previous element of first
. The thing you need to keep in mind when using the Except
extension method is: If first
contains multiple elements considered equal, the resulting collection only contains the first of these elements.
If you do not want to remove elements of first
that do not equal any element of second
but any element of first
, you can use the Without
extension method (have a look at ExtraLINQ, a class library of mine providing additional extension methods for LINQ to Objects).
Similar Posts: