LINQ set operations

LINQ allows sequences to be combined using the usual set operations of union, intersection and difference. Using these commands is pretty straight forward, so we’ll give an example which illustrates all of them.

Using our lists of Canada’s prime ministers and their terms of office, we first use a Join() to construct a customized list where each term of office is connected with the name of the PM who served it. We then build two subsets of this list by considering terms that started in the 20th century and terms that ended in the 20th century. We can then apply the set operators to these two lists and see what we get. The code is:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      Terms[] terms = Terms.GetTermsArray();
      var pmList41 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start);
      var start20 = pmList41
        .Where(pmTerm => pmTerm.start.Year > 1900 && pmTerm.start.Year < 2001);
      var end20 = pmList41
        .Where(pmTerm => pmTerm.end.Year > 1900 && pmTerm.end.Year < 2001);
      var startOrEnd20 = start20.Union(end20)
        .OrderBy(pmTerm => pmTerm.start);
      var startAndEnd20 = start20.Intersect(end20);
      var startExceptEnd20 = start20.Except(end20);
      var endExceptStart20 = end20.Except(start20);
      foreach (var pmTerm in start20)
      {
        Console.WriteLine("{0} {1}: {2: dd MMM yyyy} to {3: dd MMM yyyy}",
          pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
      }

First, we’ll see the two sets we start with. Terms starting in the 20th century are in the list start20:

Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Jean Chrétien:  04 Nov 1993 to  11 Dec 2003

Terms ending in the 20th century are in end20:

Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911
Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993

One of the properties of a mathematical set is that it contains no duplicates. If we take the union of start20 and end20, each of the terms of office must appear only once. The way Union() works is it enumerates the first sequence, comparing each element to see if it is distinct from the others that have already been enumerated. Only distinct elements are saved. Then the second sequence is enumerated and again each element is compared with those elements saved from the first set. Thus the result of the command  var startOrEnd20 = start20.Union(end20) is

Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Jean Chrétien:  04 Nov 1993 to  11 Dec 2003
Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911

Note that although all the elements are there, the one element that is in end20 but not in start 20 (Laurier’s term) appears at the end even though by date it came first. This is because Union() yields elements in the order in which it processes them, and since end20 was the second sequence processed, its unique entries appear at the end. We have added the OrderBy() clause in the code above to fix this, and this just results in placing Laurier’s term at the start.

It’s worth pausing here to reflect on what Union() and the other set commands do when they compare two elements in the sequence to test them for equality. In the absence of any external comparer class or implementation of the IEquatable<T> interface, the equality test is done by calling the built-in Equals() method from the Object class. For reference data types (that is, objects created from classes as opposed to value types such as int), Equals() compares the references of its two objects and returns ‘true’ only if both objects have the same reference, that is, if the two objects are actually the same object, in the sense that they occupy the same location in memory. Thus two different objects that happened to have the same values for all their data fields would not be considered equal by the default Equals() method.

The code we have written here thus depends implicitly on the fact that the set operators don’t make copies of the objects they are comparing; rather they simply reorder and classify the existing objects without modifying or copying them. If we wanted to make the code a bit more iron-clad, we should provide overrides of the Equals() and GetHashCode() methods, and/or implement the IEquatable<T> interface, as described in the earlier post. In some cases, this is difficult or impossible to do, as in our example where the data type being manipulated by the set operators is an anonymous type.

With that caution in mind, we can look at the results of the other set operators. The intersection of start20 and end20 gives startAndEnd20:

Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993

Laurier’s and Chrétien’s terms have been omitted since they extended outside the 20th century. In this case we didn’t need an OrderBy() since all the included terms were in the first sequence and were already ordered.

The set difference A – B produces the set that contains all elements in A that are not in B. Thus startExceptEnd contains terms that started in the 20th century but didn’t end there. The LINQ operator for set difference is Except(), and the results are:

Jean Chrétien:  04 Nov 1993 to  11 Dec 2003

Swapping start20 and end 20 produces terms that ended in the 20th century but didn’t start then:

Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911

There is a fourth operator that, although it’s not a set operator in the mathematical sense, is lumped in with them. This is Distinct(), which removes duplicates from a sequence. For example, suppose we join together end20 and start20 and then order the results by start date, as with the code:

      var endPlusStart20 = end20.Concat(start20)
        .OrderBy(pmTerm => pmTerm.start);

We’ve used the Concat() operator which glues its argument onto the sequence that calls it. Note that Concat() is not the same as Union(), since it doesn’t exclude duplicates from its output. The result of this code is (with a loop to print out the results, as usual):

Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911
Robert Borden:  10 Oct 1911 to  10 Jul 1920
Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Jean Chrétien:  04 Nov 1993 to  11 Dec 2003

All the terms except the first and last are duplicated. If we now feed the result of this into the Distinct() operator, it strips out the duplicates and returns the original list. The code is:

      var endPlusStart20 = end20.Concat(start20)
        .OrderBy(pmTerm => pmTerm.start)
        .Distinct();

Distinct() uses the same equality test as the other set operators, so in order for it work, the above list must contain the same object duplicated in each case rather two objects, one of which is a copy of the other. Again, if you want to remove duplicates where different objects have the same data field values, you’ll need to provide a customized equality tester in some form (choose one of: implement IEquatable<T>, override Equals() and GetHashCode() from object, or provide a separate class that implements IEqualityComparer<T> for your data type).

Finally, all four of these operators have a second version in which we can pass an IEqualityComparer<T> object as a second parameter, thus allowing a custom equality test. We’ve already seen how to do this, so we won’t repeat it here.

Advertisements
Post a comment or leave a trackback: Trackback URL.

Comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: