Monthly Archives: May 2012

LINQ set operations

LINQ allows sequences to be combined using the usual set operations of union, intersection and difference. Using these commands is pretty straightforward, so we’ll give an example which illustrates all of them.

Using our lists of Canada’s prime ministers and their terms of office, we first use a Join() to construct a customized list where each term of office is connected with the name of the PM who served it. We then build two subsets of this list by considering terms that started in the 20th century and terms that ended in the 20th century. We can then apply the set operators to these two lists and see what we get. The code is:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      Terms[] terms = Terms.GetTermsArray();
      var pmList41 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start);
      var start20 = pmList41
        .Where(pmTerm => pmTerm.start.Year > 1900 && pmTerm.start.Year < 2001);
      var end20 = pmList41
        .Where(pmTerm => pmTerm.end.Year > 1900 && pmTerm.end.Year < 2001);
      var startOrEnd20 = start20.Union(end20)
        .OrderBy(pmTerm => pmTerm.start);
      var startAndEnd20 = start20.Intersect(end20);
      var startExceptEnd20 = start20.Except(end20);
      var endExceptStart20 = end20.Except(start20);
      foreach (var pmTerm in start20)
      {
        Console.WriteLine("{0} {1}: {2: dd MMM yyyy} to {3: dd MMM yyyy}",
          pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
      }

First, we’ll see the two sets we start with. Terms starting in the 20th century are in the list start20:

Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Jean Chrétien:  04 Nov 1993 to  11 Dec 2003

Terms ending in the 20th century are in end20:

Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911
Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993

One of the properties of a mathematical set is that it contains no duplicates. If we take the union of start20 and end20, each of the terms of office must appear only once. Union() works by enumerating the first sequence and yielding each element that has not already been yielded; it then enumerates the second sequence in the same way, skipping any element that matches one already yielded from either sequence. Thus the result of start20.Union(end20), before the OrderBy() is applied, is:

Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Jean Chrétien:  04 Nov 1993 to  11 Dec 2003
Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911

Note that although all the elements are there, the one element that is in end20 but not in start20 (Laurier’s term) appears at the end even though by date it came first. This is because Union() yields elements in the order in which it processes them, and since end20 was the second sequence processed, its unique entries appear at the end. We have added the OrderBy() clause in the code above to fix this, and this just results in placing Laurier’s term at the start.
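The ordering behaviour of the set operators is easier to see on a tiny sequence of integers. The following is a minimal sketch rather than part of the prime minister example; the class name SetOpsSketch, the helper Show() and the two arrays are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SetOpsSketch
{
    // Illustrative data only; not taken from the prime minister lists.
    public static readonly int[] First = { 3, 1, 2, 3 };
    public static readonly int[] Second = { 2, 0, 4 };

    // Joins a sequence into a comma-separated string for printing.
    public static string Show(IEnumerable<int> seq)
    {
        return string.Join(",", seq);
    }

    static void Main()
    {
        // Union: distinct elements of First in order, then new ones from Second.
        Console.WriteLine(Show(First.Union(Second)));               // 3,1,2,0,4
        // Sorting afterwards puts the elements from Second in their place.
        Console.WriteLine(Show(First.Union(Second).OrderBy(n => n))); // 0,1,2,3,4
        // Intersect: elements of First that also occur in Second.
        Console.WriteLine(Show(First.Intersect(Second)));           // 2
        // Except: elements of one sequence that are absent from the other.
        Console.WriteLine(Show(First.Except(Second)));              // 3,1
        Console.WriteLine(Show(Second.Except(First)));              // 0,4
    }
}
```

The duplicate 3 in First is dropped by every operator, which is the "no duplicates" property at work.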

It’s worth pausing here to reflect on what Union() and the other set commands do when they compare two elements in the sequence to test them for equality. In the absence of an external comparer class, the operators use EqualityComparer<T>.Default, which calls the type’s IEquatable<T> implementation if it has one and otherwise the Equals() method inherited from the Object class. For reference data types (that is, objects created from classes as opposed to value types such as int) that don’t override it, Equals() compares the references of its two objects and returns ‘true’ only if both objects have the same reference, that is, if the two objects are actually the same object, in the sense that they occupy the same location in memory. Thus two different objects that happened to have the same values for all their data fields would not be considered equal by the default Equals() method.

The code we have written here works for a different reason: the elements being compared are instances of an anonymous type, and the C# compiler generates Equals() and GetHashCode() overrides for anonymous types that compare the values of all the properties. This matters because LINQ queries use deferred execution: start20 and end20 each re-run the Join() whenever they are enumerated, so the two sequences hold different objects that merely carry the same data, and a reference comparison would treat every term as distinct. If the elements were instances of a named class of our own, we would need to provide overrides of the Equals() and GetHashCode() methods, and/or implement the IEquatable<T> interface, as described in the earlier post, or else pass an IEqualityComparer<T> to the set operators.
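As a quick check of how the default equality test treats different kinds of objects, the sketch below runs Distinct() over two objects carrying identical data, once for a plain class and once for an anonymous type. PlainTerm and EqualitySketch are hypothetical names used only here. The plain class inherits reference equality from Object, while anonymous types get compiler-generated Equals() and GetHashCode() methods that compare property values.

```csharp
using System;
using System.Linq;

// Hypothetical class with no Equals()/GetHashCode() overrides.
class PlainTerm
{
    public int Id;
    public int StartYear;
}

static class EqualitySketch
{
    // Two separate PlainTerm objects with the same data: reference
    // equality sees them as different, so both survive Distinct().
    public static int DistinctPlainCount()
    {
        var terms = new[]
        {
            new PlainTerm { Id = 8, StartYear = 1911 },
            new PlainTerm { Id = 8, StartYear = 1911 }
        };
        return terms.Distinct().Count();
    }

    // Two separate anonymous-type objects with the same data: the
    // compiler-generated Equals() compares properties, so one is dropped.
    public static int DistinctAnonymousCount()
    {
        var terms = new[]
        {
            new { Id = 8, StartYear = 1911 },
            new { Id = 8, StartYear = 1911 }
        };
        return terms.Distinct().Count();
    }

    static void Main()
    {
        Console.WriteLine(DistinctPlainCount());      // 2
        Console.WriteLine(DistinctAnonymousCount());  // 1
    }
}
```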

With that caution in mind, we can look at the results of the other set operators. The intersection of start20 and end20 gives startAndEnd20:

Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993

Laurier’s and Chrétien’s terms have been omitted since they extended outside the 20th century. In this case we didn’t need an OrderBy() since all the included terms were in the first sequence and were already ordered.

The set difference A – B produces the set that contains all elements in A that are not in B. Thus startExceptEnd20 contains terms that started in the 20th century but didn’t end there. The LINQ operator for set difference is Except(), and the results are:

Jean Chrétien:  04 Nov 1993 to  11 Dec 2003

Swapping start20 and end20 produces terms that ended in the 20th century but didn’t start then:

Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911

There is a fourth operator that, although it’s not a set operator in the mathematical sense, is lumped in with them. This is Distinct(), which removes duplicates from a sequence. For example, suppose we join together end20 and start20 and then order the results by start date, as with the code:

      var endPlusStart20 = end20.Concat(start20)
        .OrderBy(pmTerm => pmTerm.start);

We’ve used the Concat() operator which glues its argument onto the sequence that calls it. Note that Concat() is not the same as Union(), since it doesn’t exclude duplicates from its output. The result of this code is (with a loop to print out the results, as usual):

Wilfrid Laurier:  11 Jul 1896 to  06 Oct 1911
Robert Borden:  10 Oct 1911 to  10 Jul 1920
Robert Borden:  10 Oct 1911 to  10 Jul 1920
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
Arthur Meighen:  10 Jul 1920 to  29 Dec 1921
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
William Mackenzie King:  29 Dec 1921 to  28 Jun 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
Arthur Meighen:  29 Jun 1926 to  25 Sep 1926
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
William Mackenzie King:  25 Sep 1926 to  07 Aug 1930
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
Richard Bennett:  07 Aug 1930 to  23 Oct 1935
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
William Mackenzie King:  23 Oct 1935 to  15 Nov 1948
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
Louis St. Laurent:  15 Nov 1948 to  21 Jun 1957
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
John Diefenbaker:  21 Jun 1957 to  22 Apr 1963
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Lester Pearson:  22 Apr 1963 to  20 Apr 1968
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Pierre Trudeau:  20 Apr 1968 to  03 Jun 1979
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Joe Clark:  04 Jun 1979 to  02 Mar 1980
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
Pierre Trudeau:  03 Mar 1980 to  29 Jun 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
John Turner:  30 Jun 1984 to  16 Sep 1984
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Brian Mulroney:  17 Sep 1984 to  24 Jun 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Kim Campbell:  25 Jun 1993 to  03 Nov 1993
Jean Chrétien:  04 Nov 1993 to  11 Dec 2003

All the terms except the first and last are duplicated. If we now feed the result of this into the Distinct() operator, it strips out the duplicates, leaving each term exactly once. The code is:

      var endPlusStart20 = end20.Concat(start20)
        .OrderBy(pmTerm => pmTerm.start)
        .Distinct();

Distinct() uses the same equality test as the other set operators. It works here because the elements are anonymous-type objects, which compare equal when all their properties hold equal values, even though each duplicate is a separate object. For a class of your own that doesn’t define equality, only the very same object appearing twice counts as a duplicate, so if you want to remove duplicates where different objects have the same data field values, you’ll need to provide a customized equality tester in some form (choose one of: implement IEquatable<T>, override Equals() and GetHashCode() from object, or provide a separate class that implements IEqualityComparer<T> for your data type).

Finally, all four of these operators have a second version in which we can pass an IEqualityComparer<T> object as a second parameter, thus allowing a custom equality test. We’ve already seen how to do this, so we won’t repeat it here.
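For a quick reminder of what the two-argument versions look like without writing a comparer class, the sketch below passes the framework's built-in StringComparer.OrdinalIgnoreCase (which implements IEqualityComparer<string>) to each operator. The class name, the Show() helper and the sample arrays are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ComparerOverloadSketch
{
    // Illustrative data: the same names in differing letter case.
    public static readonly string[] A = { "King", "Meighen", "KING" };
    public static readonly string[] B = { "meighen", "Bennett" };

    public static string Show(IEnumerable<string> seq)
    {
        return string.Join(",", seq);
    }

    static void Main()
    {
        // A built-in comparer that treats strings differing only in case as equal.
        IEqualityComparer<string> ignoreCase = StringComparer.OrdinalIgnoreCase;

        Console.WriteLine(Show(A.Distinct(ignoreCase)));      // King,Meighen
        Console.WriteLine(Show(A.Union(B, ignoreCase)));      // King,Meighen,Bennett
        Console.WriteLine(Show(A.Intersect(B, ignoreCase)));  // Meighen
        Console.WriteLine(Show(A.Except(B, ignoreCase)));     // King
    }
}
```

In each case the element that survives is the first one encountered, so "King" is kept and "KING" is treated as its duplicate.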


IEquatable and LINQ

We’ve seen how to define a custom equality tester for use in the LINQ GroupBy() command, allowing us to specify when two elements of a sequence should be placed in the same group. There’s a deeper issue here which merits some examination. The documentation for GroupBy() says that if no custom equality tester is specified, or if null is passed in for such a tester, then the default equality comparer ‘Default’ is used to compare keys. What does that mean?

This ‘Default’ is a static property of the EqualityComparer<T> generic type which provides a way of building equality testing into the class T rather than writing a separate class which implements the IEqualityComparer<T> interface. Default checks whether T implements the IEquatable<T> interface and, if so, uses that implementation; otherwise it falls back on the Equals() and GetHashCode() methods that T inherits from (or overrides in) Object. IEquatable<T> requires us to write a single method Equals(T). As you might guess, this method provides an equality test between the calling object and the argument to Equals(T).

As an example, we could rewrite our Terms class (containing a list of terms of office of Canadian prime ministers) that we’ve been using for LINQ demos so that it implements IEquatable<Terms>. We get:

  class Terms : IEquatable<Terms>
  {
    public int id;
    public DateTime start, end;

    public static ArrayList GetTermsArrayList()
    {
      ArrayList terms = new ArrayList();

      terms.Add(new Terms { id = 1, start = DateTime.Parse("1867/7/1"), end = DateTime.Parse("1873/11/5") });
      terms.Add(new Terms { id = 1, start = DateTime.Parse("1878/10/17"), end = DateTime.Parse("1891/6/6") });
      terms.Add(new Terms { id = 2, start = DateTime.Parse("1873/11/7"), end = DateTime.Parse("1878/10/8") });
      terms.Add(new Terms { id = 3, start = DateTime.Parse("1891/6/16"), end = DateTime.Parse("1892/11/24") });
      terms.Add(new Terms { id = 4, start = DateTime.Parse("1892/12/5"), end = DateTime.Parse("1894/12/12") });
      terms.Add(new Terms { id = 5, start = DateTime.Parse("1894/12/21"), end = DateTime.Parse("1896/4/27") });
      terms.Add(new Terms { id = 6, start = DateTime.Parse("1896/5/1"), end = DateTime.Parse("1896/7/8") });
      terms.Add(new Terms { id = 7, start = DateTime.Parse("1896/7/11"), end = DateTime.Parse("1911/10/6") });
      terms.Add(new Terms { id = 8, start = DateTime.Parse("1911/10/10"), end = DateTime.Parse("1920/7/10") });
      terms.Add(new Terms { id = 9, start = DateTime.Parse("1920/7/10"), end = DateTime.Parse("1921/12/29") });
      terms.Add(new Terms { id = 9, start = DateTime.Parse("1926/6/29"), end = DateTime.Parse("1926/9/25") });
      terms.Add(new Terms { id = 10, start = DateTime.Parse("1921/12/29"), end = DateTime.Parse("1926/6/28") });
      terms.Add(new Terms { id = 10, start = DateTime.Parse("1926/9/25"), end = DateTime.Parse("1930/8/7") });
      terms.Add(new Terms { id = 10, start = DateTime.Parse("1935/10/23"), end = DateTime.Parse("1948/11/15") });
      terms.Add(new Terms { id = 11, start = DateTime.Parse("1930/8/7"), end = DateTime.Parse("1935/10/23") });
      terms.Add(new Terms { id = 12, start = DateTime.Parse("1948/11/15"), end = DateTime.Parse("1957/6/21") });
      terms.Add(new Terms { id = 13, start = DateTime.Parse("1957/6/21"), end = DateTime.Parse("1963/4/22") });
      terms.Add(new Terms { id = 14, start = DateTime.Parse("1963/4/22"), end = DateTime.Parse("1968/4/20") });
      terms.Add(new Terms { id = 15, start = DateTime.Parse("1968/4/20"), end = DateTime.Parse("1979/6/3") });
      terms.Add(new Terms { id = 15, start = DateTime.Parse("1980/3/3"), end = DateTime.Parse("1984/6/29") });
      terms.Add(new Terms { id = 16, start = DateTime.Parse("1979/6/4"), end = DateTime.Parse("1980/3/2") });
      terms.Add(new Terms { id = 17, start = DateTime.Parse("1984/6/30"), end = DateTime.Parse("1984/9/16") });
      terms.Add(new Terms { id = 18, start = DateTime.Parse("1984/9/17"), end = DateTime.Parse("1993/6/24") });
      terms.Add(new Terms { id = 19, start = DateTime.Parse("1993/6/25"), end = DateTime.Parse("1993/11/3") });
      terms.Add(new Terms { id = 20, start = DateTime.Parse("1993/11/4"), end = DateTime.Parse("2003/12/11") });
      terms.Add(new Terms { id = 21, start = DateTime.Parse("2003/12/12"), end = DateTime.Parse("2006/2/5") });
      terms.Add(new Terms { id = 22, start = DateTime.Parse("2006/2/6"), end = DateTime.Now });

      return terms;
    }

    public override string ToString()
    {
      return id + ". " + start.ToString("ddd dd MMM yyyy") + " - " + end.ToString("ddd dd MMM yyyy");
    }

    public static Terms[] GetTermsArray()
    {
      return (Terms[])GetTermsArrayList().ToArray(typeof(Terms));
    }

    public bool Equals(Terms other)
    {
      // Two terms are equal when they belong to the same prime minister.
      return other != null && this.id == other.id;
    }
  }

In this example, we define two terms to be equal if their id numbers (representing the prime minister who held that office) are equal.

With this definition, a call to EqualityComparer<Terms>.Default.Equals(term1, term2) will call this Equals() method using term1 as the source object and passing term2 as the argument.

Now we might think that the following code will group the terms according to their id:

      Terms[] terms = Terms.GetTermsArray();
      var pmList40 = terms
        .GroupBy(term => term);
      foreach (var group in pmList40)
      {
        Console.WriteLine("Group:");
        foreach (var term in group)
        {
          Console.WriteLine(term);
        }
      }

This code contains the simplest call to GroupBy(), specifying that the Terms object ‘term’ itself is to be used as the key. If everything works, since we haven’t specified an IEqualityComparer object, the Default option should be called, resulting in the terms being grouped according to their id.

However, it doesn’t work; every term is placed in a separate group. What went wrong?

You might remember that there is another Equals() method associated with the Object class that is the base of all classes in C#. In practice, some methods will call our new Equals() method (defined as an implementation of the IEquatable<T> interface) while others will call the method inherited from Object. So to make sure that equality is always tested the same way, we should provide an overridden version of the Object Equals() method that does the same test as our IEquatable<T> version. That is, we should add the following method to our Terms class:

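    public override bool Equals(object obj)
    {
      // A null or non-Terms argument is simply unequal, rather than throwing.
      Terms other = obj as Terms;
      return other != null && this.id == other.id;
    }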

Now we try our GroupBy() call again. However, it still doesn’t work, and by placing breakpoints in the debugger we can see that neither of these Equals() methods is getting called. What’s going on? How can GroupBy() be doing any grouping if it never does any comparisons between keys?

In fact, what GroupBy() does for each element is first calculate its hash code, and only if two hash codes are equal does it then call Equals() to do a comparison. The Object class also provides a GetHashCode() method which returns the int hash code for any given object. Thus to provide a correct and complete implementation of IEquatable<T>, we need also to override the GetHashCode() method so that it returns the same hash code whenever the Equals() methods say that two elements are equal. Since we’re defining equality based on the id number, we can add this method to our class:

    public override int GetHashCode()
    {
      return id.GetHashCode();
    }

Now if we run the GroupBy() again, we find that it works: terms with the same id are placed in the same group. Also, if we trace the code with the debugger, we find that every term results in a call to GetHashCode() but a call to Equals() (the IEquatable version) is made only if two hash codes are the same. In this case, the overridden version of the Object Equals() method is never called so we didn’t really need it, but it’s a good idea to have it there anyway since other code could call it and we want our equality tests to be consistent.

In summary, then, the proper way to implement IEquatable<T> is to provide its Equals() method, and override both Equals() and GetHashCode() from the Object class, ensuring that both Equals() methods make the same test and that GetHashCode() returns the same code for any two elements that are defined as ‘equal’.
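Putting the three pieces together, a minimal sketch of the pattern might look like the following. TermSketch is a hypothetical cut-down stand-in for the Terms class, with an invented StartYear field and only four sample terms, so that the example is self-contained.

```csharp
using System;
using System.Linq;

// Hypothetical cut-down version of the Terms class, for illustration only.
class TermSketch : IEquatable<TermSketch>
{
    public int Id;
    public int StartYear;

    // IEquatable<T> version: the test used by EqualityComparer<T>.Default.
    public bool Equals(TermSketch other)
    {
        return other != null && Id == other.Id;
    }

    // Object version: delegates so that both tests always agree.
    public override bool Equals(object obj)
    {
        return Equals(obj as TermSketch);
    }

    // Equal objects must produce equal hash codes.
    public override int GetHashCode()
    {
        return Id.GetHashCode();
    }
}

static class EquatableSketch
{
    // Groups four terms using the object itself as the key.
    public static int GroupCount()
    {
        var terms = new[]
        {
            new TermSketch { Id = 9,  StartYear = 1920 },
            new TermSketch { Id = 10, StartYear = 1921 },
            new TermSketch { Id = 9,  StartYear = 1926 },
            new TermSketch { Id = 10, StartYear = 1926 }
        };
        return terms.GroupBy(term => term).Count();
    }

    static void Main()
    {
        Console.WriteLine(GroupCount());  // 2
    }
}
```

Removing the GetHashCode() override from this sketch would typically put every term back in its own group, which is the behaviour described earlier.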

LINQ Groups: Equality testing and result selection

In the last post we saw how to use LINQ GroupBy() for relatively simple grouping. GroupBy() is capable of a couple of more advanced features which are worth looking at.

Custom equality tests

First, we saw before that the key used by GroupBy() to do the grouping could be calculated from the data fields in the objects in the sequence being grouped, rather than being just one of the bare data fields itself. For simple cases, it’s easiest to just place this calculation directly in the call to GroupBy() as we did earlier. However, sometimes the grouping key gets a bit more complex. LINQ allows us to define our own equality test for use in determining how keys are compared. As an example, suppose we wanted to group the terms of office of Canada’s prime ministers according to how many years each of these terms spanned. That is, we’d like all terms less than a year in one group, then those between 1 and 2 years and so on. Since a Terms object contains only the start and end dates of the term as DateTime objects, we need to calculate the difference to get a TimeSpan object and then declare that two such objects that lie within the same span of years are ‘equal’.

In order to create an equality test, we need to write a custom class that implements the IEqualityComparer<T> interface, where T is the data type being compared. This interface has two methods, Equals(T, T) and GetHashCode(T). The Equals() method returns a bool which is true if its two arguments are defined as equal and false if not. The GetHashCode() method is needed since grouping is done by storing sequence elements in a hash table, so we need to make sure that the hash codes for two elements that are defined as ‘equal’ are the same.

For our example here, we can use the following class:

  class TermEqualityComparer : IEqualityComparer<TimeSpan>
  {
    public bool Equals(TimeSpan x, TimeSpan y)
    {
      return x.Days / 365 == y.Days / 365;
    }

    public int GetHashCode(TimeSpan obj)
    {
      return (obj.Days / 365).GetHashCode();
    }
  }

Our equality test divides the number of days in each TimeSpan object by 365 (OK, we’re ignoring leap years) using integer division. If the two TimeSpans are equal in this measure then they represent terms that lie in the same one-year span.

For the hash code, we just use the same division and return the built-in hash code for the quotient. This ensures that all TimeSpans within the same year get the same hash code.

With this class, we can now write a GroupBy() call that does what we want:

      TermEqualityComparer termEqualityComparer = new TermEqualityComparer();
      var pmList37 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.end - pmTerm.start, termEqualityComparer)
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList37)
      {
        int years = pmGroup.Key.Days / 365;
        Console.WriteLine("{0} to {1} years:", years, years + 1);
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
            pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
        }
      }

We declare a TermEqualityComparer object first. The LINQ code is much the same as in our earlier example in the last post, up to the GroupBy() call. This time it has two arguments. The first is the quantity to be used as the key, as usual, which in this case is the difference between the start and end of the term. The second argument is the equality testing object, so GroupBy() computes the key for each sequence element using the first argument, and then uses the GetHashCode() and Equals() methods in the equality tester on those keys to sort the elements into groups.

You might wonder about the last OrderBy() call, which sorts the groups based on their keys. The actual TimeSpans for each element within a group may all be different, but according to our equality test, all TimeSpans within a single group are ‘equal’, so it doesn’t matter which one is used in the OrderBy().

Where the actual values of the keys do matter, though, is when we try to use them in some other calculation. In our example, we want to print out the groups of terms, with each labelled by its key. However, if there is more than one element in a group, the TimeSpan for each element will probably be different, and since only one key is saved for each group, we can’t be sure which element in the group has that key (in fact, it seems to be the first element assigned to the group that has its key used for the group). Thus it’s usually best to use keys only in the same way that the original GroupBy() call did. In our example, we divide pmGroup.Key.Days by 365 to get the year span represented by that key, since we know that value does apply to all elements within that group.

The result of the code is:

0 to 1 years:
  Charles Tupper: 01 May 1896 to 08 Jul 1896
  Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
  Joe Clark: 04 Jun 1979 to 02 Mar 1980
  John Turner: 30 Jun 1984 to 16 Sep 1984
  Kim Campbell: 25 Jun 1993 to 03 Nov 1993
1 to 2 years:
  John Abbott: 16 Jun 1891 to 24 Nov 1892
  Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
  Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
2 to 3 years:
  John Thompson: 05 Dec 1892 to 12 Dec 1894
  Paul Martin: 12 Dec 2003 to 05 Feb 2006
3 to 4 years:
  William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
4 to 5 years:
  Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
  William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
  Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
5 to 6 years:
  Richard Bennett: 07 Aug 1930 to 23 Oct 1935
  John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
  Lester Pearson: 22 Apr 1963 to 20 Apr 1968
6 to 7 years:
  John Macdonald: 01 Jul 1867 to 05 Nov 1873
  Stephen Harper: 06 Feb 2006 to 25 May 2012
8 to 9 years:
  Robert Borden: 10 Oct 1911 to 10 Jul 1920
  Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
  Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
10 to 11 years:
  Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
11 to 12 years:
  Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
12 to 13 years:
  John Macdonald: 17 Oct 1878 to 06 Jun 1891
13 to 14 years:
  William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
15 to 16 years:
  Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911

Custom return types

A GroupBy() call also allows you to customize which data fields should be returned, in much the same way as Join() did. For example, if we want to group the terms into the decades in which they started (as we did in the last post), we can have GroupBy() return only the last name and start date for each term. The code is:

      var pmList38 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.start.Year / 10,
          pmTerm => new
          {
            last = pmTerm.last,
            start = pmTerm.start
          })
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList38)
      {
        Console.WriteLine("{0}s:", (pmGroup.Key * 10));
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0}: {1:dd MMM yyyy}",
            pmTerm.last, pmTerm.start);
        }
      }

In this case, the second argument of GroupBy() is a function that takes a single parameter (pmTerm here) which is used to construct the returned object to be placed in the group. Here, each object in a group will be an anonymous type with two fields: last and start. We use these two fields in the printout, and we get:

1860s:
  Macdonald: 01 Jul 1867
1870s:
  Mackenzie: 07 Nov 1873
  Macdonald: 17 Oct 1878
1890s:
  Abbott: 16 Jun 1891
  Thompson: 05 Dec 1892
  Bowell: 21 Dec 1894
  Tupper: 01 May 1896
  Laurier: 11 Jul 1896
1910s:
  Borden: 10 Oct 1911
1920s:
  Meighen: 10 Jul 1920
  Mackenzie King: 29 Dec 1921
  Meighen: 29 Jun 1926
  Mackenzie King: 25 Sep 1926
1930s:
  Bennett: 07 Aug 1930
  Mackenzie King: 23 Oct 1935
1940s:
  St. Laurent: 15 Nov 1948
1950s:
  Diefenbaker: 21 Jun 1957
1960s:
  Pearson: 22 Apr 1963
  Trudeau: 20 Apr 1968
1970s:
  Clark: 04 Jun 1979
1980s:
  Trudeau: 03 Mar 1980
  Turner: 30 Jun 1984
  Mulroney: 17 Sep 1984
1990s:
  Campbell: 25 Jun 1993
  Chrétien: 04 Nov 1993
2000s:
  Martin: 12 Dec 2003
  Harper: 06 Feb 2006

Result selection

Finally, we can ask GroupBy() to return a single object for each group, rather than the entire group. For example, suppose we want a count of the number of terms that started in each decade, together with the earliest term in each decade. We can do that as follows:

      var pmList39 = terms
        .OrderBy(term => term.start)
        .GroupBy(term => term.start.Year / 10,
          (year, termGroup) => new
          {
            decade = year * 10,
            number = termGroup.Count(),
            earliest = termGroup.Min(term => term.start)
          });
      Console.WriteLine("*** pmList39");
      foreach (var term in pmList39)
      {
        Console.WriteLine("{0}s:\n  {1} terms\n  Earliest: {2: dd MMM yyyy}",
          term.decade, term.number, term.earliest);
      }

In this case, the second argument in GroupBy() is a function which takes two parameters. The first parameter is the key for a given group, and the second parameter is the group itself. We can use this information to construct a summary object for that group. In this example, we create an anonymous object with 3 fields: the decade (calculated from the key ‘year’), the number of terms in that decade (by applying the Count() method to the group), and the earliest start date in that decade (by applying the Min() method with a selector that picks out each term’s start date).

This version of GroupBy() produces a list of single objects rather than a list of groups, so only a single loop is needed to iterate through it. The results are:

1860s:
  1 terms
  Earliest:  01 Jul 1867
1870s:
  2 terms
  Earliest:  07 Nov 1873
1890s:
  5 terms
  Earliest:  16 Jun 1891
1910s:
  1 terms
  Earliest:  10 Oct 1911
1920s:
  4 terms
  Earliest:  10 Jul 1920
1930s:
  2 terms
  Earliest:  07 Aug 1930
1940s:
  1 terms
  Earliest:  15 Nov 1948
1950s:
  1 terms
  Earliest:  21 Jun 1957
1960s:
  2 terms
  Earliest:  22 Apr 1963
1970s:
  1 terms
  Earliest:  04 Jun 1979
1980s:
  3 terms
  Earliest:  03 Mar 1980
1990s:
  2 terms
  Earliest:  25 Jun 1993
2000s:
  2 terms
  Earliest:  12 Dec 2003

Note the differences between these calls to GroupBy(). The first argument is always the key to be used in the grouping. If the second argument is an IEqualityComparer object, it is used to compare keys. If this argument is a function with a single parameter, it is used to select fields from each object placed in the group. Finally, if the argument is a function with two parameters, it is used to produce a summary object for each group.

These 3 features can be used in any combination (which is why there are 8 prototypes for GroupBy()). Whichever features you want to include, remember that they are placed in the order source.GroupBy(keySelector, elementSelector, resultSelector, equalityComparer).
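As a rough illustration of that argument order, the sketch below uses all three features in a single call. The class name, the list of names and the summary string format are invented for the example, and the comparer is the framework's built-in StringComparer.OrdinalIgnoreCase rather than a custom class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupByAllSketch
{
    // Groups names by first letter (ignoring case) and summarizes each group
    // as "LETTER:count:longest name length".
    public static List<string> Summaries()
    {
        string[] names = { "Macdonald", "mackenzie", "Abbott", "Meighen", "thompson", "Tupper" };
        return names
            .GroupBy(
                name => name.Substring(0, 1),        // keySelector: first letter
                name => name.Length,                 // elementSelector: keep only the length
                (key, lengths) =>                    // resultSelector: one string per group
                    key.ToUpperInvariant() + ":" + lengths.Count() + ":" + lengths.Max(),
                StringComparer.OrdinalIgnoreCase)    // equalityComparer: "m" equals "M"
            .ToList();
    }

    static void Main()
    {
        foreach (string summary in Summaries())
        {
            Console.WriteLine(summary);
        }
        // M:3:9
        // A:1:6
        // T:2:8
    }
}
```

The key handed to the result selector is the one from the first element placed in each group, which is why the sketch converts it to upper case before printing.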

LINQ Groups: Basic Groups

We’ve seen in the last post that LINQ’s Join() operator allows its results to be grouped according to the value of the key used to match pairs from two lists. LINQ offers a much more general grouping facility with the GroupBy() operator. There are actually 8 varieties of GroupBy(), so we’ll have a look at the features that comprise them. In this post, we’ll look at the simplest form of GroupBy() and consider the more advanced features in the next post.

All GroupBy() operators take a single sequence as input (as opposed to Join(), which takes two), and they all require you to specify a key value which is used for dividing the elements of the sequence into groups. The most basic form of GroupBy() does just that, with no frills. As an example, suppose we want a list of Canada’s prime ministers divided into groups according to the first letter of their last names (as might be found in an index). We can do that as follows:

      var pmList33a = primeMinisters.GroupBy(pm => pm.lastName[0]);
      foreach (var pmGroup in pmList33a)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

The single argument of GroupBy() is a function that calculates the key from a sequence element. Since our input sequence primeMinisters contains objects of class PrimeMinisters, we select the lastName field (a string) and take its first character (a char).

A GroupBy() operation returns a sequence of groups rather than a sequence of individual elements. The prototype of this simplest version of GroupBy() is:

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
	this IEnumerable<TSource> source,
	Func<TSource, TKey> keySelector
)

From the return type, we see that GroupBy() returns an IEnumerable sequence, where each element is of type IGrouping<TKey, TSource>. That is, each group consists of a list of objects of type TSource accompanied by a single key value of type TKey. In our example here, TSource is PrimeMinisters and TKey is char.

Because the object returned by GroupBy() is a sequence of groups, if we want to access the individual elements of each group we need a nested loop; the outer loop iterates over the groups and the inner loop iterates over the elements within each group. Note that we’ve used the Key property of the group in printing the output; the Key property is present in all IGrouping objects and contains the key value for that particular group. Thus the code above produces this output:

Group M:
  John Macdonald
  Alexander Mackenzie
  Arthur Meighen
  William Mackenzie King
  Brian Mulroney
  Paul Martin
Group A:
  John Abbott
Group T:
  John Thompson
  Charles Tupper
  Pierre Trudeau
  John Turner
Group B:
  Mackenzie Bowell
  Robert Borden
  Richard Bennett
Group L:
  Wilfrid Laurier
Group S:
  Louis St. Laurent
Group D:
  John Diefenbaker
Group P:
  Lester Pearson
Group C:
  Joe Clark
  Kim Campbell
  Jean Chrétien
Group H:
  Stephen Harper

The groups are created in the order they appear in the original sequence (primeMinisters), and the elements within each group are added in the order in which they appear in this sequence as well. That’s why the M group comes first, and the elements within each group are not in alphabetical order.

This simplest form of GroupBy() can also be written as a query expression, so the pmList33a code would look like this:

      var pmList33 = from pm in primeMinisters
                     group pm by pm.lastName[0];
      foreach (var pmGroup in pmList33)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

The ‘from’ clause specifies the input sequence, and the key selector is given following the ‘by’ keyword.
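The ordering behaviour described earlier (groups appear in the order their keys are first encountered) can be checked on a small list. This is only a sketch; the names array and the GroupOrderDemo class are invented for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupOrderDemo
{
    public static List<string> GroupSizes()
    {
        // Hypothetical input: 'M' is seen first, so the M group comes first.
        string[] lastNames =
            { "Macdonald", "Mackenzie", "Abbott", "Thompson", "Meighen" };

        return lastNames
            .GroupBy(name => name[0])
            // Each group is itself a sequence, so Count() works on it directly.
            .Select(g => g.Key + ":" + g.Count())
            .ToList();
    }

    static void Main()
    {
        foreach (string line in GroupSizes())
        {
            Console.WriteLine(line);
        }
    }
}
```

Since each IGrouping is also an IEnumerable of its elements, any other LINQ operator could be applied to a group in place of Count().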

If we want to order the output so that both the groups and the contents of each group are in alphabetical order, we can do this by adding a couple of orderby clauses. Here’s the result in both syntaxes:

      var pmList34 = from pm in primeMinisters
                     orderby pm.lastName
                     group pm by pm.lastName[0] into pmGroups
                     orderby pmGroups.Key
                     select pmGroups;
      foreach (var pmGroup in pmList34)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

      var pmList34a = primeMinisters
        .OrderBy(pm => pm.lastName)
        .GroupBy(pm => pm.lastName[0])
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList34a)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

The standard query operator form (the second one) is the most straightforward: we first order the overall primeMinisters list, then group it as before, and finally order the output of the GroupBy() by doing an OrderBy() on the keys of the groups.

In the query expression form, we can’t follow a group clause directly by an orderby. We must first save the results of the group operation in a variable specified by the ‘into’ keyword (the same technique was used in a group join in the last post). Thus here we save the result of the group in pmGroups, and then apply orderby to that. The final ‘select pmGroups’ clause selects the group so the final output is a sequence of groups as before. The output from both forms of the code is:

Group A:
  John Abbott
Group B:
  Richard Bennett
  Robert Borden
  Mackenzie Bowell
Group C:
  Kim Campbell
  Jean Chrétien
  Joe Clark
Group D:
  John Diefenbaker
Group H:
  Stephen Harper
Group L:
  Wilfrid Laurier
Group M:
  John Macdonald
  Alexander Mackenzie
  William Mackenzie King
  Paul Martin
  Arthur Meighen
  Brian Mulroney
Group P:
  Lester Pearson
Group S:
  Louis St. Laurent
Group T:
  John Thompson
  Pierre Trudeau
  Charles Tupper
  John Turner

The key used for grouping need not be a simple data field; it can be a calculated value. For example, if we wanted to group the prime ministers’ terms of office into the decades in which they started, we could do something like this:

      var pmList36 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
                      {
                        first = pm.firstName,
                        last = pm.lastName,
                        start = term.start,
                        end = term.end
                      })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.start.Year / 10)
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList36)
      {
        Console.WriteLine("{0}s:", (pmGroup.Key * 10));
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
            pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
        }
      }

The Join() call connects the list containing the PMs’ names with the list containing their terms. We order this list by the start date of each term, then pass the result into a GroupBy(). Here the key is the year of the start date divided by 10 (using integer division, which throws away the remainder). All terms starting in the same decade will be in the same group. The output is:

1860s:
  John Macdonald: 01 Jul 1867 to 05 Nov 1873
1870s:
  Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
  John Macdonald: 17 Oct 1878 to 06 Jun 1891
1890s:
  John Abbott: 16 Jun 1891 to 24 Nov 1892
  John Thompson: 05 Dec 1892 to 12 Dec 1894
  Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
  Charles Tupper: 01 May 1896 to 08 Jul 1896
  Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911
1910s:
  Robert Borden: 10 Oct 1911 to 10 Jul 1920
1920s:
  Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
  William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
  Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
  William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
1930s:
  Richard Bennett: 07 Aug 1930 to 23 Oct 1935
  William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
1940s:
  Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
1950s:
  John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
1960s:
  Lester Pearson: 22 Apr 1963 to 20 Apr 1968
  Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
1970s:
  Joe Clark: 04 Jun 1979 to 02 Mar 1980
1980s:
  Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
  John Turner: 30 Jun 1984 to 16 Sep 1984
  Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
1990s:
  Kim Campbell: 25 Jun 1993 to 03 Nov 1993
  Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
2000s:
  Paul Martin: 12 Dec 2003 to 05 Feb 2006
  Stephen Harper: 06 Feb 2006 to 25 May 2012

A ‘select’ normally ends a query expression, so we can’t follow the ‘select’ that creates the output of the ‘join’ directly with a ‘group’ clause. One way round this is a query continuation, where ‘into’ is added after the ‘select’ in the same way that we added it after ‘group’ earlier. It’s also easy enough to do the job using two separate commands, and we get:

      var pmList35a = from pm in primeMinisters
                      join term in terms on pm.id equals term.id
                      orderby term.start
                      select new
                      {
                        first = pm.firstName,
                        last = pm.lastName,
                        start = term.start,
                        end = term.end
                      };
      var pmList35b = from pmTerm in pmList35a
                      group pmTerm by pmTerm.start.Year / 10 into pmGroups
                      orderby pmGroups.Key
                      select pmGroups;
      foreach (var pmGroup in pmList35b)
      {
        Console.WriteLine("{0}s:", (pmGroup.Key * 10));
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
            pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
        }
      }
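For comparison, here is a sketch of the same idea written as a single query expression, using ‘into’ after the ‘select’ as a query continuation. The Pm and Term classes and their sample data are cut-down stand-ins invented for this example, not the PrimeMinisters and Terms classes used above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SingleQueryGrouping
{
    // Hypothetical, minimal stand-ins for PrimeMinisters and Terms.
    class Pm { public int id; public string lastName; }
    class Term { public int id; public DateTime start; }

    public static List<string> DecadeSummaries()
    {
        Pm[] pms =
        {
            new Pm { id = 1, lastName = "Macdonald" },
            new Pm { id = 2, lastName = "Mackenzie" }
        };
        Term[] terms =
        {
            new Term { id = 1, start = new DateTime(1867, 7, 1) },
            new Term { id = 1, start = new DateTime(1878, 10, 17) },
            new Term { id = 2, start = new DateTime(1873, 11, 7) }
        };

        var byDecade = from pm in pms
                       join term in terms on pm.id equals term.id
                       orderby term.start
                       select new { last = pm.lastName, start = term.start }
                       into pmTerm   // query continuation: carry on from the select
                       group pmTerm by pmTerm.start.Year / 10 into pmGroups
                       orderby pmGroups.Key
                       select pmGroups;

        // Flatten each group into one line of text for easy printing.
        return byDecade
            .Select(g => (g.Key * 10) + "s: " +
                         string.Join(", ", g.Select(t => t.last)))
            .ToList();
    }

    static void Main()
    {
        foreach (string line in DecadeSummaries())
        {
            Console.WriteLine(line);
        }
    }
}
```

Whether this reads better than two separate queries is a matter of taste; the two-step version makes the intermediate list easier to inspect.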

LINQ Group Joins

In the last post, we saw how to use the Join() operator in LINQ. Recall that Join() takes two input sequences and looks for matches between elements in the two sequences based on the equality of a key field that is present in both sequences. The ordinary Join() will produce a separate output element for each matched pair. Thus in our example with Canada’s prime ministers where we used a join to get a list of prime ministers matched with their terms of office, those PMs who had more than one term were represented in the Join() by as many elements as they had terms of office.

Sometimes it’s useful to produce a single list of elements for each value of the key in the first (outer) input sequence. For example, we might want a list of all terms of office attached to each prime minister. This sort of query isn’t normal for a database (there’s no equivalent in SQL for example), but LINQ provides the GroupJoin() which does just that.

To see how it works, suppose we wanted to produce a list of prime ministers where we calculate the total time served for each person. For PMs serving multiple terms, we need to add up the time spans of all the terms for that person. We can do that with the following code:

      var pmList30 = primeMinisters
        .GroupJoin(terms,
          pm => pm.id,
          term => term.id,
          (pm, pmTerms) => new
          {
            first = pm.firstName,
            last = pm.lastName,
            totalDays = pmTerms.Sum(time => (time.end - time.start).Days)
          });
      foreach (var pm in pmList30)
      {
        Console.WriteLine("{0} {1} served a total of {2} days.",
          pm.first, pm.last, pm.totalDays);
      }

Here, primeMinisters is the outer sequence and terms is the inner sequence. As with the regular Join(), the next two arguments specify the keys on which the match should be done, so here we match pm.id with term.id as before. Where a GroupJoin() differs from a Join() is in the last argument. In the Join(), this last argument was a function that took the two elements in a matched pair as arguments and produced some data object calculated from these two objects as output.

In the GroupJoin(), the last argument is a function whose first argument is the element from the outer sequence as before, but this time the second argument is a sequence of elements from the inner sequence that match this outer element. Thus pmTerms contains a list of all terms matching that particular pm.

In this example, we create an anonymous object containing the first and last names of the PM, and then use LINQ’s Sum() operator to add up the number of whole days in each TimeSpan that we get by calculating the difference between end and start times for a given term of office. (We’ll consider Sum() in more detail later; for now, take note that it is a non-deferred operator, so it calculates its result as soon as it is called.) The output is:

John Macdonald served a total of 6934 days.
Alexander Mackenzie served a total of 1796 days.
John Abbott served a total of 527 days.
John Thompson served a total of 737 days.
Mackenzie Bowell served a total of 493 days.
Charles Tupper served a total of 68 days.
Wilfrid Laurier served a total of 5564 days.
Robert Borden served a total of 3196 days.
Arthur Meighen served a total of 625 days.
William Mackenzie King served a total of 7826 days.
Richard Bennett served a total of 1903 days.
Louis St. Laurent served a total of 3140 days.
John Diefenbaker served a total of 2131 days.
Lester Pearson served a total of 1825 days.
Pierre Trudeau served a total of 5640 days.
Joe Clark served a total of 272 days.
John Turner served a total of 78 days.
Brian Mulroney served a total of 3202 days.
Kim Campbell served a total of 131 days.
Jean Chrétien served a total of 3689 days.
Paul Martin served a total of 786 days.
Stephen Harper served a total of 2298 days.

A group join can also be written as a query expression. The above code looks like this:

      var pmList31 = from pm in primeMinisters
                     join term in terms on pm.id equals term.id into pmTerms
                     select new
                     {
                       first = pm.firstName,
                       last = pm.lastName,
                       totalDays = pmTerms.Sum(time => (time.end - time.start).Days)
                     };
      foreach (var pm in pmList31)
      {
        Console.WriteLine("{0} {1} served a total of {2} days.",
          pm.first, pm.last, pm.totalDays);
      }

The feature that makes this a group join is the addition of ‘into pmTerms’ at the end of the join clause. This defines pmTerms as the sequence of term objects that matches a given pm.id.

We can also save the entire sequence produced by each match and iterate over it in the usual way. For example, we can produce an annotated list of terms served by each PM as follows.

      var pmList32 = from pm in primeMinisters
                     join term in terms on pm.id equals term.id into pmTerms
                     select new
                     {
                       first = pm.firstName,
                       last = pm.lastName,
                       termsList = pmTerms
                     };
      foreach (var pm in pmList32)
      {
        Console.WriteLine("{0} {1}:", pm.first, pm.last);
        foreach (var term in pm.termsList)
        {
          Console.WriteLine("  {0:dd MMM yyyy} to {1:dd MMM yyyy}", term.start, term.end);
        }
      }

We do the same query as before, but this time we save the entire pmTerms list in the termsList field of the output object. We then use a nested foreach loop to print out the list of terms for each PM:

John Macdonald:
  01 Jul 1867 to 05 Nov 1873
  17 Oct 1878 to 06 Jun 1891
Alexander Mackenzie:
  07 Nov 1873 to 08 Oct 1878
John Abbott:
  16 Jun 1891 to 24 Nov 1892
John Thompson:
  05 Dec 1892 to 12 Dec 1894
Mackenzie Bowell:
  21 Dec 1894 to 27 Apr 1896
Charles Tupper:
  01 May 1896 to 08 Jul 1896
Wilfrid Laurier:
  11 Jul 1896 to 06 Oct 1911
Robert Borden:
  10 Oct 1911 to 10 Jul 1920
Arthur Meighen:
  10 Jul 1920 to 29 Dec 1921
  29 Jun 1926 to 25 Sep 1926
William Mackenzie King:
  29 Dec 1921 to 28 Jun 1926
  25 Sep 1926 to 07 Aug 1930
  23 Oct 1935 to 15 Nov 1948
Richard Bennett:
  07 Aug 1930 to 23 Oct 1935
Louis St. Laurent:
  15 Nov 1948 to 21 Jun 1957
John Diefenbaker:
  21 Jun 1957 to 22 Apr 1963
Lester Pearson:
  22 Apr 1963 to 20 Apr 1968
Pierre Trudeau:
  20 Apr 1968 to 03 Jun 1979
  03 Mar 1980 to 29 Jun 1984
Joe Clark:
  04 Jun 1979 to 02 Mar 1980
John Turner:
  30 Jun 1984 to 16 Sep 1984
Brian Mulroney:
  17 Sep 1984 to 24 Jun 1993
Kim Campbell:
  25 Jun 1993 to 03 Nov 1993
Jean Chrétien:
  04 Nov 1993 to 11 Dec 2003
Paul Martin:
  12 Dec 2003 to 05 Feb 2006
Stephen Harper:
  06 Feb 2006 to 23 May 2012
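One further difference between the two operators is worth a quick sketch: GroupJoin() produces a result for every element of the outer sequence, even one with no matches (its group is simply empty), whereas Join() drops such elements. The Pm and Term classes, the sample data and the unmatched entry "Nobody" below are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupJoinUnmatched
{
    // Hypothetical, minimal stand-ins for PrimeMinisters and Terms.
    class Pm { public int id; public string last; }
    class Term { public int id; public int startYear; }

    static readonly Pm[] pms =
    {
        new Pm { id = 1, last = "Macdonald" },
        new Pm { id = 2, last = "Mackenzie" },
        new Pm { id = 99, last = "Nobody" }   // has no matching terms
    };

    static readonly Term[] terms =
    {
        new Term { id = 1, startYear = 1867 },
        new Term { id = 2, startYear = 1873 },
        new Term { id = 1, startYear = 1878 }
    };

    // One result per outer element, including the unmatched one.
    public static List<string> TermCounts()
    {
        return pms
            .GroupJoin(terms, pm => pm.id, term => term.id,
                (pm, pmTerms) => pm.last + ": " + pmTerms.Count())
            .ToList();
    }

    // One result per matched pair; the unmatched element disappears.
    public static List<string> JoinedNames()
    {
        return pms
            .Join(terms, pm => pm.id, term => term.id,
                (pm, term) => pm.last)
            .ToList();
    }

    static void Main()
    {
        Console.WriteLine(string.Join("; ", TermCounts()));
        Console.WriteLine(string.Join("; ", JoinedNames()));
    }
}
```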

LINQ Joins

When constructing a database, we often split up the information to be stored into several linked tables in order to avoid duplicating data. Using our Canadian prime ministers as an example, we store the name and political party of each prime minister in one list, and the terms of office in another. By providing each prime minister in the first list with a unique identifier, or key, we can use that key in the list of terms of office so that each term is linked to the prime minister that was in power at the time. That way, we can store as much information as we like about the prime minister without having to duplicate this information in the terms list.

However, we often need information from two or more lists in our output. Since the only link between a prime minister and a term of office is the id number of the prime minister, if we want to print out a list of prime ministers’ names with their corresponding terms of office we need to look up each id number in the first list to get the name to attach to each term of office.

This is such a common operation that most languages (such as SQL) that deal with data have commands for doing it. LINQ is no exception, and in fact the syntax is very similar to SQL: we need to look at the Join() operator.

The most basic type of Join() takes two input sequences called the outer and inner sequences. These sequences must each have a data field, or key, which is of the same data type and can be compared. A Join() will search for all possible matches between the two sequences by comparing these key values and return a list of objects derived from these matching pairs.

All this is a bit hard to understand in the abstract, so we’ll give an example. Suppose we want to build the list described above: from the list of prime ministers and the list of terms of office, we want a list of prime ministers’ names and their terms of office. The code for doing this is

      var pmList27 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        });
      foreach (var pm in pmList27)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

We see that Join() takes 4 arguments in addition to the object that calls it. The calling object (primeMinisters here) is the outer sequence, and the first argument (‘terms’ here) is the inner sequence. The second argument is a function which takes as input an element from the outer sequence (so here it takes ‘pm’ which is a PrimeMinisters object) and calculates the key which is to be used in the comparison. Thus here we use the id field of PrimeMinisters as the key.

The third argument does the same for the inner sequence, so here we’re using the id field of a Terms object as the comparison. The Join() will thus look for all pairs of (PrimeMinisters, Terms) objects where the id keys from the two are equal. This sort of join is called an equijoin since it uses an equality test on the keys (as opposed, say, to an inequality such as ‘greater than’).

The last argument to Join() specifies what is to be returned for each matching pair. This is determined by a function which takes two arguments, the first of which is the element from the outer sequence and the second of which is the element from the inner sequence. Note that you must put these arguments in the right order! We then construct, in this case, an anonymous type containing the first and last names of the prime minister from the pm object and the dates of office from the term object. The overall effect is therefore to produce a list of names and dates for all terms of office. The output is

John Macdonald: 01 Jul 1867 to 05 Nov 1873
John Macdonald: 17 Oct 1878 to 06 Jun 1891
Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
John Abbott: 16 Jun 1891 to 24 Nov 1892
John Thompson: 05 Dec 1892 to 12 Dec 1894
Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
Charles Tupper: 01 May 1896 to 08 Jul 1896
Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911
Robert Borden: 10 Oct 1911 to 10 Jul 1920
Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
Richard Bennett: 07 Aug 1930 to 23 Oct 1935
Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
Lester Pearson: 22 Apr 1963 to 20 Apr 1968
Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
Joe Clark: 04 Jun 1979 to 02 Mar 1980
John Turner: 30 Jun 1984 to 16 Sep 1984
Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
Kim Campbell: 25 Jun 1993 to 03 Nov 1993
Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
Paul Martin: 12 Dec 2003 to 05 Feb 2006
Stephen Harper: 06 Feb 2006 to 23 May 2012

The documentation for Enumerable.Join() states that the operator preserves the order of the elements of the outer sequence and, for each of these elements, the order of the matching elements of the inner sequence. That fits what we see here: the prime ministers are listed in the order they appear in the outer sequence, with the terms attached to each person. However, if we swap the order in the Join() as in the following, we get the same output:

      var pmList29 = terms
        .Join(primeMinisters, term => term.id, pm => pm.id,
        (term, pm) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        });
      foreach (var pm in pmList29)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

With terms as the outer sequence, the results follow the order of the terms array. If that array were stored chronologically, Mackenzie’s term would come between the two Macdonald terms; since it doesn’t, the terms array is presumably stored grouped by prime minister, which is why both versions give the same output. This ordering guarantee applies to LINQ to Objects, so the moral of the story is still that if the order of the output is important to you, you should use an OrderBy() clause to make sure it’s what you want.

A join can also be written as a query expression. The first query above becomes

      var pmList28 = from pm in primeMinisters
                     join term in terms on pm.id equals term.id
                     select new
                      {
                        first = pm.firstName,
                        last = pm.lastName,
                        start = term.start,
                        end = term.end
                      };
      foreach (var pm in pmList28)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

The outer sequence is specified in the ‘from’ clause, while the inner sequence is given at the start of the ‘join’ clause (‘term in terms’ here). After the inner sequence is specified, the keyword ‘on’ introduces the equality test that is to be done. The two keys that are to be compared are given, and separated by the keyword ‘equals’. Note that you need to use this special keyword in a join and not the usual == operator.

A join in a query expression doesn’t have the returned value built in, as it does in standard operator notation. Rather, we use the usual ‘select’ clause to specify what should be returned for each matching pair.

Finally, it’s worth noting that the type of join we did here is an inner join, which means that an element from one sequence appears in the results only if it can be matched with an element from the other sequence. In our case, that’s fine, since all elements in both sequences have matches in the other, but in some cases we’d like to know if there is an element in one sequence that has no match in the other. To do that we need to do an outer join, but that’s a topic for a future post.
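Returning briefly to the order of a Join()’s results: for LINQ to Objects, the documented behaviour is that the order of the outer sequence is preserved, with matching inner elements in their own order for each outer element. The sketch below checks this on a deliberately small data set in which the terms are stored chronologically. The Pm and Term classes and the data are hypothetical stand-ins, not the classes used above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class JoinOrderDemo
{
    // Hypothetical, minimal stand-ins for PrimeMinisters and Terms.
    class Pm { public int id; public string last; }
    class Term { public int id; public int startYear; }

    static readonly Pm[] pms =
    {
        new Pm { id = 1, last = "Macdonald" },
        new Pm { id = 2, last = "Mackenzie" }
    };

    // Stored chronologically, so the two Macdonald terms are not adjacent.
    static readonly Term[] terms =
    {
        new Term { id = 1, startYear = 1867 },
        new Term { id = 2, startYear = 1873 },
        new Term { id = 1, startYear = 1878 }
    };

    // Outer sequence is pms: results follow the order of pms.
    public static List<string> PmsAsOuter()
    {
        return pms
            .Join(terms, pm => pm.id, term => term.id,
                (pm, term) => pm.last + " " + term.startYear)
            .ToList();
    }

    // Outer sequence is terms: results follow the order of terms.
    public static List<string> TermsAsOuter()
    {
        return terms
            .Join(pms, term => term.id, pm => pm.id,
                (term, pm) => pm.last + " " + term.startYear)
            .ToList();
    }

    static void Main()
    {
        Console.WriteLine(string.Join("; ", PmsAsOuter()));
        Console.WriteLine(string.Join("; ", TermsAsOuter()));
    }
}
```

Other LINQ providers may not make the same promise, so an explicit OrderBy() remains the safe choice when the order matters.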

LINQ sorting

We’ve already seen how to sort or order sequences in LINQ in a simple case. Using our list of Canada’s prime ministers, we can sort them by last name like this:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmList25 = primeMinisters
        .OrderBy(pm => pm.lastName);
      foreach (var pm in pmList25)
      {
        Console.WriteLine(pm);
      }

The OrderBy() command takes a single argument which is a function that returns the key on which sorting should take place. With this form, keys are compared using the default comparer for the key type, so the type should implement the IComparable<T> (or the non-generic IComparable) interface. IComparable<T> requires a CompareTo(T other) method to be defined, which returns a negative number, zero or a positive number depending on whether the current object is less than, equal to or greater than other. It is up to CompareTo() to specify what ‘greater than’, ‘equal to’ and ‘less than’ mean. All the fundamental data types such as int, float and so on can be used in this way, as can many other standard types such as string and DateTime. For a string, ordering is done alphabetically (using the current culture’s rules by default) and for DateTime it is done chronologically, as you’d expect.

OrderBy() has an equivalent in query expression syntax, so we could write the above as:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmList26 = from pm in primeMinisters
                     orderby pm.lastName
                     select pm;
      foreach (var pm in pmList26)
      {
        Console.WriteLine(pm);
      }

The output of both these forms is:

3. John Abbott (Conservative)
11. Richard Bennett (Conservative)
8. Robert Borden (Conservative)
5. Mackenzie Bowell (Conservative)
19. Kim Campbell (Conservative)
20. Jean Chrétien (Liberal)
16. Joe Clark (Conservative)
13. John Diefenbaker (Conservative)
22. Stephen Harper (Conservative)
7. Wilfrid Laurier (Liberal)
1. John Macdonald (Conservative)
2. Alexander Mackenzie (Liberal)
10. William Mackenzie King (Liberal)
21. Paul Martin (Liberal)
9. Arthur Meighen (Conservative)
18. Brian Mulroney (Conservative)
14. Lester Pearson (Liberal)
12. Louis St. Laurent (Liberal)
4. John Thompson (Conservative)
15. Pierre Trudeau (Liberal)
6. Charles Tupper (Conservative)
17. John Turner (Liberal)
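The single-argument OrderBy() relies on the key type knowing how to compare itself with another value of the same type. As a minimal sketch of what that involves, the TermLength class below (a made-up type, not part of the prime ministers code) implements IComparable<TermLength>, which is enough for it to be used directly as a sort key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ComparableKeyDemo
{
    // Hypothetical key type that knows how to compare itself.
    class TermLength : IComparable<TermLength>
    {
        public readonly int Days;

        public TermLength(int days)
        {
            Days = days;
        }

        // Negative, zero or positive: this is less than, equal to or greater than other.
        public int CompareTo(TermLength other)
        {
            if (other == null) return 1;   // by convention, anything sorts after null
            return Days.CompareTo(other.Days);
        }
    }

    public static List<int> SortedDays()
    {
        TermLength[] lengths =
        {
            new TermLength(537), new TermLength(68), new TermLength(131)
        };

        // No comparer argument: OrderBy() falls back on TermLength.CompareTo().
        return lengths
            .OrderBy(length => length)
            .Select(length => length.Days)
            .ToList();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", SortedDays()));
    }
}
```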

By default, OrderBy() sorts in ascending order; if you want descending order, there is an OrderByDescending() command that takes the same argument. In query expression syntax we add the keyword ‘descending’ after the ‘orderby’ clause:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmList26 = from pm in primeMinisters
                     orderby pm.lastName descending
                     select pm;
      foreach (var pm in pmList26)
      {
        Console.WriteLine(pm);
      }
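In standard query operator syntax, the descending sort might look like the following sketch. The array of names is invented sample data, and StringComparer.Ordinal is passed as an optional second argument only so that the result doesn’t depend on the machine’s culture settings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DescendingDemo
{
    public static List<string> NamesDescending()
    {
        // Hypothetical sample data.
        string[] lastNames = { "Laurier", "Abbott", "Tupper", "Borden" };

        return lastNames
            // Same key selector as OrderBy(); the comparer argument is optional.
            .OrderByDescending(name => name, StringComparer.Ordinal)
            .ToList();
    }

    static void Main()
    {
        foreach (string name in NamesDescending())
        {
            Console.WriteLine(name);
        }
    }
}
```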

Although this simple form is adequate for most needs, sometimes we need to define our own sorting criterion. There is a second form of OrderBy() (and OrderByDescending()) that allows this. For example, suppose we wanted to sort the prime ministers by the length of term they served. In this case, we need a custom comparer that compares two Terms objects and defines one as ‘greater than’ the other if the duration of the term is larger.

To do this, we need to write our own class that implements IComparer<Terms>. We need write only a single method to do this, so we get:

  class TermComparer : IComparer<Terms>
  {
    public int Compare(Terms x, Terms y)
    {
      TimeSpan duration1 = x.end - x.start;
      TimeSpan duration2 = y.end - y.start;

      return duration1 > duration2 ? 1 :
        duration2 == duration1 ? 0 : -1;
    }
  }

The ‘end’ and ‘start’ fields of a Terms object are DateTime objects, and the minus operator is overloaded for the DateTime class to yield a TimeSpan. Also, the > operator is overloaded for a TimeSpan so we can use it directly in writing our customized Compare() method.

With this class, we can now write our query, although when using this form of OrderBy() we must use standard query operator syntax, as the ‘orderby’ clause in a query expression has no way of accepting a comparer.

      TermComparer termComparer = new TermComparer();
      var pmList22 = primeMinisters
        .SelectMany(pm => terms
          .Where(term => term.id == pm.id)
          .Select(term => new
          {
            firstName = pm.firstName,
            surname = pm.lastName,
            inOffice = term
          }))
          .OrderBy((pmTerm => pmTerm.inOffice), termComparer);
      foreach (var pmTerm in pmList22)
      {
        TimeSpan duration = pmTerm.inOffice.end - pmTerm.inOffice.start;
        Console.WriteLine(pmTerm.firstName + " " + pmTerm.surname +
          ": {0:dd MMM yyyy} to {1:dd MMM yyyy} ({2} days).",
          pmTerm.inOffice.start, pmTerm.inOffice.end, duration.Days);
      }

The beginning of this query is the same as in the last post, where we wrote a query that matched up a prime minister’s name with their terms of office. The difference here is in the OrderBy() method at the end of the query. We give the inOffice field as the object on which ordering is to be done. Since this is a Terms object, we can use a TermComparer to do the comparison. We then pass the termComparer object to OrderBy(), and print out the results, nicely formatted. The results are:

Charles Tupper: 01 May 1896 to 08 Jul 1896 (68 days).
John Turner: 30 Jun 1984 to 16 Sep 1984 (78 days).
Arthur Meighen: 29 Jun 1926 to 25 Sep 1926 (88 days).
Kim Campbell: 25 Jun 1993 to 03 Nov 1993 (131 days).
Joe Clark: 04 Jun 1979 to 02 Mar 1980 (272 days).
Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896 (493 days).
John Abbott: 16 Jun 1891 to 24 Nov 1892 (527 days).
Arthur Meighen: 10 Jul 1920 to 29 Dec 1921 (537 days).
John Thompson: 05 Dec 1892 to 12 Dec 1894 (737 days).
Paul Martin: 12 Dec 2003 to 05 Feb 2006 (786 days).
William Mackenzie King: 25 Sep 1926 to 07 Aug 1930 (1412 days).
Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984 (1579 days).
William Mackenzie King: 29 Dec 1921 to 28 Jun 1926 (1642 days).
Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878 (1796 days).
Lester Pearson: 22 Apr 1963 to 20 Apr 1968 (1825 days).
Richard Bennett: 07 Aug 1930 to 23 Oct 1935 (1903 days).
John Diefenbaker: 21 Jun 1957 to 22 Apr 1963 (2131 days).
Stephen Harper: 06 Feb 2006 to 22 May 2012 (2297 days).
John Macdonald: 01 Jul 1867 to 05 Nov 1873 (2319 days).
Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957 (3140 days).
Robert Borden: 10 Oct 1911 to 10 Jul 1920 (3196 days).
Brian Mulroney: 17 Sep 1984 to 24 Jun 1993 (3202 days).
Jean Chrétien: 04 Nov 1993 to 11 Dec 2003 (3689 days).
Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979 (4061 days).
John Macdonald: 17 Oct 1878 to 06 Jun 1891 (4615 days).
William Mackenzie King: 23 Oct 1935 to 15 Nov 1948 (4772 days).
Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911 (5564 days).

Note that, in order to write the class that implements IComparer, we need to know the data type of the object on which comparisons are to be done. We therefore couldn’t have written the OrderBy() so that comparisons are done on the pmTerm itself, since this is an anonymous type.
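For this particular sort there is also a route that avoids a custom comparer, sketched below: have the key selector return the duration itself. TimeSpan already knows how to compare itself, so the single-argument OrderBy() is enough, and it works even though the elements are of an anonymous type. The three terms used here are copied from the output above; the class and method names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DurationKeyDemo
{
    public static List<string> ShortestFirst()
    {
        // A few terms from the list above, held in an anonymous type.
        var pmTerms = new[]
        {
            new { last = "Campbell", start = new DateTime(1993, 6, 25), end = new DateTime(1993, 11, 3) },
            new { last = "Tupper", start = new DateTime(1896, 5, 1), end = new DateTime(1896, 7, 8) },
            new { last = "Turner", start = new DateTime(1984, 6, 30), end = new DateTime(1984, 9, 16) }
        };

        return pmTerms
            // The key is a TimeSpan, which implements IComparable<TimeSpan>.
            .OrderBy(pmTerm => pmTerm.end - pmTerm.start)
            .Select(pmTerm => pmTerm.last)
            .ToList();
    }

    static void Main()
    {
        foreach (string name in ShortestFirst())
        {
            Console.WriteLine(name);
        }
    }
}
```

A custom IComparer is still the tool to reach for when the comparison can’t be reduced to a single comparable key.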

Stable and unstable commands

An important point about any sorting command is whether it is stable. A stable sort preserves the input order of any elements whose keys are equal; an unstable sort makes no such guarantee, so two elements with the same key might come out in their original order or swapped around. For LINQ to Objects, the documentation states that OrderBy() performs a stable sort, although it’s safest not to assume this of other LINQ providers.

Even with a stable sort, care is needed when we want to sort a sequence on more than one key. For example, several of Canada’s prime ministers have the first name John, so we might want to sort the list by first name, and then by last name. We expect the result to retain the ordering by first name, so the second sort should rearrange the elements only for those who share a first name (such as the Johns). The following code will not do what we want:

      var pmList23 = primeMinisters
        .OrderBy(pm => pm.firstName)
        .OrderBy(pm => pm.lastName);
      foreach (var pm in pmList23)
      {
        Console.WriteLine(pm);
      }

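The second OrderBy() sorts the whole sequence again using the last name as the primary key, so the ordering by first name is lost. Since no two prime ministers in our list share a last name, we end up with a list sorted by last name only.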

To fix this, we need to understand that the output of OrderBy() is a special type of sequence that implements the IOrderedEnumerable<T> interface, which extends IEnumerable<T>. This type of sequence can be passed to a ThenBy() command. Effectively, the sequence remembers the key on which the first OrderBy() was run, and ThenBy() reorders elements only among those whose first keys are equal. ThenBy() itself is a stable command: elements that are equal on both keys keep their input order. ThenBy() requires an IOrderedEnumerable<T> as input; you’ll get a compilation error if you try to feed it an ordinary IEnumerable<T>.

Thus the solution to our problem is to feed the output of the first OrderBy() into a ThenBy() command:

      var pmList23 = primeMinisters
        .OrderBy(pm => pm.firstName)
        .ThenBy(pm => pm.lastName);
      foreach (var pm in pmList23)
      {
        Console.WriteLine(pm);
      }

The output is what we’d expect:

2. Alexander Mackenzie (Liberal)
9. Arthur Meighen (Conservative)
18. Brian Mulroney (Conservative)
6. Charles Tupper (Conservative)
20. Jean Chrétien (Liberal)
16. Joe Clark (Conservative)
3. John Abbott (Conservative)
13. John Diefenbaker (Conservative)
1. John Macdonald (Conservative)
4. John Thompson (Conservative)
17. John Turner (Liberal)
19. Kim Campbell (Conservative)
14. Lester Pearson (Liberal)
12. Louis St. Laurent (Liberal)
5. Mackenzie Bowell (Conservative)
21. Paul Martin (Liberal)
15. Pierre Trudeau (Liberal)
11. Richard Bennett (Conservative)
8. Robert Borden (Conservative)
22. Stephen Harper (Conservative)
7. Wilfrid Laurier (Liberal)
10. William Mackenzie King (Liberal)

The ordering by first name is preserved, and the Johns are correctly ordered by last name as well.

There is no ‘thenby’ keyword in query expression syntax, but it is possible to do successive orderings by adding the sort keys in a comma-separated list. Thus the above query could be written as:

      var pmList24 = from pm in primeMinisters
                     orderby pm.firstName, pm.lastName
                     select pm;
      foreach (var pm in pmList24)
      {
        Console.WriteLine(pm);
      }
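The difference between chaining two OrderBy() calls and following OrderBy() with ThenBy() can be checked on a small in-memory array. This is only a sketch: the Person class is a hypothetical stand-in for PrimeMinisters, and the four names are picked from the list above.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the PrimeMinisters class.
class Person
{
    public string firstName;
    public string lastName;
}

static class ThenByDemo
{
    static Person[] People()
    {
        return new[]
        {
            new Person { firstName = "John", lastName = "Turner" },
            new Person { firstName = "Brian", lastName = "Mulroney" },
            new Person { firstName = "John", lastName = "Abbott" },
            new Person { firstName = "Arthur", lastName = "Meighen" }
        };
    }

    // Two OrderBy() calls: the second re-sorts with lastName as the primary key.
    public static string[] TwoOrderBys()
    {
        return People()
            .OrderBy(p => p.firstName)
            .OrderBy(p => p.lastName)
            .Select(p => p.firstName + " " + p.lastName)
            .ToArray();
    }

    // OrderBy() then ThenBy(): lastName only breaks ties on firstName.
    public static string[] OrderByThenBy()
    {
        return People()
            .OrderBy(p => p.firstName)
            .ThenBy(p => p.lastName)
            .Select(p => p.firstName + " " + p.lastName)
            .ToArray();
    }

    static void Main()
    {
        // John Abbott, Arthur Meighen, Brian Mulroney, John Turner
        Console.WriteLine(string.Join(", ", TwoOrderBys()));
        // Arthur Meighen, Brian Mulroney, John Abbott, John Turner
        Console.WriteLine(string.Join(", ", OrderByThenBy()));
    }
}
```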

ThenBy() also allows a custom comparer to be defined in the same way as OrderBy(), and there is also a ThenByDescending() command.
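As a rough sketch of these two variants, the code below sorts some made-up rows (the data is assumed for illustration and is not the PrimeMinisters array). The first method uses ThenByDescending(); the second passes one of the framework’s ready-made comparers, StringComparer.OrdinalIgnoreCase, to ThenBy() in place of a hand-written one.

```csharp
using System;
using System.Linq;

static class ThenByVariantsDemo
{
    // Sorts by party, then by id from highest to lowest within each party.
    public static int[] IdsByPartyThenIdDescending()
    {
        var pms = new[]
        {
            new { id = 1, party = "Conservative" },
            new { id = 2, party = "Liberal" },
            new { id = 7, party = "Liberal" },
            new { id = 8, party = "Conservative" }
        };
        return pms
            .OrderBy(pm => pm.party)
            .ThenByDescending(pm => pm.id)
            .Select(pm => pm.id)
            .ToArray();
    }

    // Ties on party are broken by lastName, ignoring case. With
    // StringComparer.Ordinal, "Macdonald" would come before "borden",
    // because upper-case letters have lower code points.
    public static string[] NamesIgnoringCase()
    {
        var pms = new[]
        {
            new { lastName = "Macdonald", party = "Conservative" },
            new { lastName = "borden", party = "Conservative" },
            new { lastName = "Laurier", party = "Liberal" }
        };
        return pms
            .OrderBy(pm => pm.party)
            .ThenBy(pm => pm.lastName, StringComparer.OrdinalIgnoreCase)
            .Select(pm => pm.lastName)
            .ToArray();
    }

    static void Main()
    {
        foreach (int id in IdsByPartyThenIdDescending())
        {
            Console.WriteLine(id);                             // 8, 1, 7, 2
        }
        Console.WriteLine(string.Join(", ", NamesIgnoringCase())); // borden, Macdonald, Laurier
    }
}
```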

LINQ Take and Skip

A couple of useful LINQ commands are Take and Skip, together with their variants TakeWhile and SkipWhile. They are quite simple commands, but are available only as standard query operators (at least in C#; there are query expression versions in Visual Basic).

Both of these commands require an IEnumerable<T> as input. Take() takes an int argument and returns that number of elements starting at the beginning of the input. Skip() is essentially the opposite of Take(), in that it takes an int argument and skips that number of elements, returning the remainder of the sequence.
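Before turning to the prime ministers, the behaviour just described can be seen on a plain array of ints. This is a minimal sketch with made-up data; it also shows that asking Take() for more elements than the input contains simply returns everything that is there.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TakeSkipDemo
{
    static readonly int[] Ids = { 1, 2, 3, 4, 5, 6, 7 };

    // The first three elements.
    public static int[] FirstThree()
    {
        return Ids.Take(3).ToArray();
    }

    // Everything except the first three elements.
    public static int[] AllButFirstThree()
    {
        return Ids.Skip(3).ToArray();
    }

    // Skip() followed by Take() picks a slice out of the middle.
    public static int[] Middle()
    {
        return Ids.Skip(2).Take(3).ToArray();
    }

    // Asking for more elements than exist is not an error.
    public static int[] TakeTooMany()
    {
        return Ids.Take(100).ToArray();
    }

    static void Print(IEnumerable<int> values)
    {
        foreach (int value in values)
        {
            Console.Write(value + " ");
        }
        Console.WriteLine();
    }

    static void Main()
    {
        Print(FirstThree());        // 1 2 3
        Print(AllButFirstThree());  // 4 5 6 7
        Print(Middle());            // 3 4 5
        Print(TakeTooMany());       // 1 2 3 4 5 6 7
    }
}
```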

We’ll illustrate Take() with a few examples using our list of Canada’s prime ministers (see last post and links from there). First, a simple example showing how to return the first 10 prime ministers.

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmList12 = primeMinisters.Take(10);
      foreach (var pm in pmList12)
      {
        Console.WriteLine("{0}. {1} {2}", pm.id, pm.firstName, pm.lastName);
      }

The output is just the first 10 men from the list, formatted nicely:

1. John Macdonald
2. Alexander Mackenzie
3. John Abbott
4. John Thompson
5. Mackenzie Bowell
6. Charles Tupper
7. Wilfrid Laurier
8. Robert Borden
9. Arthur Meighen
10. William Mackenzie King

Although we can’t use ‘take’ in a query expression, it is possible to combine the query expression and standard operator syntax if we need to. Thus we could rewrite the above code as:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmList13 = (from pm in primeMinisters
                        select pm).Take(10);
      foreach (var pm in pmList13)
      {
        Console.WriteLine("{0}. {1} {2}", pm.id, pm.firstName, pm.lastName);
      }

The query expression portion of the command is enclosed in parentheses and the returned value from this command is used as the input to Take().

As a slightly more involved example, suppose we wanted to print out a list of the first 10 terms of office, ordered by date. The list we produced in the last post was ordered by the id number of the prime ministers and since some of them served more than one term, the dates aren’t in the correct order.

We can do this by using the OrderBy() command, which we’ll treat in more detail later. We can use the code from the last post and add a couple of lines to get what we want:

      var pmList14 = primeMinisters
        .SelectMany(pm => terms
          .Where(term => term.id == pm.id)
          .Select(term => new
          {
            surname = pm.lastName,
            inOffice = term
          }))
        .OrderBy(pmTerm => pmTerm.inOffice.start)
        .Take(10);
      foreach (var pmTerm in pmList14)
      {
        Console.WriteLine(pmTerm.surname + ": {0:dd MMM yyyy} to {1:dd MMM yyyy}",
          pmTerm.inOffice.start, pmTerm.inOffice.end);
      }

Recall that the function passed to SelectMany() returns, for each prime minister, a sequence of that PM’s terms, and that SelectMany() concatenates these sequences into a single flat list. We take the output from SelectMany() and feed it into OrderBy(). The argument of OrderBy() here is a lambda expression giving the value on which the sort should be done. Since the ‘start’ field is a DateTime and this type has a built-in comparer, the lambda can simply return the DateTime and OrderBy() will sort on it. If the data type on which we wished to sort did not have a default comparer, we’d have to provide one, but we’ll leave that until we consider OrderBy() in more detail.

The output from this code is:

Macdonald: 01 Jul 1867 to 05 Nov 1873
Mackenzie: 07 Nov 1873 to 08 Oct 1878
Macdonald: 17 Oct 1878 to 06 Jun 1891
Abbott: 16 Jun 1891 to 24 Nov 1892
Thompson: 05 Dec 1892 to 12 Dec 1894
Bowell: 21 Dec 1894 to 27 Apr 1896
Tupper: 01 May 1896 to 08 Jul 1896
Laurier: 11 Jul 1896 to 06 Oct 1911
Borden: 10 Oct 1911 to 10 Jul 1920
Meighen: 10 Jul 1920 to 29 Dec 1921

You can see that the two terms served by Macdonald are split by the term from Mackenzie, so the ordering on the start date has worked.

The other version of the Take command is TakeWhile(), which takes a boolean predicate as its argument instead of an int. TakeWhile() will return values from the input sequence as long as the predicate is true. Note that TakeWhile() will stop returning values as soon as it encounters an element for which the predicate is false, even if some later members of the sequence would return true.
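The fact that TakeWhile() stops at the first failure is what distinguishes it from Where(). A minimal sketch with a deliberately unsorted array of years (made-up input, chosen so that the two operators give different answers):

```csharp
using System;
using System.Linq;

static class TakeWhileDemo
{
    // Note that 1911 appears before 1878 and 1891.
    static readonly int[] Years = { 1867, 1873, 1911, 1878, 1891 };

    // Stops for good at 1911, the first year that fails the test.
    public static int[] TakeWhileBefore1900()
    {
        return Years.TakeWhile(year => year < 1900).ToArray();
    }

    // Tests every element, so the later matches are kept as well.
    public static int[] WhereBefore1900()
    {
        return Years.Where(year => year < 1900).ToArray();
    }

    static void Main()
    {
        Console.WriteLine(TakeWhileBefore1900().Length); // 2 (1867, 1873)
        Console.WriteLine(WhereBefore1900().Length);     // 4 (1867, 1873, 1878, 1891)
    }
}
```

This is why the examples that follow sort the sequence with OrderBy() before applying TakeWhile().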

For example, suppose we want a list of terms of office that started before 1900. We could write it like this:

      var pmList15 = primeMinisters
        .SelectMany(pm => terms
          .Where(term => term.id == pm.id)
          .Select(term => new
          {
            surname = pm.lastName,
            inOffice = term
          }))
        .OrderBy(pmTerm => pmTerm.inOffice.start)
        .TakeWhile(pmTerm => pmTerm.inOffice.start < DateTime.Parse("1900/1/1"));
      foreach (var pmTerm in pmList15)
      {
        Console.WriteLine(pmTerm.surname + ": {0:dd MMM yyyy} to {1:dd MMM yyyy}",
          pmTerm.inOffice.start, pmTerm.inOffice.end);
      }

The code is the same as the previous example except for the TakeWhile() statement, which has as its argument the predicate that the start date must be before Jan 1 1900. Note that the < operator is overloaded for DateTime; in any custom data type we’d have to provide this overloaded operator ourselves.

The output from this code is:

Macdonald: 01 Jul 1867 to 05 Nov 1873
Mackenzie: 07 Nov 1873 to 08 Oct 1878
Macdonald: 17 Oct 1878 to 06 Jun 1891
Abbott: 16 Jun 1891 to 24 Nov 1892
Thompson: 05 Dec 1892 to 12 Dec 1894
Bowell: 21 Dec 1894 to 27 Apr 1896
Tupper: 01 May 1896 to 08 Jul 1896
Laurier: 11 Jul 1896 to 06 Oct 1911
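The remark about overloading < for a custom data type can be sketched as follows. YearMonth is a hypothetical struct invented for this illustration; it is not part of the sample classes. C# requires < and > to be overloaded as a pair, so both are defined.

```csharp
using System;
using System.Linq;

// Hypothetical value type, used only to illustrate operator overloading.
struct YearMonth
{
    public readonly int Year;
    public readonly int Month;

    public YearMonth(int year, int month)
    {
        Year = year;
        Month = month;
    }

    public static bool operator <(YearMonth a, YearMonth b)
    {
        return a.Year < b.Year || (a.Year == b.Year && a.Month < b.Month);
    }

    // The compiler insists on > whenever < is defined.
    public static bool operator >(YearMonth a, YearMonth b)
    {
        return b < a;
    }
}

static class CustomOperatorDemo
{
    // Counts the leading entries that fall before January 1900.
    public static int CountBefore1900()
    {
        YearMonth[] starts =
        {
            new YearMonth(1867, 7),
            new YearMonth(1873, 11),
            new YearMonth(1896, 7),
            new YearMonth(1911, 10)
        };
        YearMonth cutoff = new YearMonth(1900, 1);
        return starts.TakeWhile(start => start < cutoff).Count();
    }

    static void Main()
    {
        Console.WriteLine(CountBefore1900()); // 3
    }
}
```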

TakeWhile() has a second form in which the predicate takes two arguments, with the second argument being an int that represents the index of the element in the input sequence. For example, if we want to modify the search in the last example so that it returns a list of terms before 1900 or the first five, whichever is shorter, we can write:

      var pmList16 = primeMinisters
        .SelectMany(pm => terms
          .Where(term => term.id == pm.id)
          .Select(term => new
          {
            surname = pm.lastName,
            inOffice = term
          }))
        .OrderBy(date => date.inOffice.start)
        .TakeWhile((pmTerm, num) =>
          pmTerm.inOffice.start < DateTime.Parse("1900/1/1") &&
          num < 5);
      foreach (var pmTerm in pmList16)
      {
        Console.WriteLine(pmTerm.surname + ": {0:dd MMM yyyy} to {1:dd MMM yyyy}",
          pmTerm.inOffice.start, pmTerm.inOffice.end);
      }

Here, ‘num’ is the zero-based index of the element in the input, so the TakeWhile() returns elements until the date passes 1900 or num is 5 or greater. Since the second condition will fail first, the output is:

Macdonald: 01 Jul 1867 to 05 Nov 1873
Mackenzie: 07 Nov 1873 to 08 Oct 1878
Macdonald: 17 Oct 1878 to 06 Jun 1891
Abbott: 16 Jun 1891 to 24 Nov 1892
Thompson: 05 Dec 1892 to 12 Dec 1894

Skip() works in much the same way as Take() so we’ll give just a few examples of it. If we wanted to return the last 5 elements of the list, we could do it like this:

      var pmList17 = primeMinisters.Skip(primeMinisters.Count() - 5);
      foreach (var pm in pmList17)
      {
        Console.WriteLine("{0}. {1} {2}", pm.id, pm.firstName, pm.lastName);
      }

We’ve used the Count() method to get the number of elements in primeMinisters, and then we skip over the first Count – 5 elements and return the rest. The output is

18. Brian Mulroney
19. Kim Campbell
20. Jean Chrétien
21. Paul Martin
22. Stephen Harper

If we wanted a list of all terms of office after 1900, we could use SkipWhile():

      var pmList18 = primeMinisters
        .SelectMany(pm => terms
          .Where(term => term.id == pm.id)
          .Select(term => new
          {
            surname = pm.lastName,
            inOffice = term
          }))
        .OrderBy(pmTerm => pmTerm.inOffice.start)
        .SkipWhile(pmTerm => pmTerm.inOffice.start < DateTime.Parse("1900/1/1"));
      foreach (var pmTerm in pmList18)
      {
        Console.WriteLine(pmTerm.surname + ": {0:dd MMM yyyy} to {1:dd MMM yyyy}",
          pmTerm.inOffice.start, pmTerm.inOffice.end);
      }

This code is identical to the first TakeWhile() example above, except that we’ve replaced the call to TakeWhile() with one to SkipWhile(). The output is:

Borden: 10 Oct 1911 to 10 Jul 1920
Meighen: 10 Jul 1920 to 29 Dec 1921
Mackenzie King: 29 Dec 1921 to 28 Jun 1926
Meighen: 29 Jun 1926 to 25 Sep 1926
Mackenzie King: 25 Sep 1926 to 07 Aug 1930
Bennett: 07 Aug 1930 to 23 Oct 1935
Mackenzie King: 23 Oct 1935 to 15 Nov 1948
St. Laurent: 15 Nov 1948 to 21 Jun 1957
Diefenbaker: 21 Jun 1957 to 22 Apr 1963
Pearson: 22 Apr 1963 to 20 Apr 1968
Trudeau: 20 Apr 1968 to 03 Jun 1979
Clark: 04 Jun 1979 to 02 Mar 1980
Trudeau: 03 Mar 1980 to 29 Jun 1984
Turner: 30 Jun 1984 to 16 Sep 1984
Mulroney: 17 Sep 1984 to 24 Jun 1993
Campbell: 25 Jun 1993 to 03 Nov 1993
Chrétien: 04 Nov 1993 to 11 Dec 2003
Martin: 12 Dec 2003 to 05 Feb 2006
Harper: 06 Feb 2006 to 19 May 2012

If we wanted the first 5 terms after 1900 we could just add a Take(5) after the SkipWhile() above.
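A sketch of that combination, using a cut-down list of surnames and start years copied from the output above in place of the real Terms objects (the anonymous type and its field names are assumptions for illustration):

```csharp
using System;
using System.Linq;

static class SkipWhileTakeDemo
{
    // Surnames for the first five terms that started in 1900 or later.
    public static string[] FirstFiveAfter1900()
    {
        // Already in order of start date, as in the output above.
        var pmTerms = new[]
        {
            new { surname = "Tupper", startYear = 1896 },
            new { surname = "Laurier", startYear = 1896 },
            new { surname = "Borden", startYear = 1911 },
            new { surname = "Meighen", startYear = 1920 },
            new { surname = "Mackenzie King", startYear = 1921 },
            new { surname = "Meighen", startYear = 1926 },
            new { surname = "Mackenzie King", startYear = 1926 },
            new { surname = "Bennett", startYear = 1930 }
        };
        return pmTerms
            .SkipWhile(pmTerm => pmTerm.startYear < 1900)
            .Take(5)
            .Select(pmTerm => pmTerm.surname)
            .ToArray();
    }

    static void Main()
    {
        // Borden, Meighen, Mackenzie King, Meighen, Mackenzie King
        Console.WriteLine(string.Join(", ", FirstFiveAfter1900()));
    }
}
```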

SkipWhile() also has a second form in which the index of each input element is passed to the predicate.
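A minimal sketch of the indexed form, on a made-up array of years: elements are skipped while the year is before 1900 and the index is less than 2, so at most two elements are dropped.

```csharp
using System;
using System.Linq;

static class IndexedSkipWhileDemo
{
    public static int[] SkipAtMostTwoBefore1900()
    {
        int[] years = { 1867, 1873, 1878, 1891, 1911 };
        // 'index' is the zero-based position of 'year' in the input.
        return years
            .SkipWhile((year, index) => year < 1900 && index < 2)
            .ToArray();
    }

    static void Main()
    {
        foreach (int year in SkipAtMostTwoBefore1900())
        {
            Console.WriteLine(year); // 1878, 1891, 1911
        }
    }
}
```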

LINQ: SelectMany

The LINQ Select command that we’ve used so far returns a single object for each input object that is passed to it. In some cases, we would like to return several objects (or sometimes no objects at all) for a given input.

As an example, remember the list of Canada’s prime ministers that we’ve been using. Suppose we construct another list, this time of the terms of office of each prime minister. We use the id number of the prime minister to label each term of office, and then give the start and end dates of each term of office. Since some prime ministers served more than one term, there are multiple entries for some ids.

The class defining the terms of office is

  // Needs 'using System;' (DateTime) and 'using System.Collections;' (ArrayList).
  class Terms
  {
    public int id;
    public DateTime start, end;

    public static ArrayList GetTermsArrayList()
    {
      ArrayList terms = new ArrayList();

      terms.Add(new Terms { id = 1, start = DateTime.Parse("1867/7/1"), end = DateTime.Parse("1873/11/5") });
      terms.Add(new Terms { id = 1, start = DateTime.Parse("1878/10/17"), end = DateTime.Parse("1891/6/6") });
      terms.Add(new Terms { id = 2, start = DateTime.Parse("1873/11/7"), end = DateTime.Parse("1878/10/8") });
      terms.Add(new Terms { id = 3, start = DateTime.Parse("1891/6/16"), end = DateTime.Parse("1892/11/24") });
      terms.Add(new Terms { id = 4, start = DateTime.Parse("1892/12/5"), end = DateTime.Parse("1894/12/12") });
      terms.Add(new Terms { id = 5, start = DateTime.Parse("1894/12/21"), end = DateTime.Parse("1896/4/27") });
      terms.Add(new Terms { id = 6, start = DateTime.Parse("1896/5/1"), end = DateTime.Parse("1896/7/8") });
      terms.Add(new Terms { id = 7, start = DateTime.Parse("1896/7/11"), end = DateTime.Parse("1911/10/6") });
      terms.Add(new Terms { id = 8, start = DateTime.Parse("1911/10/10"), end = DateTime.Parse("1920/7/10") });
      terms.Add(new Terms { id = 9, start = DateTime.Parse("1920/7/10"), end = DateTime.Parse("1921/12/29") });
      terms.Add(new Terms { id = 9, start = DateTime.Parse("1926/6/29"), end = DateTime.Parse("1926/9/25") });
      terms.Add(new Terms { id = 10, start = DateTime.Parse("1921/12/29"), end = DateTime.Parse("1926/6/28") });
      terms.Add(new Terms { id = 10, start = DateTime.Parse("1926/9/25"), end = DateTime.Parse("1930/8/7") });
      terms.Add(new Terms { id = 10, start = DateTime.Parse("1935/10/23"), end = DateTime.Parse("1948/11/15") });
      terms.Add(new Terms { id = 11, start = DateTime.Parse("1930/8/7"), end = DateTime.Parse("1935/10/23") });
      terms.Add(new Terms { id = 12, start = DateTime.Parse("1948/11/15"), end = DateTime.Parse("1957/6/21") });
      terms.Add(new Terms { id = 13, start = DateTime.Parse("1957/6/21"), end = DateTime.Parse("1963/4/22") });
      terms.Add(new Terms { id = 14, start = DateTime.Parse("1963/4/22"), end = DateTime.Parse("1968/4/20") });
      terms.Add(new Terms { id = 15, start = DateTime.Parse("1968/4/20"), end = DateTime.Parse("1979/6/3") });
      terms.Add(new Terms { id = 15, start = DateTime.Parse("1980/3/3"), end = DateTime.Parse("1984/6/29") });
      terms.Add(new Terms { id = 16, start = DateTime.Parse("1979/6/4"), end = DateTime.Parse("1980/3/2") });
      terms.Add(new Terms { id = 17, start = DateTime.Parse("1984/6/30"), end = DateTime.Parse("1984/9/16") });
      terms.Add(new Terms { id = 18, start = DateTime.Parse("1984/9/17"), end = DateTime.Parse("1993/6/24") });
      terms.Add(new Terms { id = 19, start = DateTime.Parse("1993/6/25"), end = DateTime.Parse("1993/11/3") });
      terms.Add(new Terms { id = 20, start = DateTime.Parse("1993/11/4"), end = DateTime.Parse("2003/12/11") });
      terms.Add(new Terms { id = 21, start = DateTime.Parse("2003/12/12"), end = DateTime.Parse("2006/2/5") });
      terms.Add(new Terms { id = 22, start = DateTime.Parse("2006/2/6"), end = DateTime.Now });

      return terms;
    }

    public override string ToString()
    {
      return id + ". " + start.ToString("ddd dd MMM yyyy") + " - " + end.ToString("ddd dd MMM yyyy");
    }

    public static Terms[] GetTermsArray()
    {
      return (Terms[])GetTermsArrayList().ToArray(typeof(Terms));
    }
  }

Now, how do we construct a query that will return the terms of office for each prime minister, and label each term with the PM’s name rather than just the id number?

If you’re familiar with using a join in SQL you’ll probably be thinking this is a ‘join’ problem and you’d be right. There is a join clause in LINQ which we’ll get to in due course, but for now, we’ll look at another way of doing it.

The point here is that, since some PMs have more than one term, in some cases we’ll need to return more than one result for a given id. Thus a simple ‘select’ won’t work. LINQ provides the SelectMany() operator for such a purpose.

SelectMany() works like Select() in that it calls a function on each input object. However, instead of returning a single object, that function returns an IEnumerable<T> sequence which can contain any number (even zero) of objects, and SelectMany() joins all of these sequences end to end. It’s easiest to see how it works with an example. Here’s the query that gives the result we wanted above:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      Terms[] terms = Terms.GetTermsArray();
      var pmList10 = primeMinisters
        .SelectMany(pm => terms
          .Where(term => term.id == pm.id)
          .Select(term => new
          {
            surname = pm.lastName,
            inOffice = term
          }));
      foreach (var pmTerm in pmList10)
      {
        Console.WriteLine(pmTerm.surname + ": {0:dd MMM yyyy} to {1:dd MMM yyyy}",
          pmTerm.inOffice.start, pmTerm.inOffice.end);
      }

We extract the two data structures from their corresponding classes as usual. Now look at the call to SelectMany(). Since it’s called from primeMinisters, each object fed to it as input is of data type PrimeMinisters. We want to return a list of terms for that PM as output.

To do this, we start with the ‘terms’ sequence and call a Where() clause on it. Since this Where() is inside the SelectMany() call, it has access to the pm object passed to it, as well as the current ‘term’ object from the terms sequence. The Where() tests for the condition that the id field in the term object is equal to the id field in the pm object. This is effectively a join between the two data sources, since we are selecting only those elements from the terms sequence whose id matches the current pm object.

Once we’ve applied this filter, we call Select() (the ordinary Select(), since we need return only a single item for any given pair of pm and term objects) to construct an anonymous object consisting of the current PM’s last name and the entire ‘term’ object, which contains the start and end dates for the term.

Thus for each pm object passed to the lambda expression, a sequence of results (the terms corresponding to that pm) is constructed. SelectMany() calls the lambda once per pm and concatenates the sequences it returns to form the final output, which is stored in pmList10.
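The concatenation step is the only difference between Select() and SelectMany(). A small sketch with a made-up jagged array of start years (the third, empty entry is hypothetical and stands for an input with no matching terms):

```csharp
using System;
using System.Linq;

static class FlattenDemo
{
    // One inner array per input element; the last one is empty.
    static readonly int[][] StartYears =
    {
        new[] { 1867, 1878 },
        new[] { 1873 },
        new int[0]
    };

    // Select() keeps the nesting: one inner sequence per input element.
    public static int NestedCount()
    {
        return StartYears.Select(years => years).Count();
    }

    // SelectMany() concatenates the inner sequences into a single one.
    public static int[] Flattened()
    {
        return StartYears.SelectMany(years => years).ToArray();
    }

    static void Main()
    {
        Console.WriteLine(NestedCount());      // 3
        Console.WriteLine(Flattened().Length); // 3 (1867, 1878, 1873)
    }
}
```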

This time, rather than print out the raw anonymous object, we’ve selected the fields we want and formatted them a bit more nicely than before.

The results of this code are:

Macdonald: 01 Jul 1867 to 05 Nov 1873
Macdonald: 17 Oct 1878 to 06 Jun 1891
Mackenzie: 07 Nov 1873 to 08 Oct 1878
Abbott: 16 Jun 1891 to 24 Nov 1892
Thompson: 05 Dec 1892 to 12 Dec 1894
Bowell: 21 Dec 1894 to 27 Apr 1896
Tupper: 01 May 1896 to 08 Jul 1896
Laurier: 11 Jul 1896 to 06 Oct 1911
Borden: 10 Oct 1911 to 10 Jul 1920
Meighen: 10 Jul 1920 to 29 Dec 1921
Meighen: 29 Jun 1926 to 25 Sep 1926
Mackenzie King: 29 Dec 1921 to 28 Jun 1926
Mackenzie King: 25 Sep 1926 to 07 Aug 1930
Mackenzie King: 23 Oct 1935 to 15 Nov 1948
Bennett: 07 Aug 1930 to 23 Oct 1935
St. Laurent: 15 Nov 1948 to 21 Jun 1957
Diefenbaker: 21 Jun 1957 to 22 Apr 1963
Pearson: 22 Apr 1963 to 20 Apr 1968
Trudeau: 20 Apr 1968 to 03 Jun 1979
Trudeau: 03 Mar 1980 to 29 Jun 1984
Clark: 04 Jun 1979 to 02 Mar 1980
Turner: 30 Jun 1984 to 16 Sep 1984
Mulroney: 17 Sep 1984 to 24 Jun 1993
Campbell: 25 Jun 1993 to 03 Nov 1993
Chrétien: 04 Nov 1993 to 11 Dec 2003
Martin: 12 Dec 2003 to 05 Feb 2006
Harper: 06 Feb 2006 to 18 May 2012
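The point that the function passed to SelectMany() may return several objects, or none at all, for a given input can also be seen in isolation. In this sketch (made-up input, not part of the sample classes) each name is split into words, and the empty string contributes nothing to the result:

```csharp
using System;
using System.Linq;

static class SelectManyCountsDemo
{
    // Two words from the first name, none from the second, one from the third.
    public static string[] Words()
    {
        string[] names = { "Mackenzie King", "", "Laurier" };
        return names
            .SelectMany(name => name.Split(new[] { ' ' },
                StringSplitOptions.RemoveEmptyEntries))
            .ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Words())); // Mackenzie, King, Laurier
    }
}
```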

There is no ‘selectmany’ keyword in query expression syntax, but it is still possible to write the same query as a query expression: a second ‘from’ clause does the job, and the compiler translates it into a call to SelectMany(). Here it is:

      var pmList11 = from pm in primeMinisters
                     from term in terms
                     where term.id == pm.id
                     select new
                      {
                        surname = pm.lastName,
                        inOffice = term
                      };
      foreach (var pmTerm in pmList11)
      {
        Console.WriteLine(pmTerm.surname + ": {0:dd MMM yyyy} to {1:dd MMM yyyy}",
          pmTerm.inOffice.start, pmTerm.inOffice.end);
      }

Here, we enumerate the two sequences primeMinisters and terms using two ‘from’ clauses, so that every pm is paired with every term. We then apply the ‘where’ filter to keep only the pairs whose ids match, followed by an ordinary ‘select’. The output is the same as before.
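That the two forms agree can be checked directly. The sketch below uses hypothetical cut-down Pm and Term classes with three terms copied from the data above, rather than the full PrimeMinisters and Terms classes:

```csharp
using System;
using System.Linq;

// Hypothetical cut-down versions of the sample classes.
class Pm { public int id; public string lastName; }
class Term { public int id; public int startYear; }

static class TwoFromsDemo
{
    static readonly Pm[] Pms =
    {
        new Pm { id = 1, lastName = "Macdonald" },
        new Pm { id = 2, lastName = "Mackenzie" }
    };

    static readonly Term[] Terms =
    {
        new Term { id = 1, startYear = 1867 },
        new Term { id = 2, startYear = 1873 },
        new Term { id = 1, startYear = 1878 }
    };

    // Query expression with two 'from' clauses.
    public static string[] QueryForm()
    {
        var result = from pm in Pms
                     from term in Terms
                     where term.id == pm.id
                     select pm.lastName + " " + term.startYear;
        return result.ToArray();
    }

    // The same query written with SelectMany().
    public static string[] MethodForm()
    {
        return Pms
            .SelectMany(pm => Terms
                .Where(term => term.id == pm.id)
                .Select(term => pm.lastName + " " + term.startYear))
            .ToArray();
    }

    static void Main()
    {
        // Both print: Macdonald 1867, Macdonald 1878, Mackenzie 1873
        Console.WriteLine(string.Join(", ", QueryForm()));
        Console.WriteLine(string.Join(", ", MethodForm()));
    }
}
```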

LINQ: where and select clauses

In the last post, we gave an overview of LINQ and started coding with a simple example. In this post, we’ll take a closer look at two of the most commonly used clauses: where and select.

We saw select used in the last post, but it can do a few more things than we showed there. We’ll illustrate it by using the same sample data structure: a list of Canada’s prime ministers. In our previous example, we just printed out a list of all the prime ministers. This time, we’d like to select just those prime ministers whose first name is John. We can do this either by using a query expression or the standard query operators. First, the query expression:

      var pmList3 = from pm in primeMinisters
                    where pm.firstName.Equals("John")
                    select pm;
      foreach (PrimeMinisters pm in pmList3)
      {
        Console.WriteLine(pm);
      }

As before, the ‘from’ clause enumerates all the elements in primeMinisters, returning each element in turn as the variable pm. The elements are fed into the ‘where’ clause where they are tested against the condition that pm.firstName must be “John”. The argument of a ‘where’ can be any boolean expression. Finally, we use ‘select’ as before to yield the original object pm. Thus this expression will pick out just those elements of primeMinisters for which pm.firstName is John.

Note that we’ve specified the returned value pmList3 as a ‘var’, rather than giving an explicit data type as we did last time. In this case, we could have stated the data type explicitly since we know it must be IEnumerable<PrimeMinisters>, but ‘var’ is a lot easier to type. Remember that ‘var’ does not make the variable loosely typed: the compiler works out the variable’s data type, at compile time, from the expression used to initialize it.

The output of this code is

1. John Macdonald (Conservative)
3. John Abbott (Conservative)
4. John Thompson (Conservative)
13. John Diefenbaker (Conservative)
17. John Turner (Liberal)

Now we’ll look at how to write the same code using standard query operators. We get

      var pmList4 = primeMinisters.Where(pm => pm.firstName.Equals("John"));
      foreach (PrimeMinisters pm in pmList4)
      {
        Console.WriteLine(pm);
      }

The ‘where’ clause is essentially the same, except that we have to use a lambda expression to specify the predicate. Note, though, that here we don’t need a call to ‘Select()’ to finish the command off. If you think about it, the call to select in the query expression above is redundant (or should be) since all it does is just return everything produced by the ‘where’ clause. However, every query expression must end with either a ‘select’ or a ‘group’ clause, so we have to put one in.

The contents of pmList4 are the same as pmList3.

Next, a brief illustration of a compound predicate in the ‘where’ clause. We want a list of prime ministers whose first name is John and who are Conservative. We get (showing both the query expression and standard query forms):

      var pmList5 = from pm in primeMinisters
                    where pm.firstName.Equals("John") && pm.party.Equals("Conservative")
                    select pm;
      foreach (PrimeMinisters pm in pmList5)
      {
        Console.WriteLine(pm);
      }

      var pmList6 = primeMinisters.Where(pm => pm.firstName.Equals("John") && pm.party.Equals("Conservative"));
      foreach (PrimeMinisters pm in pmList6)
      {
        Console.WriteLine(pm);
      }

To combine boolean expressions, we use the usual logical operators from C#: && for a logical AND and || for a logical OR. The predicate can consist of as many of these statements as you want to string together.

The output of both of these bits of code is:

1. John Macdonald (Conservative)
3. John Abbott (Conservative)
4. John Thompson (Conservative)
13. John Diefenbaker (Conservative)
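The examples above use only &&. As a sketch of || and of mixing the two operators, here is a predicate over a hypothetical cut-down Minister class holding four rows from the list (the class is an assumption for illustration; parentheses make the grouping explicit, since && binds more tightly than ||):

```csharp
using System;
using System.Linq;

// Hypothetical cut-down version of the PrimeMinisters class.
class Minister
{
    public int id;
    public string firstName;
    public string party;
}

static class CompoundPredicateDemo
{
    static readonly Minister[] Ministers =
    {
        new Minister { id = 1, firstName = "John", party = "Conservative" },
        new Minister { id = 2, firstName = "Alexander", party = "Liberal" },
        new Minister { id = 8, firstName = "Robert", party = "Conservative" },
        new Minister { id = 17, firstName = "John", party = "Liberal" }
    };

    // Named John OR a Liberal (or both).
    public static int[] JohnOrLiberal()
    {
        return Ministers
            .Where(pm => pm.firstName.Equals("John") || pm.party.Equals("Liberal"))
            .Select(pm => pm.id)
            .ToArray();
    }

    // (Named John OR Robert) AND a Conservative.
    public static int[] ConservativeJohnOrRobert()
    {
        return Ministers
            .Where(pm => (pm.firstName.Equals("John") || pm.firstName.Equals("Robert"))
                         && pm.party.Equals("Conservative"))
            .Select(pm => pm.id)
            .ToArray();
    }

    static void Main()
    {
        Console.WriteLine(JohnOrLiberal().Length);            // 3 (ids 1, 2, 17)
        Console.WriteLine(ConservativeJohnOrRobert().Length); // 2 (ids 1, 8)
    }
}
```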

Now we’ll branch out a bit and see what else the ‘select’ can do. In the next example, we again look for men named John, but this time we want to print out only the id and last name of each man. We could do this in the WriteLine call of course by merely selecting the corresponding fields, but let’s do it slightly differently. We’ll have ‘select’ construct an object from an anonymous class that contains just the two bits of data we want. Here’s the code in both forms:

      var pmList7 = from pm in primeMinisters
                    where pm.firstName.Equals("John")
                    select new
                    {
                      id = pm.id,
                      surname = pm.lastName
                    };
      foreach (var name in pmList7)
      {
        Console.WriteLine(name);
      }

      var pmList8 = primeMinisters.Where(pm => pm.firstName.Equals("John")).
                    Select(pm => new
                    {
                      id = pm.id,
                      surname = pm.lastName
                    });
      foreach (var name in pmList8)
      {
        Console.WriteLine(name);
      }

We construct an object with ‘id’ and ‘surname’ properties for each ‘pm’ object that passes through the ‘where’ filter. In the query expression, we can use ‘pm’ straight off since the same object is available all the way through the expression. In the standard query form, we need to provide a lambda expression for ‘Select()’, as before. The output of either of these queries is

{ id = 1, surname = Macdonald }
{ id = 3, surname = Abbott }
{ id = 4, surname = Thompson }
{ id = 13, surname = Diefenbaker }
{ id = 17, surname = Turner }

In this case, we must use ‘var’ to declare the result of the query, since this result is an IEnumerable list containing objects of an anonymous type, so we don’t know what this type is called internally. However, we do know that each object has an ‘id’ field and a ‘surname’ field, so we can access those if we want. In this example, though, we’ve just printed out the bare object so you can see what happens. When an anonymous object is printed, we get all the fields and their values printed out and enclosed by braces.

The ‘select’ clause has a second form, in which the function that is passed to it contains two arguments rather than the single one we’ve seen so far. In this case, the second argument is the index of the object in the sequence that is passed into the Select() method. If we want to use this form of Select(), we must use the standard query notation as there is no equivalent in a query expression.

As an example, we want to select all men named John and produce a numbered list where the numbers are sequential, rather than the id numbers from the original list. We have

      var pmList9 = primeMinisters.Where(pm => pm.firstName.Equals("John")).
                    Select((pm, index) => new
                    {
                      id = index + 1,
                      surname = pm.lastName
                    });
      foreach (var name in pmList9)
      {
        Console.WriteLine(name);
      }

Note that two arguments are provided in the lambda expression. The ‘index’ is the zero-based index of the element ‘pm’ in the input sequence. We again construct an anonymous object, adding 1 to ‘index’ so we get a 1-based sequence as output. The output is

{ id = 1, surname = Macdonald }
{ id = 2, surname = Abbott }
{ id = 3, surname = Thompson }
{ id = 4, surname = Diefenbaker }
{ id = 5, surname = Turner }

The Where() method also has a two-argument form, where the second argument is the zero-based index of each element in the sequence that is input into the Where(). It too can be used only in standard query form.
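A minimal sketch of the indexed Where(), using the first five surnames from the list as made-up input: it keeps only the elements at even positions.

```csharp
using System;
using System.Linq;

static class IndexedWhereDemo
{
    public static string[] EveryOtherName()
    {
        string[] names = { "Macdonald", "Mackenzie", "Abbott", "Thompson", "Bowell" };
        // 'index' is the zero-based position of 'name' in the input.
        return names
            .Where((name, index) => index % 2 == 0)
            .ToArray();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", EveryOtherName())); // Macdonald, Abbott, Bowell
    }
}
```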