Tag Archives: LINQ GroupBy

LINQ Groups: Equality testing and result selection

In the last post we saw how to use LINQ GroupBy() for relatively simple grouping. GroupBy() also has a few more advanced features that are worth looking at: custom equality tests, custom return types and result selection.

Custom equality tests

First, we saw before that the key used by GroupBy() to do the grouping could be calculated from the data fields in the objects in the sequence being grouped, rather than being just one of the bare data fields itself. For simple cases, it’s easiest to just place this calculation directly in the call to GroupBy() as we did earlier. However, sometimes the grouping key gets a bit more complex. LINQ allows us to define our own equality test for use in determining how keys are compared. As an example, suppose we wanted to group the terms of office of Canada’s prime ministers according to how many years each of these terms spanned. That is, we’d like all terms less than a year in one group, then those between 1 and 2 years and so on. Since a Terms object contains only the start and end dates of the term as DateTime objects, we need to calculate the difference to get a TimeSpan object and then declare that two such objects that lie within the same span of years are ‘equal’.

In order to create an equality test, we need to write a custom class that implements the IEqualityComparer<T> interface, where T is the data type being compared. This interface has two methods, Equals(T, T) and GetHashCode(T). The Equals() method returns a bool which is true if its two arguments are defined as equal and false if not. The GetHashCode() method is needed since grouping is done by storing sequence elements in a hash table, so we need to make sure that the hash codes for two elements that are defined as ‘equal’ are the same.

For our example here, we can use the following class:

  class TermEqualityComparer : IEqualityComparer<TimeSpan>
  {
    public bool Equals(TimeSpan x, TimeSpan y)
    {
      return x.Days / 365 == y.Days / 365;
    }

    public int GetHashCode(TimeSpan obj)
    {
      return (obj.Days / 365).GetHashCode();
    }
  }

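Our equality test divides the number of days in each TimeSpan object by 365 using integer division. If the two quotients are equal, the TimeSpans represent terms that lie in the same one-year span. Because this ignores leap years, a term that falls just short of a whole number of calendar years can land in the next group up: Lester Pearson’s term (22 Apr 1963 to 20 Apr 1968) spans 1825 days, which is exactly 5 × 365, so it appears in the ‘5 to 6 years’ group in the output below.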

For the hash code, we just use the same division and return the built-in hash code for the quotient. This ensures that all TimeSpans within the same year get the same hash code.
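As a quick check that Equals() and GetHashCode() agree with each other, here is a minimal sketch that feeds the comparer a few durations. The day counts (400, 700 and 740) are made up purely for illustration, and the comparer class is repeated so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Same comparer as above, repeated so this sketch is self-contained.
class TermEqualityComparer : IEqualityComparer<TimeSpan>
{
  public bool Equals(TimeSpan x, TimeSpan y)
  {
    return x.Days / 365 == y.Days / 365;
  }

  public int GetHashCode(TimeSpan obj)
  {
    return (obj.Days / 365).GetHashCode();
  }
}

static class ComparerCheck
{
  static void Main()
  {
    var comparer = new TermEqualityComparer();

    // Hypothetical durations, chosen only to exercise the comparer.
    TimeSpan a = TimeSpan.FromDays(400);  // 400 / 365 == 1
    TimeSpan b = TimeSpan.FromDays(700);  // 700 / 365 == 1
    TimeSpan c = TimeSpan.FromDays(740);  // 740 / 365 == 2

    // 'Equal' keys must also produce equal hash codes.
    Console.WriteLine(comparer.Equals(a, b));                                // True
    Console.WriteLine(comparer.GetHashCode(a) == comparer.GetHashCode(b));  // True

    // Keys in different year spans are not equal.
    Console.WriteLine(comparer.Equals(b, c));                                // False
  }
}
```

If the two methods disagreed (say, GetHashCode() returned the hash of the raw TimeSpan), keys that Equals() considers equal could end up in different hash buckets and would never be compared, so they would land in separate groups.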

With this class, we can now write a GroupBy() call that does what we want:

      TermEqualityComparer termEqualityComparer = new TermEqualityComparer();
      var pmList37 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.end - pmTerm.start, termEqualityComparer)
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList37)
      {
        int years = pmGroup.Key.Days / 365;
        Console.WriteLine("{0} to {1} years:", years, years + 1);
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
            pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
        }
      }

We declare a TermEqualityComparer object first. The LINQ code is much the same as in our earlier example in the last post, up to the GroupBy() call. This time it has two arguments. The first is the quantity to be used as the key, as usual, which in this case is the difference between the start and end of the term. The second argument is the equality testing object. GroupBy() calculates the key for each sequence element and then uses the GetHashCode() and Equals() methods of the equality tester to compare these keys and sort the elements into groups.

You might wonder about the last OrderBy() call, which sorts the groups based on their keys. The actual TimeSpans for each element within a group may all be different, but according to our equality test, all TimeSpans within a single group are ‘equal’, so it doesn’t matter which one is used in the OrderBy().

Where the actual values of the keys do matter, though, is when we try to use them in some other calculation. In our example, we want to print out the groups of terms, with each labelled by its key. However, if there is more than one element in a group, the TimeSpan for each element will probably be different, and since only one key is saved for each group, we can’t be sure which element in the group has that key (in fact, it seems to be the first element assigned to the group that has its key used for the group). Thus it’s usually best to use keys only in the same way that the original GroupBy() call did. In our example, we divide pmGroup.Key.Days by 365 to get the year span represented by that key, since we know that value does apply to all elements within that group.
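A minimal sketch of this point, assuming some made-up term lengths in days rather than the real data. The Describe() helper and the GroupKeyDemo class are hypothetical names used only for this illustration, and the comparer is repeated so that the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Same comparer as above, repeated so this sketch is self-contained.
class TermEqualityComparer : IEqualityComparer<TimeSpan>
{
  public bool Equals(TimeSpan x, TimeSpan y)
  {
    return x.Days / 365 == y.Days / 365;
  }

  public int GetHashCode(TimeSpan obj)
  {
    return (obj.Days / 365).GetHashCode();
  }
}

static class GroupKeyDemo
{
  // Groups the day counts by year span and describes each group.
  public static List<string> Describe(IEnumerable<int> dayCounts)
  {
    return dayCounts
      .Select(d => TimeSpan.FromDays(d))
      .GroupBy(span => span, new TermEqualityComparer())
      .Select(g => string.Format("Key.Days = {0}, years = {1}, count = {2}",
        g.Key.Days, g.Key.Days / 365, g.Count()))
      .ToList();
  }

  static void Main()
  {
    // Hypothetical term lengths in days, not real data.
    foreach (string line in Describe(new[] { 700, 400, 30, 500 }))
    {
      Console.WriteLine(line);
    }
  }
}
```

The durations of 700, 400 and 500 days all end up in one group, and whichever of them is stored as the key, dividing Key.Days by 365 gives 1. In practice Key.Days tends to be 700 here (the first element seen), but the documentation does not appear to promise that, which is why only the divided value should be relied on.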

The result of the code is:

0 to 1 years:
  Charles Tupper: 01 May 1896 to 08 Jul 1896
  Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
  Joe Clark: 04 Jun 1979 to 02 Mar 1980
  John Turner: 30 Jun 1984 to 16 Sep 1984
  Kim Campbell: 25 Jun 1993 to 03 Nov 1993
1 to 2 years:
  John Abbott: 16 Jun 1891 to 24 Nov 1892
  Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
  Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
2 to 3 years:
  John Thompson: 05 Dec 1892 to 12 Dec 1894
  Paul Martin: 12 Dec 2003 to 05 Feb 2006
3 to 4 years:
  William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
4 to 5 years:
  Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
  William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
  Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
5 to 6 years:
  Richard Bennett: 07 Aug 1930 to 23 Oct 1935
  John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
  Lester Pearson: 22 Apr 1963 to 20 Apr 1968
6 to 7 years:
  John Macdonald: 01 Jul 1867 to 05 Nov 1873
  Stephen Harper: 06 Feb 2006 to 25 May 2012
8 to 9 years:
  Robert Borden: 10 Oct 1911 to 10 Jul 1920
  Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
  Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
10 to 11 years:
  Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
11 to 12 years:
  Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
12 to 13 years:
  John Macdonald: 17 Oct 1878 to 06 Jun 1891
13 to 14 years:
  William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
15 to 16 years:
  Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911

Custom return types

A GroupBy() call also allows you to customize which data fields should be returned, in much the same way as Join() did. For example, if we want to group the terms into the decades in which they started (as we did in the last post), we can have GroupBy() return only the last name and start date for each term. The code is:

      var pmList38 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.start.Year / 10,
          pmTerm => new
          {
            last = pmTerm.last,
            start = pmTerm.start
          })
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList38)
      {
        Console.WriteLine("{0}s:", (pmGroup.Key * 10));
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0}: {1:dd MMM yyyy}",
            pmTerm.last, pmTerm.start);
        }
      }

In this case, the second argument of GroupBy() is a function that takes a single parameter (pmTerm here) which is used to construct the returned object to be placed in the group. Here, each object in a group will be an anonymous type with two fields: last and start. We use these two fields in the printout, and we get:

1860s:
  Macdonald: 01 Jul 1867
1870s:
  Mackenzie: 07 Nov 1873
  Macdonald: 17 Oct 1878
1890s:
  Abbott: 16 Jun 1891
  Thompson: 05 Dec 1892
  Bowell: 21 Dec 1894
  Tupper: 01 May 1896
  Laurier: 11 Jul 1896
1910s:
  Borden: 10 Oct 1911
1920s:
  Meighen: 10 Jul 1920
  Mackenzie King: 29 Dec 1921
  Meighen: 29 Jun 1926
  Mackenzie King: 25 Sep 1926
1930s:
  Bennett: 07 Aug 1930
  Mackenzie King: 23 Oct 1935
1940s:
  St. Laurent: 15 Nov 1948
1950s:
  Diefenbaker: 21 Jun 1957
1960s:
  Pearson: 22 Apr 1963
  Trudeau: 20 Apr 1968
1970s:
  Clark: 04 Jun 1979
1980s:
  Trudeau: 03 Mar 1980
  Turner: 30 Jun 1984
  Mulroney: 17 Sep 1984
1990s:
  Campbell: 25 Jun 1993
  Chrétien: 04 Nov 1993
2000s:
  Martin: 12 Dec 2003
  Harper: 06 Feb 2006

Result selection

Finally, we can ask GroupBy() to return a single object for each group, rather than the entire group. For example, suppose we want a count of the number of terms that started in each decade, together with the earliest term in each decade. We can do that as follows:

      var pmList39 = terms
        .OrderBy(term => term.start)
        .GroupBy(term => term.start.Year / 10,
          (year, termGroup) => new
          {
            decade = year * 10,
            number = termGroup.Count(),
            earliest = termGroup.Min(term => term.start)
          });
      Console.WriteLine("*** pmList39");
      foreach (var term in pmList39)
      {
        Console.WriteLine("{0}s:\n  {1} terms\n  Earliest: {2: dd MMM yyyy}",
          term.decade, term.number, term.earliest);
      }

In this case, the second argument in GroupBy() is a function which takes two parameters. The first parameter is the key for a given group, and the second parameter is the group itself. We can use this information to construct a summary object for that group. In this example, we create an anonymous object with 3 fields: the decade (calculated from the key ‘year’), the number of terms in that decade (by applying the Count() method to the group), and the earliest term (by applying the Min() method and passing it the start date).

This version of GroupBy() produces a list of single objects rather than a list of groups, so only a single loop is needed to iterate through it. The results are:

1860s:
  1 terms
  Earliest:  01 Jul 1867
1870s:
  2 terms
  Earliest:  07 Nov 1873
1890s:
  5 terms
  Earliest:  16 Jun 1891
1910s:
  1 terms
  Earliest:  10 Oct 1911
1920s:
  4 terms
  Earliest:  10 Jul 1920
1930s:
  2 terms
  Earliest:  07 Aug 1930
1940s:
  1 terms
  Earliest:  15 Nov 1948
1950s:
  1 terms
  Earliest:  21 Jun 1957
1960s:
  2 terms
  Earliest:  22 Apr 1963
1970s:
  1 terms
  Earliest:  04 Jun 1979
1980s:
  3 terms
  Earliest:  03 Mar 1980
1990s:
  2 terms
  Earliest:  25 Jun 1993
2000s:
  2 terms
  Earliest:  12 Dec 2003

Note the differences between these calls to GroupBy(). The first argument is always the key to be used in the grouping. If the second argument is an IEqualityComparer object, it is used to compare keys. If this argument is a function with a single parameter, it is used to select fields from each object placed in the group. Finally, if the argument is a function with two parameters, it is used to produce a summary object for each group.

These 3 features can be used in any combination (which is why there are 8 prototypes for GroupBy()). Whichever features you want to include, remember that they are placed in the order source.GroupBy(keySelector, elementSelector, resultSelector, equalityComparer).
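To illustrate the argument order, here is a minimal sketch of a single GroupBy() call that uses all three features at once. The names and day counts are made up, and YearSpanComparer, AllFeaturesDemo and Summarize() are hypothetical names introduced only for this example (the comparer works on plain int day counts rather than TimeSpans to keep things short).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two day counts are 'equal' if they lie in the same 365-day span.
class YearSpanComparer : IEqualityComparer<int>
{
  public bool Equals(int x, int y)
  {
    return x / 365 == y / 365;
  }

  public int GetHashCode(int days)
  {
    return (days / 365).GetHashCode();
  }
}

static class AllFeaturesDemo
{
  public static List<string> Summarize()
  {
    // Made-up names and term lengths in days.
    var terms = new[]
    {
      new { name = "Ann", days = 100 },
      new { name = "Bob", days = 800 },
      new { name = "Cat", days = 200 },
      new { name = "Dan", days = 900 }
    };

    return terms
      .GroupBy(
        t => t.days,                          // keySelector
        t => t.name,                          // elementSelector
        (key, names) => string.Format(        // resultSelector
          "{0} to {1} years: {2}",
          key / 365, key / 365 + 1, string.Join(", ", names)),
        new YearSpanComparer())               // equalityComparer
      .ToList();
  }

  static void Main()
  {
    foreach (string line in Summarize())
    {
      Console.WriteLine(line);
    }
    // Prints:
    // 0 to 1 years: Ann, Cat
    // 2 to 3 years: Bob, Dan
  }
}
```

Note that the result selector only uses the key after dividing it by 365, for the same reason as in the custom equality example: the stored key is just one of several ‘equal’ values.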

LINQ Groups: Basic Groups

We’ve seen in the last post that LINQ’s Join() operator allows its results to be grouped according to the value of the key used to match pairs from two lists. LINQ offers a much more general grouping facility with the GroupBy() operator. There are actually 8 varieties of GroupBy(), so we’ll have a look at the features that make them up. In this post, we’ll look at the simplest form of GroupBy() and consider the more advanced features in the next post.

All GroupBy() operators take a single sequence as input (as opposed to Join(), which takes two), and they all require you to specify a key value which is used for dividing the elements of the sequence into groups. The most basic form of GroupBy() does just that, with no frills. As an example, suppose we want a list of Canada’s prime ministers divided into groups according to the first letter of their last names (as might be found in an index). We can do that as follows:

      var pmList33a = primeMinisters.GroupBy(pm => pm.lastName[0]);
      foreach (var pmGroup in pmList33a)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

The single argument of GroupBy() is a function that calculates the key from a sequence element. Since our input sequence primeMinisters contains objects of class PrimeMinisters, we select the lastName field (a string) and take its first character.

A GroupBy() operation returns a sequence of groups rather than a sequence of individual elements. The prototype of this simplest version of GroupBy() is:

public static IEnumerable<IGrouping<TKey, TSource>> GroupBy<TSource, TKey>(
	this IEnumerable<TSource> source,
	Func<TSource, TKey> keySelector
)

From the return type, we see that GroupBy() returns an IEnumerable sequence, where each element is of type IGrouping<TKey, TSource>. That is, each group consists of a list of objects of type TSource accompanied by a single key value of type TKey. In our example here, TSource is PrimeMinisters and TKey is char.

Because the object returned by GroupBy() is a list of groups, if we want to access the individual elements of each group we need a nested loop; the outer loop iterates over the groups and the inner loop iterates over the elements within each group. Note that we’ve used the Key property of the group in printing the output; Key is present in all IGrouping objects and contains the key value for that particular group. Thus the code above produces this output:

Group M:
  John Macdonald
  Alexander Mackenzie
  Arthur Meighen
  William Mackenzie King
  Brian Mulroney
  Paul Martin
Group A:
  John Abbott
Group T:
  John Thompson
  Charles Tupper
  Pierre Trudeau
  John Turner
Group B:
  Mackenzie Bowell
  Robert Borden
  Richard Bennett
Group L:
  Wilfrid Laurier
Group S:
  Louis St. Laurent
Group D:
  John Diefenbaker
Group P:
  Lester Pearson
Group C:
  Joe Clark
  Kim Campbell
  Jean Chrétien
Group H:
  Stephen Harper

The groups are created in the order they appear in the original sequence (primeMinisters), and the elements within each group are added in the order in which they appear in this sequence as well. That’s why the M group comes first, and the elements within each group are not in alphabetical order.

This simplest form of GroupBy() can also be written as a query expression, so the above code would look like this:

      var pmList33 = from pm in primeMinisters
                     group pm by pm.lastName[0];
      foreach (var pmGroup in pmList33)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

The ‘from’ clause specifies the input sequence, and the key selector is given following the ‘by’ keyword.
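In either syntax the result is a sequence of IGrouping objects, yielded in the order in which each key first appears. A minimal sketch that spells out the types that ‘var’ hides, using a few last names as a stand-in for the full primeMinisters list (the GroupingTypes class and ByInitial() method are hypothetical names used only here):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupingTypes
{
  // The declared return type is what 'var' would infer for this query.
  public static IEnumerable<IGrouping<char, string>> ByInitial(IEnumerable<string> names)
  {
    return from name in names
           group name by name[0];
  }

  static void Main()
  {
    // A few last names standing in for the primeMinisters sequence.
    string[] lastNames = { "Macdonald", "Mackenzie", "Abbott", "Meighen" };

    foreach (IGrouping<char, string> group in ByInitial(lastNames))
    {
      char key = group.Key;                // the single key shared by the group
      Console.WriteLine("Group {0}:", key);
      foreach (string name in group)       // an IGrouping is itself enumerable
      {
        Console.WriteLine("  {0}", name);
      }
    }
    // Prints:
    // Group M:
    //   Macdonald
    //   Mackenzie
    //   Meighen
    // Group A:
    //   Abbott
  }
}
```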

If we want to order the output so that both the groups and the contents of each group are in alphabetical order, we can do this by adding a couple of orderby clauses. Here’s the result in both syntaxes:

      var pmList34 = from pm in primeMinisters
                     orderby pm.lastName
                     group pm by pm.lastName[0] into pmGroups
                     orderby pmGroups.Key
                     select pmGroups;
      foreach (var pmGroup in pmList34)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

      var pmList34a = primeMinisters
        .OrderBy(pm => pm.lastName)
        .GroupBy(pm => pm.lastName[0])
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList34a)
      {
        Console.WriteLine("Group {0}:", pmGroup.Key);
        foreach (var pm in pmGroup)
        {
          Console.WriteLine("  {0} {1}", pm.firstName, pm.lastName);
        }
      }

The standard query operator form (the second one) is the most straightforward: we first order the overall primeMinisters list, then group it as before, and finally order the output of the GroupBy() by doing an OrderBy() on the keys of the groups.

In the query expression form, we can’t follow a group clause directly by an orderby. We must first save the results of the group operation in a variable specified by the ‘into’ keyword (the same technique was used in a group join in the last post). Thus here we save the result of the group in pmGroups, and then apply orderby to that. The final ‘select pmGroups’ clause selects the group so the final output is a sequence of groups as before. The output from both forms of the code is:

Group A:
  John Abbott
Group B:
  Richard Bennett
  Robert Borden
  Mackenzie Bowell
Group C:
  Kim Campbell
  Jean Chrétien
  Joe Clark
Group D:
  John Diefenbaker
Group H:
  Stephen Harper
Group L:
  Wilfrid Laurier
Group M:
  John Macdonald
  Alexander Mackenzie
  William Mackenzie King
  Paul Martin
  Arthur Meighen
  Brian Mulroney
Group P:
  Lester Pearson
Group S:
  Louis St. Laurent
Group T:
  John Thompson
  Pierre Trudeau
  Charles Tupper
  John Turner

The key used for grouping need not be a simple data field; it can be a calculated value. For example, if we wanted to group the prime ministers’ terms of office into the decades in which they started, we could do something like this:

      var pmList36 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
                      {
                        first = pm.firstName,
                        last = pm.lastName,
                        start = term.start,
                        end = term.end
                      })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.start.Year / 10)
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList36)
      {
        Console.WriteLine("{0}s:", (pmGroup.Key * 10));
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
            pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
        }
      }

The Join() call connects the list containing the PMs’ names with the list containing their terms. We order this list by the start date of each term, then pass the result into a GroupBy(). Here the key is the year of the start date divided by 10 (using integer division, which throws away the remainder). All terms starting in the same decade will be in the same group. The output is:

1860s:
  John Macdonald: 01 Jul 1867 to 05 Nov 1873
1870s:
  Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
  John Macdonald: 17 Oct 1878 to 06 Jun 1891
1890s:
  John Abbott: 16 Jun 1891 to 24 Nov 1892
  John Thompson: 05 Dec 1892 to 12 Dec 1894
  Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
  Charles Tupper: 01 May 1896 to 08 Jul 1896
  Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911
1910s:
  Robert Borden: 10 Oct 1911 to 10 Jul 1920
1920s:
  Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
  William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
  Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
  William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
1930s:
  Richard Bennett: 07 Aug 1930 to 23 Oct 1935
  William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
1940s:
  Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
1950s:
  John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
1960s:
  Lester Pearson: 22 Apr 1963 to 20 Apr 1968
  Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
1970s:
  Joe Clark: 04 Jun 1979 to 02 Mar 1980
1980s:
  Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
  John Turner: 30 Jun 1984 to 16 Sep 1984
  Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
1990s:
  Kim Campbell: 25 Jun 1993 to 03 Nov 1993
  Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
2000s:
  Paul Martin: 12 Dec 2003 to 05 Feb 2006
  Stephen Harper: 06 Feb 2006 to 25 May 2012

A ‘select’ normally ends a query expression, so we can’t follow the ‘select’ that creates the output of the ‘join’ directly with a ‘group’ clause (although the ‘into’ keyword can continue a query after a ‘select’, just as it does after a ‘group’). One straightforward way to do the job is to use two separate commands, and we get:

      var pmList35a = from pm in primeMinisters
                      join term in terms on pm.id equals term.id
                      orderby term.start
                      select new
                      {
                        first = pm.firstName,
                        last = pm.lastName,
                        start = term.start,
                        end = term.end
                      };
      var pmList35b = from pmTerm in pmList35a
                      group pmTerm by pmTerm.start.Year / 10 into pmGroups
                      orderby pmGroups.Key
                      select pmGroups;
      foreach (var pmGroup in pmList35b)
      {
        Console.WriteLine("{0}s:", (pmGroup.Key * 10));
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
            pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
        }
      }
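For comparison, the two queries can be merged into one by continuing after the ‘select’ with ‘into’. The following is a minimal sketch of that idea rather than a replacement for the code above: it uses a small stand-in data set built from anonymous types (three of the terms listed in the output earlier), and the SingleQueryDemo class and DecadeSummary() method are hypothetical names introduced for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SingleQueryDemo
{
  public static List<string> DecadeSummary()
  {
    // Tiny stand-in data; the field names mirror the ones used above.
    var primeMinisters = new[]
    {
      new { id = 1, firstName = "John", lastName = "Macdonald" },
      new { id = 2, firstName = "Alexander", lastName = "Mackenzie" }
    };
    var terms = new[]
    {
      new { id = 1, start = new DateTime(1878, 10, 17), end = new DateTime(1891, 6, 6) },
      new { id = 2, start = new DateTime(1873, 11, 7), end = new DateTime(1878, 10, 8) },
      new { id = 1, start = new DateTime(1867, 7, 1), end = new DateTime(1873, 11, 5) }
    };

    // 'select ... into pmTerm' continues the query instead of ending it.
    var pmGroups = from pm in primeMinisters
                   join term in terms on pm.id equals term.id
                   orderby term.start
                   select new
                   {
                     first = pm.firstName,
                     last = pm.lastName,
                     start = term.start,
                     end = term.end
                   }
                   into pmTerm
                   group pmTerm by pmTerm.start.Year / 10 into decade
                   orderby decade.Key
                   select decade;

    // One line per decade, listing last names in order of start date.
    return pmGroups
      .Select(g => string.Format("{0}s: {1}",
        g.Key * 10, string.Join(", ", g.Select(t => t.last))))
      .ToList();
  }

  static void Main()
  {
    foreach (string line in DecadeSummary())
    {
      Console.WriteLine(line);
    }
    // Prints:
    // 1860s: Macdonald
    // 1870s: Mackenzie, Macdonald
  }
}
```

Whether the single query is easier to read than the two-step version is a matter of taste; both describe the same join, ordering and grouping.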