In the last post we saw how to use LINQ GroupBy() for relatively simple grouping. GroupBy() is capable of a couple of more advanced features which are worth looking at.
Custom equality tests
First, we saw before that the key used by GroupBy() to do the grouping could be calculated from the data fields in the objects in the sequence being grouped, rather than being just one of the bare data fields itself. For simple cases, it’s easiest to just place this calculation directly in the call to GroupBy() as we did earlier. However, sometimes the grouping key gets a bit more complex. LINQ allows us to define our own equality test for use in determining how keys are compared. As an example, suppose we wanted to group the terms of office of Canada’s prime ministers according to how many years each of these terms spanned. That is, we’d like all terms less than a year in one group, then those between 1 and 2 years and so on. Since a Terms object contains only the start and end dates of the term as DateTime objects, we need to calculate the difference to get a TimeSpan object and then declare that two such objects that lie within the same span of years are ‘equal’.
In order to create an equality test, we need to write a custom class that implements the IEqualityComparer<T> interface, where T is the data type being compared. This interface has two methods, Equals(T, T) and GetHashCode(T). The Equals() method returns a bool which is true if its two arguments are defined as equal and false if not. The GetHashCode() method is needed since grouping is done by storing sequence elements in a hash table, so we need to make sure that the hash codes for two elements that are defined as ‘equal’ are the same.
For our example here, we can use the following class:
class TermEqualityComparer : IEqualityComparer<TimeSpan> { public bool Equals(TimeSpan x, TimeSpan y) { return x.Days / 365 == y.Days / 365; } public int GetHashCode(TimeSpan obj) { return (obj.Days / 365).GetHashCode(); } }
Our equality test divides the number of days in each TimeSpan object by 365 (OK, we’re ignoring leap years) using integer division. If the two TimeSpans are equal in this measure then they represent terms that lie in the same one-year span.
For the hash code, we just use the same division and return the built-in hash code for the quotient. This ensures that all TimeSpans within the same year get the same hash code.
With this class, we can now write a GroupBy() call that does what we want:
TermEqualityComparer termEqualityComparer = new TermEqualityComparer(); var pmList37 = primeMinisters .Join(terms, pm => pm.id, term => term.id, (pm, term) => new { first = pm.firstName, last = pm.lastName, start = term.start, end = term.end }) .OrderBy(pmTerm => pmTerm.start) .GroupBy(pmTerm => pmTerm.end - pmTerm.start, termEqualityComparer) .OrderBy(pmGroup => pmGroup.Key); foreach (var pmGroup in pmList37) { int years = pmGroup.Key.Days / 365; Console.WriteLine("{0} to {1} years:", years, years + 1); foreach (var pmTerm in pmGroup) { Console.WriteLine(" {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}", pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end); } }
We declare a TermEqualityComparer object first. The LINQ code is much the same as in our earlier example in the last post, up to the GroupBy() call. This time it has two arguments. The first is the quantity to be used as the key, as usual, which in this case is the difference between the start and end of the term. The second argument is the equality testing object, so GroupBy() will pass the first argument to the Equals() method in the equality tester for each sequence element and use that test to sort the elements into groups.
You might wonder about the last OrderBy() call, which sorts the groups based on their keys. The actual TimeSpans for each element within a group may all be different, but according to our equality test, all TimeSpans within a single group are ‘equal’, so it doesn’t matter which one is used in the OrderBy().
Where the actual values of the keys does matter though is when we try to use their value in some other calculation. In our example, we want to print out the groups of terms, with each labelled by its key. However, if there is more than one element in a group, the TimeSpan for each element will probably be different, and since only one key is saved for each group, we can’t be sure which element in the group has that key (in fact, it seems to be the first element assigned to the group that has its key used for the group). Thus it’s usually best to use keys only in the same way that the original GroupBy() call did. In our example, we divide pmGroup.Key.Days by 365 to get the year span represented by that key, since we know that value does apply to all elements within that group.
The result of the code is:
0 to 1 years: Charles Tupper: 01 May 1896 to 08 Jul 1896 Arthur Meighen: 29 Jun 1926 to 25 Sep 1926 Joe Clark: 04 Jun 1979 to 02 Mar 1980 John Turner: 30 Jun 1984 to 16 Sep 1984 Kim Campbell: 25 Jun 1993 to 03 Nov 1993 1 to 2 years: John Abbott: 16 Jun 1891 to 24 Nov 1892 Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896 Arthur Meighen: 10 Jul 1920 to 29 Dec 1921 2 to 3 years: John Thompson: 05 Dec 1892 to 12 Dec 1894 Paul Martin: 12 Dec 2003 to 05 Feb 2006 3 to 4 years: William Mackenzie King: 25 Sep 1926 to 07 Aug 1930 4 to 5 years: Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878 William Mackenzie King: 29 Dec 1921 to 28 Jun 1926 Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984 5 to 6 years: Richard Bennett: 07 Aug 1930 to 23 Oct 1935 John Diefenbaker: 21 Jun 1957 to 22 Apr 1963 Lester Pearson: 22 Apr 1963 to 20 Apr 1968 6 to 7 years: John Macdonald: 01 Jul 1867 to 05 Nov 1873 Stephen Harper: 06 Feb 2006 to 25 May 2012 8 to 9 years: Robert Borden: 10 Oct 1911 to 10 Jul 1920 Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957 Brian Mulroney: 17 Sep 1984 to 24 Jun 1993 10 to 11 years: Jean Chrétien: 04 Nov 1993 to 11 Dec 2003 11 to 12 years: Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979 12 to 13 years: John Macdonald: 17 Oct 1878 to 06 Jun 1891 13 to 14 years: William Mackenzie King: 23 Oct 1935 to 15 Nov 1948 15 to 16 years: Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911
Custom return types
A GroupBy() call also allows you to customize which data fields should be returned, in much the same way as Join() did. For example, if we want to group the terms into the decades in which they started (as we did in the last post), we can have GroupBy() return only the last name and start date for each term. The code is:
var pmList38 = primeMinisters .Join(terms, pm => pm.id, term => term.id, (pm, term) => new { first = pm.firstName, last = pm.lastName, start = term.start, end = term.end }) .OrderBy(pmTerm => pmTerm.start) .GroupBy(pmTerm => pmTerm.start.Year / 10, pmTerm => new { last = pmTerm.last, start = pmTerm.start }) .OrderBy(pmGroup => pmGroup.Key); foreach (var pmGroup in pmList38) { Console.WriteLine("{0}s:", (pmGroup.Key * 10)); foreach (var pmTerm in pmGroup) { Console.WriteLine(" {0}: {1:dd MMM yyyy}", pmTerm.last, pmTerm.start); } }
In this case, the second argument of GroupBy() is a function that takes a single parameter (pmTerm here) which is used to construct the returned object to be placed in the group. Here, each object in a group will be an anonymous type with two fields: last and start. We use these two fields in the printout, and we get:
1860s: Macdonald: 01 Jul 1867 1870s: Mackenzie: 07 Nov 1873 Macdonald: 17 Oct 1878 1890s: Abbott: 16 Jun 1891 Thompson: 05 Dec 1892 Bowell: 21 Dec 1894 Tupper: 01 May 1896 Laurier: 11 Jul 1896 1910s: Borden: 10 Oct 1911 1920s: Meighen: 10 Jul 1920 Mackenzie King: 29 Dec 1921 Meighen: 29 Jun 1926 Mackenzie King: 25 Sep 1926 1930s: Bennett: 07 Aug 1930 Mackenzie King: 23 Oct 1935 1940s: St. Laurent: 15 Nov 1948 1950s: Diefenbaker: 21 Jun 1957 1960s: Pearson: 22 Apr 1963 Trudeau: 20 Apr 1968 1970s: Clark: 04 Jun 1979 1980s: Trudeau: 03 Mar 1980 Turner: 30 Jun 1984 Mulroney: 17 Sep 1984 1990s: Campbell: 25 Jun 1993 Chrétien: 04 Nov 1993 2000s: Martin: 12 Dec 2003 Harper: 06 Feb 2006
Result selection
Finally, we can ask GroupBy() to return a single object for each group, rather than the entire group. For example, suppose we want a count of the number of terms that started in each decade, together with the earliest term in each decade. We can do that as follows:
var pmList39 = terms .OrderBy(term => term.start) .GroupBy(term => term.start.Year / 10, (year, termGroup) => new { decade = year * 10, number = termGroup.Count(), earliest = termGroup.Min(term => term.start) }); Console.WriteLine("*** pmList39"); foreach (var term in pmList39) { Console.WriteLine("{0}s:\n {1} terms\n Earliest: {2: dd MMM yyyy}", term.decade, term.number, term.earliest); }
In this case, the second argument in GroupBy() is a function which takes two parameters. The first parameter is the key for a given group, and the second parameter is the group itself. We can use this information to construct a summary object for that group. In this example, we create an anonymous object with 3 fields: the decade (calculated from the key ‘year’), the number of terms in that decade (by applying the Count() method to the group), and the earliest term (by applying the Min() method and passing it the start date).
This version of GroupBy() produces a list of single objects rather than a list of groups, so only a single loop is needed to iterate through it. The results are:
1860s: 1 terms Earliest: 01 Jul 1867 1870s: 2 terms Earliest: 07 Nov 1873 1890s: 5 terms Earliest: 16 Jun 1891 1910s: 1 terms Earliest: 10 Oct 1911 1920s: 4 terms Earliest: 10 Jul 1920 1930s: 2 terms Earliest: 07 Aug 1930 1940s: 1 terms Earliest: 15 Nov 1948 1950s: 1 terms Earliest: 21 Jun 1957 1960s: 2 terms Earliest: 22 Apr 1963 1970s: 1 terms Earliest: 04 Jun 1979 1980s: 3 terms Earliest: 03 Mar 1980 1990s: 2 terms Earliest: 25 Jun 1993 2000s: 2 terms Earliest: 12 Dec 2003
Note the differences between these calls to GroupBy(). The first argument is always the key to be used in the grouping. If the second argument is an IEqualityComparer object, it is used to compare keys. If this argument is a function with a single parameter, it is used to select fields from each object placed in the group. Finally, if the argument is a function with two parameters, it is used to produce a summary object for each group.
These 3 features can be used in any combination (which is why there are 8 prototypes for GroupBy(). Whichever features you want to include, remember that they are placed in the order source.GroupBy(keySelector, elementSelector, resultSelector, equalityComparer).