Tag Archives: IEqualityComparer

LINQ: ToDictionary

Up till now, we’ve considered only the deferred standard query operators, which are not evaluated until their result is actually enumerated by, for example, running through the result in a foreach loop.

LINQ also has a number of non-deferred operators, which are evaluated at the point where they are called. The first of these we’ll look at is  ToDictionary.

C# has a built in Dictionary data type, which is an implementation of a hash table. A hash table is essentially a glorified array, with the main difference being that any data type can be used as the array index or key. For example, if we wanted to store our list of Canadian prime ministers in a dictionary, we could use the integer ID we’ve assigned each prime minister as the key, or we could use the person’s last name, or even define some other data type from the components of a PrimeMinisters object. The one essential property is that each key must be unique, so that only one prime minister is stored for each key.

LINQ allows a dictionary to be constructed from an IEnumerable<T> source, where T is the data type of the objects in the input sequence. The simplest version of ToDictionary allows only the key to be defined for each element in the input sequence. An example is

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmDictionary01 = primeMinisters.ToDictionary(k => k.id);
      Console.WriteLine("----->pmDictionary01");
      foreach (int key in pmDictionary01.Keys)
      {
        Console.WriteLine("Prime minister with ID {0}: {1} {2}",
          key, pmDictionary01[key].firstName, pmDictionary01[key].lastName);
      }

ToDictionary() here takes a single argument, which is a lambda expression defining the key. The variable k is an element from the input sequence, and we’ve selected the ‘id’ field from that element to use as the key.

Once the dictionary is built, we use a foreach loop to run through the list by selecting each key from the Keys property of the dictionary. We use array-like notation (square brackets) to reference an element in the dictionary. Each element in the dictionary is an object of type PrimeMinsters.

The output is:

----->pmDictionary01
Prime minister with ID 1: John Macdonald
Prime minister with ID 2: Alexander Mackenzie
Prime minister with ID 3: John Abbott
Prime minister with ID 4: John Thompson
Prime minister with ID 5: Mackenzie Bowell
Prime minister with ID 6: Charles Tupper
Prime minister with ID 7: Wilfrid Laurier
Prime minister with ID 8: Robert Borden
Prime minister with ID 9: Arthur Meighen
Prime minister with ID 10: William Mackenzie King
Prime minister with ID 11: Richard Bennett
Prime minister with ID 12: Louis St. Laurent
Prime minister with ID 13: John Diefenbaker
Prime minister with ID 14: Lester Pearson
Prime minister with ID 15: Pierre Trudeau
Prime minister with ID 16: Joe Clark
Prime minister with ID 17: John Turner
Prime minister with ID 18: Brian Mulroney
Prime minister with ID 19: Kim Campbell
Prime minister with ID 20: Jean Chrétien
Prime minister with ID 21: Paul Martin
Prime minister with ID 22: Stephen Harper

There are three more variants of ToDictionary, each offering a bit more flexibility than the basic version.

A second type allows the specification of a comparer class which can be used for defining the equality of objects used as keys. In the previous example, the default definition of equality was used; since the keys were ints, two keys were equal if they had the same numerical value.

However, it is possible to define keys to be equal based on any criterion we like. For example, if we stored the ID of each prime minister as a string instead of an int, then we could define two keys to be equal if their strings parsed to the same numerical value. This would allow the strings 12 and 00012 to be equal as keys, since the leading zeroes don’t change the numerical value.

To use this feature, we must first define a comparer class, in much the same way as we did when comparing the terms of office. The comparer class here is

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace LinqObjects01
{
  class IdKeyEqualityComparer : IEqualityComparer<string>
  {
    public bool Equals(string x, string y)
    {
      return Int32.Parse(x) == Int32.Parse(y);
    }

    public int GetHashCode(string obj)
    {
      return (Int32.Parse(obj)).GetHashCode();
    }
  }
}

Remember that we need to implement IEqualityComparer<string> and provide an Equals() and GetHashCode() method. In Equals() we parse the two strings and define equality to be true if their numerical values are equal. GetHashCode() must return the same code for two objects that are considered equal, so we call GetHashCode() on the parsed int.

With this class in hand, we can use it in the second form of ToDictionary:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmDictionary02 = primeMinisters.ToDictionary(k => k.id.ToString(),
        new IdKeyEqualityComparer());
      Console.WriteLine("----->pmDictionary02");
      foreach (string key in pmDictionary02.Keys)
      {
        string zeroKey = "000" + key;
        Console.WriteLine("Prime minister with ID {0}: {1} {2}",
          key, pmDictionary02[zeroKey].firstName, pmDictionary02[zeroKey].lastName);
      }

This time, we store the key as a string and pass an IdKeyEqualityComparer as the second parameter to ToDictionary. When we print out the results, we create a different string by prepending three zeroes onto the key in the dictionary, then use that zeroKey as the key when looking up entries in the dictionary. The dictionary uses its comparer object to compare zeroKey to the valid keys in the dictionary, and if a match is found, the corresponding object is returned. The output from this code is the same as that above.

If no match is found an exception is thrown, as you might expect, so be careful to ensure that all keys used to access the dictionary are valid.

The third variant of ToDictionary allows us to create our own data type from the sequence element being processed and store that new data type in the dictionary. For example, suppose we wanted to store the string representation of each prime minister in the dictionary instead of the original PrimeMinisters object. We can do that using the following code.

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmDictionary03 = primeMinisters.ToDictionary(k => k.id,
        k => k.ToString());
      Console.WriteLine("----->pmDictionary03");
      foreach (int key in pmDictionary03.Keys)
      {
        Console.WriteLine(pmDictionary03[key]);
      }

The first argument to ToDictionary specifies the key as usual (we’ve gone back to using the int version of the key). The second parameter calls the ToString() method to produce a string which is stored in the dictionary. When we list the elements in the dictionary, we print out the entry directly, since it’s a string and not a compound object.

This time the output is:

----->pmDictionary03
1. John Macdonald (Conservative)
2. Alexander Mackenzie (Liberal)
3. John Abbott (Conservative)
4. John Thompson (Conservative)
5. Mackenzie Bowell (Conservative)
6. Charles Tupper (Conservative)
7. Wilfrid Laurier (Liberal)
8. Robert Borden (Conservative)
9. Arthur Meighen (Conservative)
10. William Mackenzie King (Liberal)
11. Richard Bennett (Conservative)
12. Louis St. Laurent (Liberal)
13. John Diefenbaker (Conservative)
14. Lester Pearson (Liberal)
15. Pierre Trudeau (Liberal)
16. Joe Clark (Conservative)
17. John Turner (Liberal)
18. Brian Mulroney (Conservative)
19. Kim Campbell (Conservative)
20. Jean Chrétien (Liberal)
21. Paul Martin (Liberal)
22. Stephen Harper (Conservative)

A final version of ToDictionary combines the last two versions, so we can provide both a key comparer and a custom data type. For example, if we wanted to store keys as strings and store the string version of each PrimeMinisters object, we could write:

      PrimeMinisters[] primeMinisters = PrimeMinisters.GetPrimeMinistersArray();
      var pmDictionary04 = primeMinisters.ToDictionary(k => k.id.ToString(),
        k => k.ToString(), new IdKeyEqualityComparer());
      Console.WriteLine("----->pmDictionary04");
      foreach (string key in pmDictionary04.Keys)
      {
        string zeroKey = "000" + key;
        Console.WriteLine(pmDictionary04[zeroKey]);
      }

The output from this is the same as from pmDictionary03.

Advertisements

LINQ Groups: Equality testing and result selection

In the last post we saw how to use LINQ GroupBy() for relatively simple grouping. GroupBy() is capable of a couple of more advanced features which are worth looking at.

Custom equality tests

First, we saw before that the key used by GroupBy() to do the grouping could be calculated from the data fields in the objects in the sequence being grouped, rather than being just one of the bare data fields itself. For simple cases, it’s easiest to just place this calculation directly in the call to GroupBy() as we did earlier. However, sometimes the grouping key gets a bit more complex. LINQ allows us to define our own equality test for use in determining how keys are compared. As an example, suppose we wanted to group the terms of office of Canada’s prime ministers according to how many years each of these terms spanned. That is, we’d like all terms less than a year in one group, then those between 1 and 2 years and so on. Since a Terms object contains only the start and end dates of the term as DateTime objects, we need to calculate the difference to get a TimeSpan object and then declare that two such objects that lie within the same span of years are ‘equal’.

In order to create an equality test, we need to write a custom class that implements the IEqualityComparer<T> interface, where T is the data type being compared. This interface has two methods, Equals(T, T) and GetHashCode(T). The Equals() method returns a bool which is true if its two arguments are defined as equal and false if not. The GetHashCode() method is needed since grouping is done by storing sequence elements in a hash table, so we need to make sure that the hash codes for two elements that are defined as ‘equal’ are the same.

For our example here, we can use the following class:

  class TermEqualityComparer : IEqualityComparer<TimeSpan>
  {
    public bool Equals(TimeSpan x, TimeSpan y)
    {
      return x.Days / 365 == y.Days / 365;
    }

    public int GetHashCode(TimeSpan obj)
    {
      return (obj.Days / 365).GetHashCode();
    }
  }

Our equality test divides the number of days in each TimeSpan object by 365 (OK, we’re ignoring leap years) using integer division. If the two TimeSpans are equal in this measure then they represent terms that lie in the same one-year span.

For the hash code, we just use the same division and return the built-in hash code for the quotient. This ensures that all TimeSpans within the same year get the same hash code.

With this class, we can now write a GroupBy() call that does what we want:

      TermEqualityComparer termEqualityComparer = new TermEqualityComparer();
      var pmList37 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.end - pmTerm.start, termEqualityComparer)
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList37)
      {
        int years = pmGroup.Key.Days / 365;
        Console.WriteLine("{0} to {1} years:", years, years + 1);
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
            pmTerm.first, pmTerm.last, pmTerm.start, pmTerm.end);
        }
      }

We declare a TermEqualityComparer object first. The LINQ code is much the same as in our earlier example in the last post, up to the GroupBy() call. This time it has two arguments. The first is the quantity to be used as the key, as usual, which in this case is the difference between the start and end of the term. The second argument is the equality testing object, so GroupBy() will pass the first argument to the Equals() method in the equality tester for each sequence element and use that test to sort the elements into groups.

You might wonder about the last OrderBy() call, which sorts the groups based on their keys. The actual TimeSpans for each element within a group may all be different, but according to our equality test, all TimeSpans within a single group are ‘equal’, so it doesn’t matter which one is used in the OrderBy().

Where the actual values of the keys does matter though is when we try to use their value in some other calculation. In our example, we want to print out the groups of terms, with each labelled by its key. However, if there is more than one element in a group, the TimeSpan for each element will probably be different, and since only one key is saved for each group, we can’t be sure which element in the group has that key (in fact, it seems to be the first element assigned to the group that has its key used for the group). Thus it’s usually best to use keys only in the same way that the original GroupBy() call did. In our example, we divide pmGroup.Key.Days by 365 to get the year span represented by that key, since we know that value does apply to all elements within that group.

The result of the code is:

0 to 1 years:
  Charles Tupper: 01 May 1896 to 08 Jul 1896
  Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
  Joe Clark: 04 Jun 1979 to 02 Mar 1980
  John Turner: 30 Jun 1984 to 16 Sep 1984
  Kim Campbell: 25 Jun 1993 to 03 Nov 1993
1 to 2 years:
  John Abbott: 16 Jun 1891 to 24 Nov 1892
  Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
  Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
2 to 3 years:
  John Thompson: 05 Dec 1892 to 12 Dec 1894
  Paul Martin: 12 Dec 2003 to 05 Feb 2006
3 to 4 years:
  William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
4 to 5 years:
  Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
  William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
  Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
5 to 6 years:
  Richard Bennett: 07 Aug 1930 to 23 Oct 1935
  John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
  Lester Pearson: 22 Apr 1963 to 20 Apr 1968
6 to 7 years:
  John Macdonald: 01 Jul 1867 to 05 Nov 1873
  Stephen Harper: 06 Feb 2006 to 25 May 2012
8 to 9 years:
  Robert Borden: 10 Oct 1911 to 10 Jul 1920
  Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
  Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
10 to 11 years:
  Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
11 to 12 years:
  Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
12 to 13 years:
  John Macdonald: 17 Oct 1878 to 06 Jun 1891
13 to 14 years:
  William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
15 to 16 years:
  Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911

Custom return types

A GroupBy() call also allows you to customize which data fields should be returned, in much the same way as Join() did. For example, if we want to group the terms into the decades in which they started (as we did in the last post), we can have GroupBy() return only the last name and start date for each term. The code is:

      var pmList38 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        })
        .OrderBy(pmTerm => pmTerm.start)
        .GroupBy(pmTerm => pmTerm.start.Year / 10,
          pmTerm => new
          {
            last = pmTerm.last,
            start = pmTerm.start
          })
        .OrderBy(pmGroup => pmGroup.Key);
      foreach (var pmGroup in pmList38)
      {
        Console.WriteLine("{0}s:", (pmGroup.Key * 10));
        foreach (var pmTerm in pmGroup)
        {
          Console.WriteLine("  {0}: {1:dd MMM yyyy}",
            pmTerm.last, pmTerm.start);
        }
      }

In this case, the second argument of GroupBy() is a function that takes a single parameter (pmTerm here) which is used to construct the returned object to be placed in the group. Here, each object in a group will be an anonymous type with two fields: last and start. We use these two fields in the printout, and we get:

1860s:
  Macdonald: 01 Jul 1867
1870s:
  Mackenzie: 07 Nov 1873
  Macdonald: 17 Oct 1878
1890s:
  Abbott: 16 Jun 1891
  Thompson: 05 Dec 1892
  Bowell: 21 Dec 1894
  Tupper: 01 May 1896
  Laurier: 11 Jul 1896
1910s:
  Borden: 10 Oct 1911
1920s:
  Meighen: 10 Jul 1920
  Mackenzie King: 29 Dec 1921
  Meighen: 29 Jun 1926
  Mackenzie King: 25 Sep 1926
1930s:
  Bennett: 07 Aug 1930
  Mackenzie King: 23 Oct 1935
1940s:
  St. Laurent: 15 Nov 1948
1950s:
  Diefenbaker: 21 Jun 1957
1960s:
  Pearson: 22 Apr 1963
  Trudeau: 20 Apr 1968
1970s:
  Clark: 04 Jun 1979
1980s:
  Trudeau: 03 Mar 1980
  Turner: 30 Jun 1984
  Mulroney: 17 Sep 1984
1990s:
  Campbell: 25 Jun 1993
  Chrétien: 04 Nov 1993
2000s:
  Martin: 12 Dec 2003
  Harper: 06 Feb 2006

Result selection

Finally, we can ask GroupBy() to return a single object for each group, rather than the entire group. For example, suppose we want a count of the number of terms that started in each decade, together with the earliest term in each decade. We can do that as follows:

      var pmList39 = terms
        .OrderBy(term => term.start)
        .GroupBy(term => term.start.Year / 10,
          (year, termGroup) => new
          {
            decade = year * 10,
            number = termGroup.Count(),
            earliest = termGroup.Min(term => term.start)
          });
      Console.WriteLine("*** pmList39");
      foreach (var term in pmList39)
      {
        Console.WriteLine("{0}s:\n  {1} terms\n  Earliest: {2: dd MMM yyyy}",
          term.decade, term.number, term.earliest);
      }

In this case, the second argument in GroupBy() is a function which takes two parameters. The first parameter is the key for a given group, and the second parameter is the group itself. We can use this information to construct a summary object for that group. In this example, we create an anonymous object with 3 fields: the decade (calculated from the key ‘year’), the number of terms in that decade (by applying the Count() method to the group), and the earliest term (by applying the Min() method and passing it the start date).

This version of GroupBy() produces a list of single objects rather than a list of groups, so only a single loop is needed to iterate through it. The results are:

1860s:
  1 terms
  Earliest:  01 Jul 1867
1870s:
  2 terms
  Earliest:  07 Nov 1873
1890s:
  5 terms
  Earliest:  16 Jun 1891
1910s:
  1 terms
  Earliest:  10 Oct 1911
1920s:
  4 terms
  Earliest:  10 Jul 1920
1930s:
  2 terms
  Earliest:  07 Aug 1930
1940s:
  1 terms
  Earliest:  15 Nov 1948
1950s:
  1 terms
  Earliest:  21 Jun 1957
1960s:
  2 terms
  Earliest:  22 Apr 1963
1970s:
  1 terms
  Earliest:  04 Jun 1979
1980s:
  3 terms
  Earliest:  03 Mar 1980
1990s:
  2 terms
  Earliest:  25 Jun 1993
2000s:
  2 terms
  Earliest:  12 Dec 2003

Note the differences between these calls to GroupBy(). The first argument is always the key to be used in the grouping. If the second argument is an IEqualityComparer object, it is used to compare keys. If this argument is a function with a single parameter, it is used to select fields from each object placed in the group. Finally, if the argument is a function with two parameters, it is used to produce a summary object for each group.

These 3 features can be used in any combination (which is why there are 8 prototypes for GroupBy(). Whichever features you want to include, remember that they are placed in the order source.GroupBy(keySelector, elementSelector, resultSelector, equalityComparer).