Tag Archives: LINQ join

LINQ Group Joins

In the last post, we saw how to use the Join() operator in LINQ. Recall that Join() takes two input sequences and looks for matches between elements in the two sequences based on the equality of a key field that is present in both sequences. The ordinary Join() will produce a separate output element for each matched pair. Thus in our example with Canada’s prime ministers where we used a join to get a list of prime ministers matched with their terms of office, those PMs who had more than one term were represented in the Join() by as many elements as they had terms of office.

Sometimes it’s useful to produce a single list of elements for each value of the key in the first (outer) input sequence. For example, we might want a list of all terms of office attached to each prime minister. This sort of query isn’t normal for a database (there’s no equivalent in SQL for example), but LINQ provides the GroupJoin() which does just that.

To see how it works, suppose we wanted to produce a list of prime ministers where we calculate the total time served for each person. For PMs serving multiple terms, we need to add up the time spans of all the terms for that person. We can do that with the following code:

      var pmList30 = primeMinisters
        .GroupJoin(terms,
          pm => pm.id,
          term => term.id,
          (pm, pmTerms) => new
          {
            first = pm.firstName,
            last = pm.lastName,
            totalDays = pmTerms.Sum(time => (time.end - time.start).Days)
          });
      foreach (var pm in pmList30)
      {
        Console.WriteLine("{0} {1} served a total of {2} days.",
          pm.first, pm.last, pm.totalDays);
      }

Here, primeMinisters is the outer sequence and terms is the inner sequence. As with the regular Join(), the next two arguments specify the keys on which the match should be done, so here we match pm.id with term.id as before. Where a GroupJoin() differs from a Join() is in the last argument. In the Join(), this last argument was a function that took the two terms in a matched pair as arguments and produced some data object calculated from these two objects as output.

In the GroupJoin(), the last argument is a function whose first argument is the element from the outer sequence as before, but this time the second argument is a sequence of elements from the inner sequence that match this outer element. Thus pmTerms contains a list of all terms matching that particular pm.

In this example, we create an anonymous object containing the first and last names of the PM, and then use LINQ’s Sum() operator (which we’ll consider in more detail later – for now, take note that this is a non-deferred operator, so it gets calculated right away) to add up all the TimeSpan objects that we get by calculating the difference between end and start times for a given term of office. The output is:

John Macdonald served a total of 6934 days.
Alexander Mackenzie served a total of 1796 days.
John Abbott served a total of 527 days.
John Thompson served a total of 737 days.
Mackenzie Bowell served a total of 493 days.
Charles Tupper served a total of 68 days.
Wilfrid Laurier served a total of 5564 days.
Robert Borden served a total of 3196 days.
Arthur Meighen served a total of 625 days.
William Mackenzie King served a total of 7826 days.
Richard Bennett served a total of 1903 days.
Louis St. Laurent served a total of 3140 days.
John Diefenbaker served a total of 2131 days.
Lester Pearson served a total of 1825 days.
Pierre Trudeau served a total of 5640 days.
Joe Clark served a total of 272 days.
John Turner served a total of 78 days.
Brian Mulroney served a total of 3202 days.
Kim Campbell served a total of 131 days.
Jean Chrétien served a total of 3689 days.
Paul Martin served a total of 786 days.
Stephen Harper served a total of 2298 days.

A group join can also be written as a query expression. The above code looks like this:

      var pmList31 = from pm in primeMinisters
                     join term in terms on pm.id equals term.id into pmTerms
                     select new
                     {
                       first = pm.firstName,
                       last = pm.lastName,
                       totalDays = pmTerms.Sum(time => (time.end - time.start).Days)
                     };
      foreach (var pm in pmList31)
      {
        Console.WriteLine("{0} {1} served a total of {2} days.",
          pm.first, pm.last, pm.totalDays);
      }

The feature that makes this a group join is the addition of ‘into pmTerms’ at the end of the join clause. This defines pmTerms as the sequence of term objects that matches a given pm.id.

We can also save the entire sequence produced by each match and iterate over it in the usual way. For example, we can produce an annotated list of terms served by each PM as follows.

      var pmList32 = from pm in primeMinisters
                     join term in terms on pm.id equals term.id into pmTerms
                     select new
                     {
                       first = pm.firstName,
                       last = pm.lastName,
                       termsList = pmTerms
                     };
      foreach (var pm in pmList32)
      {
        Console.WriteLine("{0} {1}:", pm.first, pm.last);
        foreach (var term in pm.termsList)
        {
          Console.WriteLine("  {0:dd MMM yyyy} to {1:dd MMM yyyy}", term.start, term.end);
        }
      }

We do the same query as before, but this time we save the entire pmTerms list in the termsList field of the output object. We then use a nested foreach loop to print out the list of terms for each PM:

John Macdonald:
  01 Jul 1867 to 05 Nov 1873
  17 Oct 1878 to 06 Jun 1891
Alexander Mackenzie:
  07 Nov 1873 to 08 Oct 1878
John Abbott:
  16 Jun 1891 to 24 Nov 1892
John Thompson:
  05 Dec 1892 to 12 Dec 1894
Mackenzie Bowell:
  21 Dec 1894 to 27 Apr 1896
Charles Tupper:
  01 May 1896 to 08 Jul 1896
Wilfrid Laurier:
  11 Jul 1896 to 06 Oct 1911
Robert Borden:
  10 Oct 1911 to 10 Jul 1920
Arthur Meighen:
  10 Jul 1920 to 29 Dec 1921
  29 Jun 1926 to 25 Sep 1926
William Mackenzie King:
  29 Dec 1921 to 28 Jun 1926
  25 Sep 1926 to 07 Aug 1930
  23 Oct 1935 to 15 Nov 1948
Richard Bennett:
  07 Aug 1930 to 23 Oct 1935
Louis St. Laurent:
  15 Nov 1948 to 21 Jun 1957
John Diefenbaker:
  21 Jun 1957 to 22 Apr 1963
Lester Pearson:
  22 Apr 1963 to 20 Apr 1968
Pierre Trudeau:
  20 Apr 1968 to 03 Jun 1979
  03 Mar 1980 to 29 Jun 1984
Joe Clark:
  04 Jun 1979 to 02 Mar 1980
John Turner:
  30 Jun 1984 to 16 Sep 1984
Brian Mulroney:
  17 Sep 1984 to 24 Jun 1993
Kim Campbell:
  25 Jun 1993 to 03 Nov 1993
Jean Chrétien:
  04 Nov 1993 to 11 Dec 2003
Paul Martin:
  12 Dec 2003 to 05 Feb 2006
Stephen Harper:
  06 Feb 2006 to 23 May 2012
Advertisements

LINQ Joins

When constructing a database, we often split up the information to be stored into several linked tables in order to avoid duplicating data. Using our Canadian prime minsters as an example, we store the name and political party of each prime minister in one list, and the terms of office in another. By providing each prime minister in the first list with a unique identifier, or key, we can use that key in the list of terms of office so that each term is linked to the prime minister that was in power at the time. That way, we can store as much information as we like about the prime minister without having to duplicate this information in the terms list.

However, we often need information from two or more lists in our output. Since the only link between a prime minister and a term of office is the id number of the prime minister, if we want to print out a list of prime ministers’ names with their corresponding terms of office we need to look up each id number in the first list to get the name to attach to each term of office.

This is such a common operation that most languages (such as SQL) that deal with data have commands for doing it. LINQ is no exception, and in fact the syntax is very similar to SQL: we need to look at the Join() operator.

The most basic type of Join() takes two input sequences called the outer and inner sequences. These sequences must each have a data field, or key, which is of the same data type and can be compared. A Join() will search for all possible matches between the two sequences by comparing these key values and return a list of objects derived from these matching pairs.

All this is a bit hard to understand in the abstract, so we’ll give an example. Suppose we want to build the list described above: from the list of prime ministers and the list of terms of office, we want a list of prime ministers’ names and their terms of office. The code for doing this is

      var pmList27 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        });
      foreach (var pm in pmList27)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

We see that Join() takes 4 arguments in addition to the object that calls it. The calling object (primeMinisters here) is the outer sequence, and the first argument (‘terms’ here) is the inner sequence. The second argument is a function which takes as input an element from the outer sequence (so here it takes ‘pm’ which is a PrimeMinisters object) and calculates the key which is to be used in the comparison. Thus here we use the id field of PrimeMinisters as the key.

The third argument does the same for the inner sequence, so here we’re using the id field of a Terms object as the comparison. The Join() will thus look for all pairs of (PrimeMinisters, Terms) objects where the id keys from the two are equal. This sort of join is called an equijoin since it uses an equality test on the keys (as opposed, say, to an inequality such as ‘greater than’).

The last argument to Join() specifies what is to be returned for each matching pair. This is determined by a function which takes two arguments, the first of which is the element from the outer sequence and the second of which is the element from the inner sequence. Note that you must put these arguments in the right order! We then construct, in this case, an anonymous type containing the first and last names of the prime minister from the pm object and the dates of office from the term object. The overall effect is therefore to produce a list of names and dates for all terms of office. The output is

John Macdonald: 01 Jul 1867 to 05 Nov 1873
John Macdonald: 17 Oct 1878 to 06 Jun 1891
Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
John Abbott: 16 Jun 1891 to 24 Nov 1892
John Thompson: 05 Dec 1892 to 12 Dec 1894
Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
Charles Tupper: 01 May 1896 to 08 Jul 1896
Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911
Robert Borden: 10 Oct 1911 to 10 Jul 1920
Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
Richard Bennett: 07 Aug 1930 to 23 Oct 1935
Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
Lester Pearson: 22 Apr 1963 to 20 Apr 1968
Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
Joe Clark: 04 Jun 1979 to 02 Mar 1980
John Turner: 30 Jun 1984 to 16 Sep 1984
Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
Kim Campbell: 25 Jun 1993 to 03 Nov 1993
Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
Paul Martin: 12 Dec 2003 to 05 Feb 2006
Stephen Harper: 06 Feb 2006 to 23 May 2012

I haven’t been able to find a definitive answer to the question of what order a Join() produces its results. In this case, it appears that the outer sequence is the dominant one, in that the prime ministers are listed in the order they appear in that sequence, with the terms attached to each person. However, if we swap the order in the Join() as in the following, we get the same output:

     var pmList29 = terms
      .Join(primeMinisters, term => term.id, pm => pm.id,
      (term, pm) => new
      {
        first = pm.firstName,
        last = pm.lastName,
        start = term.start,
        end = term.end
      });
      foreach (var pm in pmList29)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

If the outer sequence were the dominant one, we would expect the results to be ordered by the term of office (so that, for example, Mackenzie’s term would come between the two Macdonald terms), but this doesn’t seem to be the case. The moral of the story is that if the order of the output is important to you, you should use an OrderBy() clause to make sure it’s what you want.

A join can also be written as a query expression. The first query above becomes

      var pmList28 = from pm in primeMinisters
                     join term in terms on pm.id equals term.id
                     select new
                      {
                        first = pm.firstName,
                        last = pm.lastName,
                        start = term.start,
                        end = term.end
                      };
      foreach (var pm in pmList28)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

The outer sequence is specified in the ‘from’ clause, while the inner sequence is given at the start of the ‘join’ clause (‘term in terms’ here). After the inner sequence is specified, the keyword ‘on’ introduces the equality test that is to be done. The two keys that are to be compared are given, and separated by the keyword ‘equals’. Note that you need to use this special keyword in a join and not the usual == operator.

A join in a query expression doesn’t have the returned value built in, as it does in standard operator notation. Rather, we use the usual ‘select’ clause to specify what should be returned for each matching pair.

Finally, it’s worth noting that the type of join we did here is an inner join, which means that an element from one sequence appears in the results only if it can be matched with an element from the other sequence. In our case, that’s fine, since all elements in both sequences have matches in the other, but in some cases we’d like to know if there is an element in one sequence that has no match in the other. To do that we need to do an outer join, but that’s a topic for a future post.