LINQ Joins

When constructing a database, we often split up the information to be stored into several linked tables in order to avoid duplicating data. Using our Canadian prime minsters as an example, we store the name and political party of each prime minister in one list, and the terms of office in another. By providing each prime minister in the first list with a unique identifier, or key, we can use that key in the list of terms of office so that each term is linked to the prime minister that was in power at the time. That way, we can store as much information as we like about the prime minister without having to duplicate this information in the terms list.

However, we often need information from two or more lists in our output. Since the only link between a prime minister and a term of office is the id number of the prime minister, if we want to print out a list of prime ministers’ names with their corresponding terms of office we need to look up each id number in the first list to get the name to attach to each term of office.

This is such a common operation that most languages (such as SQL) that deal with data have commands for doing it. LINQ is no exception, and in fact the syntax is very similar to SQL: we need to look at the Join() operator.

The most basic type of Join() takes two input sequences called the outer and inner sequences. These sequences must each have a data field, or key, which is of the same data type and can be compared. A Join() will search for all possible matches between the two sequences by comparing these key values and return a list of objects derived from these matching pairs.

All this is a bit hard to understand in the abstract, so we’ll give an example. Suppose we want to build the list described above: from the list of prime ministers and the list of terms of office, we want a list of prime ministers’ names and their terms of office. The code for doing this is

      var pmList27 = primeMinisters
        .Join(terms, pm => pm.id, term => term.id,
        (pm, term) => new
        {
          first = pm.firstName,
          last = pm.lastName,
          start = term.start,
          end = term.end
        });
      foreach (var pm in pmList27)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

We see that Join() takes 4 arguments in addition to the object that calls it. The calling object (primeMinisters here) is the outer sequence, and the first argument (‘terms’ here) is the inner sequence. The second argument is a function which takes as input an element from the outer sequence (so here it takes ‘pm’ which is a PrimeMinisters object) and calculates the key which is to be used in the comparison. Thus here we use the id field of PrimeMinisters as the key.

The third argument does the same for the inner sequence, so here we’re using the id field of a Terms object as the comparison. The Join() will thus look for all pairs of (PrimeMinisters, Terms) objects where the id keys from the two are equal. This sort of join is called an equijoin since it uses an equality test on the keys (as opposed, say, to an inequality such as ‘greater than’).

The last argument to Join() specifies what is to be returned for each matching pair. This is determined by a function which takes two arguments, the first of which is the element from the outer sequence and the second of which is the element from the inner sequence. Note that you must put these arguments in the right order! We then construct, in this case, an anonymous type containing the first and last names of the prime minister from the pm object and the dates of office from the term object. The overall effect is therefore to produce a list of names and dates for all terms of office. The output is

John Macdonald: 01 Jul 1867 to 05 Nov 1873
John Macdonald: 17 Oct 1878 to 06 Jun 1891
Alexander Mackenzie: 07 Nov 1873 to 08 Oct 1878
John Abbott: 16 Jun 1891 to 24 Nov 1892
John Thompson: 05 Dec 1892 to 12 Dec 1894
Mackenzie Bowell: 21 Dec 1894 to 27 Apr 1896
Charles Tupper: 01 May 1896 to 08 Jul 1896
Wilfrid Laurier: 11 Jul 1896 to 06 Oct 1911
Robert Borden: 10 Oct 1911 to 10 Jul 1920
Arthur Meighen: 10 Jul 1920 to 29 Dec 1921
Arthur Meighen: 29 Jun 1926 to 25 Sep 1926
William Mackenzie King: 29 Dec 1921 to 28 Jun 1926
William Mackenzie King: 25 Sep 1926 to 07 Aug 1930
William Mackenzie King: 23 Oct 1935 to 15 Nov 1948
Richard Bennett: 07 Aug 1930 to 23 Oct 1935
Louis St. Laurent: 15 Nov 1948 to 21 Jun 1957
John Diefenbaker: 21 Jun 1957 to 22 Apr 1963
Lester Pearson: 22 Apr 1963 to 20 Apr 1968
Pierre Trudeau: 20 Apr 1968 to 03 Jun 1979
Pierre Trudeau: 03 Mar 1980 to 29 Jun 1984
Joe Clark: 04 Jun 1979 to 02 Mar 1980
John Turner: 30 Jun 1984 to 16 Sep 1984
Brian Mulroney: 17 Sep 1984 to 24 Jun 1993
Kim Campbell: 25 Jun 1993 to 03 Nov 1993
Jean Chrétien: 04 Nov 1993 to 11 Dec 2003
Paul Martin: 12 Dec 2003 to 05 Feb 2006
Stephen Harper: 06 Feb 2006 to 23 May 2012

I haven’t been able to find a definitive answer to the question of what order a Join() produces its results. In this case, it appears that the outer sequence is the dominant one, in that the prime ministers are listed in the order they appear in that sequence, with the terms attached to each person. However, if we swap the order in the Join() as in the following, we get the same output:

     var pmList29 = terms
      .Join(primeMinisters, term => term.id, pm => pm.id,
      (term, pm) => new
      {
        first = pm.firstName,
        last = pm.lastName,
        start = term.start,
        end = term.end
      });
      foreach (var pm in pmList29)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

If the outer sequence were the dominant one, we would expect the results to be ordered by the term of office (so that, for example, Mackenzie’s term would come between the two Macdonald terms), but this doesn’t seem to be the case. The moral of the story is that if the order of the output is important to you, you should use an OrderBy() clause to make sure it’s what you want.

A join can also be written as a query expression. The first query above becomes

      var pmList28 = from pm in primeMinisters
                     join term in terms on pm.id equals term.id
                     select new
                      {
                        first = pm.firstName,
                        last = pm.lastName,
                        start = term.start,
                        end = term.end
                      };
      foreach (var pm in pmList28)
      {
        Console.WriteLine("{0} {1}: {2:dd MMM yyyy} to {3:dd MMM yyyy}",
          pm.first, pm.last, pm.start, pm.end);
      }

The outer sequence is specified in the ‘from’ clause, while the inner sequence is given at the start of the ‘join’ clause (‘term in terms’ here). After the inner sequence is specified, the keyword ‘on’ introduces the equality test that is to be done. The two keys that are to be compared are given, and separated by the keyword ‘equals’. Note that you need to use this special keyword in a join and not the usual == operator.

A join in a query expression doesn’t have the returned value built in, as it does in standard operator notation. Rather, we use the usual ‘select’ clause to specify what should be returned for each matching pair.

Finally, it’s worth noting that the type of join we did here is an inner join, which means that an element from one sequence appears in the results only if it can be matched with an element from the other sequence. In our case, that’s fine, since all elements in both sequences have matches in the other, but in some cases we’d like to know if there is an element in one sequence that has no match in the other. To do that we need to do an outer join, but that’s a topic for a future post.

Advertisements
Post a comment or leave a trackback: Trackback URL.

Trackbacks

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: