An Extensive Examination of LINQ: Grouping and Joining Data
By Scott Mitchell
| A Multipart Series on LINQ |
|---|
|
This article is one in a series of articles on LINQ, which was introduced with .NET version 3.5.
|
Introduction
As discussed in An Introduction to LINQ, one of the cornerstones of LINQ is the standard query operators, which are a set of extension methods on the
IEnumerable interface added to the .NET Framework version 3.5. The standard
query operators can be applied to any enumeration - any collection of "things." In The Standard Query
Operators installment we looked at some of the more common query operators, such as Where, Select, OrderBy, and others.
Each standard query operator can be classified as a certain type of operator. There are aggregation operators like Count, Sum, and Max;
element operators like First, Last, and ElementAt let you pick out a specific element from a sequence; and the ordering operators
OrderBy and OrderByDescending order the elements of a sequence based on a specified sorting criteria.
Another class of query operators that we've yet to explore are grouping and joining operators. The grouping and joining operators work with two (or more) sequences
and combine them together, much like how a JOIN in SQL combines records from two (or more) tables into a single resultset. Through LINQ's standard query operators
(or via its query syntax), it is possible to perform: nested (or grouped) queries; cross joins, or
the Cartesian product of two sequences; inner joins; and left outer joins.
This article explores the grouping and joining operators available in LINQ with a number of examples in both Visual Basic and C#. As with the previous installments in this article series, the complete code is available for download. Read on to learn more!
Examining the Sample Data
The examples in this article look at how to create LINQ queries that group and join the same set of sample data. If you download the demos (available at the end of this article) you'll find a class file named
FauxNorthwind.cs in the App_Code/CSCode folder. This file defines a namespace (FauxNorthwind)
and three classes within that namespace. The first two, Category and Product, mimic the schema of the Category and
Product tables in the Northwind database. The Category class has CategoryId and CategoryName properties, while the
Product class has the properties ProductId, ProductName, Category,
UnitPrice, and Discontinued.
public class Category
|
The third class, Data, is a static class that defines the "records" in the sample data via two static arrays. There's an array of type Category named
Categories and an array of type Product named Products. The Categories array is populated with six elements, while the
Products array defines 15 products. (These arrays are populated using the collection initializer and object initializer enhancements added to C# 3.0 and discussed
in the Extension Methods, Implicitly Typed Variables, and Object Initializers installment in this article
series.)
A snapshot of the data defined in the Categories and Products arrays follows:
public static class Data
|
Nested Queries
When working with two parent/child-related sequences - categories and products, in this example - you may want to access all of the records in the parent sequence one at a time and, for each parent record, work with its children. Such functionality is possible in LINQ in two ways: by writing a nested query or by using LINQ's
GroupBy operator. Underneath the covers, both approaches are the same. They differ only in the syntax. As we'll see, a nested query is a bit more verbose and
not as readable as using the GroupBy operator.
To understand how a nested query works, let's first look at writing a query that returns the categories. Using the query syntax added to C# 3.0 and Visual Basic 9 we could write:
// C#
|
Now, imagine that for each category we wanted to get back the total number of products. We could accomplish this by using a nested query. A nested query is a query that appears
in the Select portion of the query. Consequently, the nested query is executed once per item returned from the outer query. If that doesn't quite make sense,
perhaps this syntax will clarify the issue:
// C#
|
This new query uses projection to create an anonymous type with two members: CategoryName, which contains the name of the category record; and ProductCount,
which indicates the total number of products associated with this particular category. Note that the nested query gets those products whose Category property matches
the current Category object from the outer query. In plain English, for each category in the FauxNorthwind.Data.Categories array a new object
is created. In creating that object the ProductCount is saying, "Get me all of the products that have the same category as the one that is currently
being processed and then count up how many elements are in this nested query's results." The end result is that for each category in the FauxNorthwind.Data.Categories
array a new object with the category's CategoryName property and the total number of products associated with that category is returned.
We could add another nested query, one that returns the entire sequence of related products. To do so simply add another member to the anonymous type like so:
// C#
|
The demos available at the end of this article bind the grouped and joined data to a GridView to illustrate how each query operator works. The above query generates a grid with the following data:
Note that resulting data includes all of the categories regardless of whether they have any associated products. This is because the outer query gets all categories
and then gets the related products (and the number of related products) via a nested query. Consequently, each category in the FauxNorthwind.Data.Categories array
will be returned; it doesn't matter if there are any related products or not.
Grouping Data
While nested queries certainly will get the job done, LINQ includes a more terse and readable syntax for grouping data. The
GroupBy standard query operator
allows the items in query to be grouped by some property. Like many other standard query operators, the GroupBy operator returns a sequence. Specifically, each
element returned by GroupBy implements the interface IGrouping<TKey, TSource>. This interface has a strongly-typed key (of type TKey)
and a sequence of grouped items of type TSource. That description can be a bit confusing. Perhaps an example will help.
Imagine that we want to get all of the products and group them by category so that we end up with a sequence of categories and for each category we have a sequence of its
corresponding products. This could be accomplished using the GroupBy standard query operator like so:
// C#
|
The GroupBy operator uses a lambda expression to indicate what property the source data (the Product objects in FauxNorthwind.Data.Products)
should be grouped by. In this case we group by the Category property. The net result is that the GroupBy operator returns a sequence with
five elements - one element for each of the five different Category objects found in the FauxNorthwind.Data.Products array. Each element has a Key property
(that returns the Category object) along with a sequence of the Product objects that belong to that group.
The statement in the code snippet above takes these results from the GroupBy operator and uses the Select operator to project them into a more
usable format. Namely, the Select operator creates an anonymous type with three properties:
CategoryName- the name of the category. Note that the category information is accessible via theKeyproperty.ProductCount- the total number of elements in the group's sequence. In other words, the number of products associated with this category.Products- the sequence associated with this group. Namely, the products that belong to the category group.
Categories array and then ran a nested query to get the related products. In other words, that query
started by getting all of the categories and then each category's products (if any existed). However, the GroupBy example explicitly gets all products and
then groups them by category. Those categories that have no products are never retrieved.
You can also use the GroupBy operator using the query syntax. The code snippet below shows an equivalent statement to the standard query operator usage above, but
uses the query syntax rather than extension methods.
// C#
|
When using the query syntax you use the Into keyword to name the sequence of groups returned by the GroupBy operator. In C# you can use any name
(such as categoryProducts), but in Visual Basic you must use the keyword Group. Furthermore, in Visual Basic you specify the name of the Key
property directly in the grouping syntax - Group product By key = product.Category Into Group.
Joins
Nested queries and grouping in LINQ is useful, but differs from the manner by which different entities are typically combined in relational databases. In SQL, you can return data from two (or more) tables into a single result set using a join. There are different types of joins in SQL, including inner joins, outer joins, and cross joins. I don't want to belabor how joins work in SQL in this article, as this topic is covered in depth elsewhere. See the SQL Joins tutorial at w3schools.com, or Jeff Atwood's A Visual Explanation of SQL Joins post for more information.
LINQ includes a Join standard query operator that allows for SQL JOIN-like behavior and
syntax (when using LINQ's query syntax). This article shows how to perform four different types of joins in LINQ:
- Group Join,
- Inner Join,
- Left Outer Join, and
- Cross Join
Group Joins
A group join is similar to using the
GroupBy operator, although the syntax and semantics differ slightly. In a nutshell, a group join uses the Join
operator and the Into keyword to group the results of the join. The following query returns all categories and uses a group join to create a group of
related products for each category.
// C#
|
Note that to use a group join in Visual Basic you must use the syntax Group Join instead of join, as is used in C#. In both cases, the results
are grouped via the Into keyword. As in the GroupBy operator example, C# can group the data into a group of any name whereas in Visual Basic you must
use the name Group.
The results of the group join are identical to the nested query's. Namely, the result is a sequence with one element for each category, regardless of whether the category has any associated products. Each element has a "sub-sequence" that contains the products associated with that category (if any). Refer to the first screen shot in this article for a visual representation of these results.
Inner Joins
An inner join queries two sequences. For each pair of elements examined, the pair is returned as output if the pair match on some specified criteria. Otherwise the pair is omitted from the results. Let's look at an example. The following query syntax queries two sequences -
FauxNorthwind.Data.Categories and
FauxNorthwind.Data.Products - joining together the Product and Category objects with matching categories. The Join operator generates a sequence
whose elements have two properties: category and product. This result is projected via the Select operator into an anonymous type
that returns the just the CategoryName property from the category object along with the entire product object. (An equivalent query
using extension methods instead of the query syntax is available in the code download; examine the comments in the InnerJoin.aspx page's code-behind class.)
// C#
|
The above query syntax is quite similar to SQL's JOIN syntax. Here we query "from" the FauxNorthwind.Data.Categories "table" and join with
the FauxNorthwind.Data.Products "table" where the Category objects line up. If it helps, you can envision the above LINQ query as the following SQL statement:
SELECT Categories.CategoryName, Products.*
|
The LINQ query generates a sequence with 15 elements, one for each product in the "database." Each item in the resulting sequence includes the category's name and all fields
from the Product object. The following screen shot shows a GridView that displays the CategoryName property in the first column and the product's
ProductName and UnitPrice properties in the second column.
The result from a LINQ Join differs than those generated by a nested query, the GroupBy operator, and by group joins. The nested query,
GroupBy operator, and group joins return a sequence that contains a "sub-sequence." For example, when using the GroupBy operator to group products
by category, the resulting output was a sequence that contains a Key (the Category) and a sub-sequence of Product objects that
belonged to that category. On the other hand, the Join operator returns a sequence that contains the two elements being queries (each Category
and Product object whose categories align).
Left Outer Joins
An inner join only returns items where the two input sets intersect. For example, in the above screen shot the only categories that are returned are those that have a matching product. Similarly, any products without a category would be omitted. An left outer join returns all records from the first sequence (the "left" one) regardless of whether there is a matching element in the second sequence (the "right" one).
A left outer join is implement in LINQ by using a group join on the "left" table and then using another From operator to get the matching products, if any
exist. The two From operators are needed to return the data in the same format as the inner join (having the category and product properties side-by-side rather
than in a sub-sequence). The following query uses a left outer join to return a record for all categories, regardless of whether there are matching products.
// C#
|
Note that there is both a group join and an additional From operator that queries the resulting group join sub-sequence, which are the products that match
the category. The DefaultIfEmpty() extension method is used here to return null if there are no items in the sub-sequence (i.e., if there are no matching
products for the category).
The resulting output of the above query is a sequence with 16 elements. One element for each product plus one element for the category that has no products associated with it (Meat/Produce). The screen shot below shows this data when bound to a GridView.
Cross Joins
A cross join is the Cartesian product between the two entities performing the join. In other words, the cross join outputs an element for every single possible combination of elements in the two inputs. When doing a cross join on two sequences with sizes A and B, the resulting cross join has A x B results. A cross join in LINQ is accomplished by having two
From operators that specify the two sequences to combine, as the following query syntax illustrates:
// C#
|
The above query generates a resulting sequence with 90 elements and contains all of the category/product combinations, even if the combinations aren't valid. For instance, the resulting sequence includes the pairing of category and product Meat/Produce and Chai, even though Chai product belongs to the Beverages category and the Meat/Produce category has no associated products.
Because so many records are returned by the cross join, a screen shot would be impractical here. To see the output of the cross join download the code available at the end of
this article and run the CrossJoin.aspx demo.
Happy Programming!
Attachments:
Further Reading
GroupBy Operator SyntaxJoin Operator Syntax
| A Multipart Series on LINQ |
|---|
|
This article is one in a series of articles on LINQ, which was introduced with .NET version 3.5.
|



