Converting Flat, Comma-Delimited Values Into a Normalized Data Model
By Scott Mitchell
Introduction
In my job as an independent software developer I help a lot of small businesses enhance their existing company website or internal web applications to
include new features or adopt best practices. Many of these businesses have vital line of business applications that were created many years ago by
an employee who was not a professional software developer, but perhaps a member of the IT team or someone who was learning how to program or programmed
as a hobby. A common mistake made by people without a solid background in creating data-driven applications is using flat, non-normalized data models.
Consider an application used in a healthcare setting may need to record each doctor's professional and educational degrees. Because there are a fixed
number of degrees - PhD, MD, DDS, OB/GYN, RN, etc. - these degrees should be spelled out in a separate database table. And because each doctor can have
multiple degrees, there should be a third table that maps what doctors are associated with what degrees. Such a data model would be normalized. A non-normalized
data model would instead try to capture each doctor's degrees within the same table that contains the doctor's other information (his name, address, DOB, etc.).
This might be implemented as several columns in the table (Degree1, Degree2, Degree3, and so on) or as a single
column that contains a comma-delimited list of degrees, like "PhD, MD, OB/GYN".
While there are certain circumstances where non-normalized data is ideal, in the vast majority of situations having the data expressed in a normalized
manner is ideal. Normalized data is easier to work with, is easier to report against, is (usually) more efficient in terms of both disk space and
time to execute queries, and is less likely to suffer from any data integrity issues, which are all too common in non-normalized data. I recently helped
a client who had a many-to-many relationship implemented in a flat, non-normalized manner convert that data into a normalized data model through the use
of a T-SQL script. This article discusses why it is worhtwhile to convert flat, non-normalized data into a normalized data model and steps through
how this T-SQL script can be used to normalize your data. Read on to learn more!
An Example of Non-Normalized Data
I recently worked with a client that had a legacy custom web application created by a past employee that
was used internally to manage customer and employee information. The application was used primarily to store information about employees and customers
and to facilitate searching for customers and employees. For example, each employee can belong to one or more departments, where departments are things
like IT, Executive, Sales, and so on. From the web page for editing employees, a user could pick an employee, select zero to many departments
from a multi-select list box, and save the results to the database. The search page, which was the most heavily used page on the site, allowed the
visitor to select zero to many departments from a list box to filter the results.
The employee who created the application had long since moved on to another job, but the web application was stable enough that it had continued to be
used for many years without issue... until recently! A new branch office opened up on the West coast and to differentiate departments between the two offices, the
departments were replicated for the new site with the new site's name. In other words, there were now departments like "IT" and "IT (West)," and "Sales"
and "Sales (West)." The idea was that the "Sales" department was the people in sales in the original branch office, whereas the "Sales (West)" department
was the sales team in the West coast office. After these new departments were added the search page no longer worked as expected. Searching for employees
in the "Sales" department returned employees in both the "Sales" and "Sales (West)" departments. The person whose job it was to manage this application
concluded (correctly) that the problem was that "Sales" was a substring found in both department names. In an attempt to fix this she changed the
"Sales" name in the Departments table to "Sales (East)", but after doing so searching for "Sales (East)" returned no employees even though
there were dozens of employees in the previously named "Sales" department. This was when I was called in.
The problem was that the developer who created the application had used two tables to model this relationship: Departments, which contained
a record for each department; and Employees, which contained a record for each employee. There should have been a third table used to
model the relationship between employees and departments (since the relationship is many-to-many), but the developer instead used a flat, comma-delimited
column value in the Employees table named Departments. If an employee belonged to the Executive and Sales
departments their Employees.Departments would contain the value "Executive, Sales". The search page filtered employees
by using a series of LIKE statements for each selected filter criteria. If a user searched for employees in the Sales department it
would run a query with a WHERE clause like: WHERE Employees.Department LIKE '%Sales%'. As a result, such a WHERE clause would
return employees in the Sales department and employees in the "Sales (West)" department.
When the "Sales" department was renamed to "Sales (East)", this didn't rename the corresponding value in the Employees.Departments column.
In other words, if the Employees.Departments column value was "Executive, Sales" when the department was named "Sales", it remained
"Executive, Sales" after the department was renamed to "Sales (East)." But after the department was renamed the search page now used the WHERE clause
WHERE Employees.Department LIKE '%Sales (East)%', which returned no records because those employees in the "Sales (East)" department has the string
"Sales" in their Employees.Departments column (and not "Sales (East)").
The Ideal Approach - A Normalized Data Model
Ideally, a many-to-many relationship between entities A and B involve three tables:
Table A
Table B
The Join Table
The Join Table is what maps elements in A to elements and B and vice-a-versa. It is typically comprised of the primary key(s) of Table A and Table B,
and those joint primary keys in the Join Table form a composite primary key.
In case the above sounds like jibberish, let's look at how the data model should have been constructed in the application described above. First we have
the Employees table (Table A), which models employees; each employee is uniquely identified by the EmployeeID column, which is an integer
IDENTITY column. We also have the Departments table (Table B), which likewise has an IDENTITY column named
DepartmentID. The third and final table we need (which was lacking from my customer's data model and was the source of the problems noted
above) is a table named, say, EmployeeDepartments with the following schema:
EmployeeDepartments
Column
Data Type
Notes
EmployeeID
int
Composite primary key
DepartmentID
int
This is the table that ties together what employees belong to what departments (and, conversely, the list of employees that makeup a praticular department).
Had this data model been in place, none of my client's problems would have surfaced. New departments could be added at any time and the names of departments
could be changed as needed. There would be no naming or substring issues because the employees' departments are specified through unique integer values.
The names of each department are not construed in this relationship.
After explaining this to my client his response was predictable: "Great, now how do we get this data model in place for our existing application."
Altering the web application to use the preferred normalized data model required three significant changes:
Adding a new table, EmployeeDepartments, and converting the existing data in the Employees.Departments column into
this new table.
Updating the web page where employee data was saved so that instead of saving the value as a comma-delimited value in the Employees
table, a record was added to the EmployeeDepartments table for each selected department.
Updating the WHERE clause used in the search page to go against the EmployeeDepartments table rather than to use a
series of LIKE clauses.
This article focuses on the first task - converting the existing flat, comma-delimited data model into a normalized one.
Creating the EmployeeDepartments Table
The first order of business in converting the existing flat data model into a normalized one is to create the many-to-many Join Table.
The schema for this table is shown above. To create this table use the following T-SQL script, which generates the table and adds foreign key
constraints to the two "parent" tables, Employees and Departments:
CREATE TABLE [dbo].[EmployeeDepartments](
[EmployeeID] [int] NOT NULL,
[DepartmentID] [int] NOT NULL,
CONSTRAINT [PK_EmployeeDepartments] PRIMARY KEY CLUSTERED
(
[EmployeeID] ASC,
[DepartmentID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[EmployeeDepartments] WITH CHECK ADD CONSTRAINT [FK_EmployeeDepartments_Departments] FOREIGN KEY([DepartmentID])
REFERENCES [dbo].[Departments] ([DepartmentID])
GO
ALTER TABLE [dbo].[EmployeeDepartments] WITH CHECK ADD CONSTRAINT [FK_EmployeeDepartments_Employees] FOREIGN KEY([EmployeeID])
REFERENCES [dbo].[Employees] ([EmployeeID])
GO
Adding a Record to EmployeeDepartments For Each Department Specified in Each Employee's Departments Column
The final step in this data convertion is to enumerate through each employee in the Employees table, parsing the departments out of
the Departments column, and inserting a record into the EmployeeDepartments table for each department for the current
employee. This requires the use of a CURSOR, which is a T-SQL construct that allows us to work with each record returned by a SELECT
statement individually.
The following T-SQL shows the shell of the CURSOR.
DECLARE @EmployeeID int, @Departments nvarchar(4000), @DeptName nvarchar(50), @DepartmentID int
DECLARE empCursor CURSOR LOCAL FAST_FORWARD FOR
SELECT EmployeeID, Departments
FROM Employees
OPEN empCursor
FETCH NEXT FROM empCursor INTO @EmployeeID, @Departments
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT 'Starting parsing for employee ' + CAST(@EmployeeID as nvarchar(10))
...
FETCH NEXT FROM empCursor INTO @EmployeeID, @Departments
END
CLOSE empCursor
DEALLOCATE empCursor
Note that the CURSOR starts by defining a SELECT statement. In the above script the values in the EmployeeID and Departments
columns are returned for all employees in the Employees table. The CURSOR then FETCHes one record at a time,
storing the values of the EmployeeID and Departments columns in the @EmployeeID and @Departments
variables. Each record is enumerated via the WHILE loop. For more background on CURSORs be sure to read
Cursors: An Overview.
With each employee record we need to parse the comma-delimited list of departments, which is stored in the @Departments variable.
A previous article here on 4Guys, Creating a User Defined Function in SQL for Comma-Delimited Searches,
shows how to parse a comma-delimited list and work with each value in the list as an individual unit using various T-SQL string processing functions.
I replicated the essence of the code in that article to use for this client. In short, I get each department name from the comma-delimited list and
use it to grab the corresponding DepartmentID value in the Departments table. If a corresponding DepartmentID value
is found I add a record to the EmployeeDepartments table using the @EmployeeID and the just-retrieved DepartmentID values.
...
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT 'Starting parsing for employee ' + CAST(@EmployeeID as nvarchar(10))
IF @Department IS NOT NULL AND LEN(@Department) > 0
BEGIN
-- Add a comma to the end of @Department (for parsing reasons); this tip suggested by reader Wissam Bishouty
SET @Department = @Department + ','
-- Pick apart the @Department string
WHILE CHARINDEX(',', @Department) > 0
BEGIN
SET @DeptName = LTRIM(RTRIM(SUBSTRING(@Department, 1, CHARINDEX(',', @Department) - 1)))
-- Does @DeptName exist in Departments?
SET @DepartmentID = (SELECT DepartmentID FROM Departments WHERE DepartmentName = @DeptName)
IF @DepartmentID IS NULL
PRINT ' !!! Employee ' + CAST(@EmployeeID as nvarchar(10)) + ' has a department value not found in Departments: ' + @DeptName
ELSE
BEGIN
PRINT ' Adding department ' + @DeptName
-- Add a record to EmployeeDepartments (if it does not already exist)
IF NOT EXISTS(SELECT 1 FROM EmployeeDepartments WHERE DepartmentID = @DepartmentID AND EmployeeID = @EmployeeID)
INSERT INTO EmployeeDepartments(DepartmentID, EmployeeID)
VALUES(@DepartmentID, @EmployeeID)
ELSE
PRINT ' Record already found in EmployeeDepartments'
END
SET @Department = SUBSTRING(@Department, CHARINDEX(',', @Department) + 1, LEN(@Department))
END
END
ELSE
PRINT ' No departments for the employee!'
FETCH NEXT FROM empCursor INTO @EmployeeID, @Departments
END
...
Executing the Above Script
Let's run the script on a sample database to show its effectiveness. (Of course, before running any script on a production database make sure
you make a backup beforehand!) The Employees table contains six employees, five of which belong to a department. As you can see in the
screen shot below, Sam does not belong to any departments. Scott and Ernie belong to one department each, while Alice and Jisun belong to two, and
Tito belongs to three.
The Departments table defines five unique departments.
Running the above T-SQL script generates the following output in the Messages window:
Starting parsing for employee 1
Adding department IT
Starting parsing for employee 2
Adding department Executive
Adding department IT
Starting parsing for employee 3
Adding department Sales
Adding department Marketing
Starting parsing for employee 4
No departments for the employee!
Starting parsing for employee 5
Adding department Executive
Adding department Sales
Adding department Marketing
Starting parsing for employee 6
Adding department Human Resources
As you can see, employee 1 (Scott) and employee 6 (Ernie) each have one record added to the EmployeeDepartments table, whereas
employee 4 (Sam) has none added. Employee 5 (Tito) has three departments added, and so on.
The EmployeeDepartments table contains the appropriate records as evidenced by the output above and the screen shot below.
An Alternative, CURSOR-less Approach
Alert reader Leonid K. wrote in to share another technique for converting flat, comma-delimited values into a normalized data model
without the need of a CURSOR. The following INSERT statement does the trick:
INSERT EmployeeDepartments(EmployeeID, DepartmentID)
SELECT e.EmployeeID, d.DepartmentID
FROM Employees e
INNER JOIN Departments d ON
',' + e.Departments + ',' LIKE '%,' + d.DepartmentName + ',%'
One issue to keep in mind with Leonid's approach is that it is imperative that the Employees.Departments column have no
spaces before or after the department names. In other words, the above approach will work as expected for Employees.Departments
values like "IT,Human Resources", but not for values like "IT, Human Resources" (note the space after the comma and before the "H" in "Human Resources").
Conclusion
Ideally all applications will use properly normalized data models. However, applications designed by junior level programmers or those who are not
professional developers often include flat, non-normalized data models. With a bit of elbow grease it's possible to create a SQL script that will
automatically convert the flat data model into a normalized one. This article showed a script for converting a many-to-many relationship modeled by
a comma-delimited list into a three-table, normalized data model.