Microsoft's SQL Server Integration Services (SSIS) is an evolutionary rewrite of the company's SQL Server extract, transform and load (ETL) platform, DTS (Data Transformation Services). Microsoft rewrote DTS from the ground up to address the increasing importance of business intelligence (BI) tools, and to a lesser degree, the explosion of data and customer feature demands. We'll explore how those changes have affected ETL processes and how service providers can put SSIS to good use.
Data growth leading to BI demands
When SQL Server 2000 was released, the entire world's digital data was estimated to total 1 petabyte (1,000 terabytes). Estimates done in 2007 revised that total to 7 petabytes.
Why such a huge increase in data?
In the past, with prohibitive storage costs and relational database management systems (RDBMSes) unable to process large volumes of data, most companies warehoused only the data they judged important to their goals. As storage costs decreased, server horsepower increased and business intelligence tools became more sophisticated, data architects frequently found that data that could have yielded business intelligence simply didn't exist -- it was never recorded, never collected, never warehoused, not accessible, not recoverable or, all too often, not retained (in other words, deleted).
The pendulum then swung in the other direction, toward full retention. Today, businesses realize how critical it is to derive actionable intelligence from complete records of their historical data. They know that what seems irrelevant today may become critical in the future; they understand that their data has tremendous value and that it's important to warehouse it and keep it online. The trend now is to collect and retain all that data rather than leave it uncollected or delete it.
Complicating this growth in data, another factor is driving the need for business intelligence operations that are invisible to users. Today's businesses are global -- they follow the sun. At 6 p.m., your customer's data warehouse in Hoboken might be accessed by users in Sydney, where it is 8 a.m. In the past, data loads were run overnight using a two-write staging process, in which data was first cleansed in a staging table and then loaded into the destination tables. Today, global access requires data loads that are incremental and fast and that have a low resource footprint.
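For context, the traditional two-write staging process described above can be sketched roughly as follows. This is a minimal illustration, not SSIS itself: the table names and the cleansing rule are hypothetical, and Python's sqlite3 stands in for SQL Server.

```python
import sqlite3

# In-memory database standing in for a warehouse; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (id INTEGER, amount TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# Write 1: land the raw extract in a staging table.
raw_rows = [(1, " 19.99 "), (2, "5.00"), (3, "bad-data")]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", raw_rows)

# Cleanse: keep only rows whose amount parses as a number.
cleansed = []
for row_id, amount in conn.execute("SELECT id, amount FROM staging_orders"):
    try:
        cleansed.append((row_id, float(amount.strip())))
    except ValueError:
        pass  # a real load would route these rows to an error table

# Write 2: load the cleansed rows into the destination table.
conn.executemany("INSERT INTO orders VALUES (?, ?)", cleansed)
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 2 valid rows
```

The overhead is plain even in this toy: every row is written twice, which is exactly the cost that overnight batch windows used to absorb.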
So how does SSIS play into this scenario?
Microsoft's approach in revamping DTS was to orient the tool set around business intelligence. ETL functions handle the data loading, but SSIS goes beyond simple ETL functionality to an integrated BI environment.
SSIS is now bundled as part of a suite of products under the BI umbrella. The BI suite includes SSIS, Analysis Services and Reporting Services. These services, along with the text and data mining features, are accessed through a Visual Studio shell called Business Intelligence Development Studio (BIDS). BIDS is a one-stop shop that allows data architects to use SSIS to extract, transform and load data into their warehouses, cleanse it, prep it for mining and then analyze it within Analysis Services. Analysis Services lets data and business analysts explore the data so they can do reporting, forecasting and trending.
SSIS has good controls for improving the quality of data. If data is not cleansed and consolidated, it frequently cannot be used for business insight. For example, if the data in a field is not discrete enough to be categorized, all we really know is that it was collected. Consider a medical clinic trying to derive business intelligence from patients' medical tests stored in a table. If the results of each test are stored in a column using the lab's test result codes and the data architect has no understanding of what those codes mean, all he knows for sure is that the tests were done -- an unfortunate result of the lack of uniformity in medical test result reporting. Doctors or medical support staff will have to analyze the test results and then adjust the entries in the database to make them actionable.
SSIS also improves the handling of duplicate data, which must be detected and purged. This may involve lookup tables to reformat the data into a normalized form (for example, the same customer may exist in different tables with no recognition that it is the same person) or conversion of data from one unit to another. SSIS includes "fuzzy lookup" and "fuzzy grouping" transformations, which reconcile inconsistencies in data -- for example, if "NYC" and "Manhattan" are not mapped to a common value, proper relationships might not be drawn. Auditing requirements frequently demand data tracking, and SSIS has rich tools to support it.
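As a rough illustration of what fuzzy grouping accomplishes -- this is not SSIS's actual algorithm, just a sketch using Python's difflib -- near-duplicate customer names can be clustered by string similarity:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def fuzzy_group(names, threshold=0.85):
    """Greedy grouping: each name joins the first group whose canonical
    (first) member is at least `threshold` similar to it."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(group[0], name) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

customers = ["John Smith", "Jon Smith", "john  smith", "Mary Jones"]
print(fuzzy_group(customers))
# The three "Smith" variants land in one group; "Mary Jones" stands alone.
```

SSIS's fuzzy transformations are considerably more sophisticated (token-based and configurable), but the principle is the same: variants that a strict equality join would miss get reconciled to one canonical record.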
SSIS also has improved performance over DTS. In most ETL tools, data is stored in transient work tables through a process called staging. Data may be imported, converted and stored to a staging table for further processing, and then written to its final destination, another table. SSIS, on the other hand, caches data in memory, where it can be accessed and manipulated quickly, and avoids going to disk multiple times. The improved performance of SSIS is important, especially for businesses with a global presence. They don't have the luxury of performing their ETL cycles at night, as has been the custom for companies using older ETL tools, since having international customers necessitates 24-hour data access.
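The pipeline idea -- streaming rows through transformations in memory rather than landing them in intermediate tables -- can be sketched with Python generators. This is an analogy for the concept, not the SSIS engine; the row format and transformations are invented for illustration:

```python
def extract(rows):
    """Source: yield raw rows one at a time instead of materializing a table."""
    yield from rows

def convert(rows):
    """Transformation: tidy names and convert amounts from cents to dollars."""
    for name, cents in rows:
        yield name.strip().title(), int(cents) / 100

def load(rows):
    """Destination: collect finished rows (a real load would write to a table)."""
    return list(rows)

raw = [("  alice ", "1999"), ("BOB", "500")]
# Each row flows extract -> convert -> load with no disk write in between.
result = load(convert(extract(raw)))
print(result)  # [('Alice', 19.99), ('Bob', 5.0)]
```

Because each row passes through the whole chain before the next is pulled, there is no intermediate "staging" copy of the data set -- which is the performance point the paragraph above makes.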
The Reporting Services feature, introduced late in the SQL Server 2000 lifecycle, allows businesses to create Web-based reports that can be archived as Excel spreadsheets and PDFs, for example. Reporting Services can be used to display SQL Server or Analysis Services data.
Monetizing SSIS services
For service providers interested in selling SSIS to their customers, there are a number of key selling points:
- SSIS ships with many components that allow data architects to quickly derive business intelligence from their data. There's an opportunity for service providers to cross-sell other business intelligence services along with their SSIS-centric offerings. SSIS consolidates and cleanses the data; service providers can take it to the next step by educating their clients on how to use SSIS to integrate, cleanse, analyze and report on their data, and by showing customers how to use the BI tool set to increase profitability and lower costs.
- SSIS has a flexible architecture that allows for remodeling on the fly. With most ETL tools, modifications to existing imports might involve retooling. SSIS allows data architects to be much more productive than with other ETL tools; this increased productivity comes from SSIS's pipeline architecture, its breakpoint features, which enable rich debugging, and its ability to pipe rows that error out to error tables. SSIS has many components that assist with data consolidation and data quality to help with the data load process.
- SSIS performs much better than most ETL tools. To take maximum advantage of its performance, data architects should abandon traditional staging tables and staging areas. This is not to say, of course, that SSIS does not work well in traditional staging environments. SSIS is also designed for incremental refreshes of data, as opposed to the traditional nightly batch operations that occur in most data warehouses. If you can educate customers on this paradigm shift, they can capitalize on SSIS's powerful engine.
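One common way to implement the incremental refreshes mentioned above is a high-water-mark pattern: remember the highest key (or timestamp) already loaded and pull only newer rows. A minimal sketch, with hypothetical table names and sqlite3 standing in for SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE warehouse_sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO source_sales VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])

def incremental_load(conn):
    """Copy only the source rows newer than the warehouse's high-water mark."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(id), 0) FROM warehouse_sales").fetchone()
    new_rows = conn.execute(
        "SELECT id, amount FROM source_sales WHERE id > ?", (watermark,)).fetchall()
    conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)", new_rows)
    return len(new_rows)

print(incremental_load(conn))  # first run loads all 3 rows
conn.execute("INSERT INTO source_sales VALUES (4, 40.0)")
print(incremental_load(conn))  # second run loads only the 1 new row
```

Because each run touches only the delta, the load stays fast and light on resources -- exactly what round-the-clock global access demands.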
SSIS is a powerful platform for the ETL processes used in data warehousing. With SSIS, Microsoft has rewritten DTS to address the data explosion and the business requirement to derive intelligence from data. Service providers should provide leadership for their clients in how to correctly deploy SSIS. They should take the data warehouse to the next level and add business intelligence functionality to enable customers to see trends, do forecasting and react to any changes their data may reveal.
About the author
Hilary Cotter has been involved in IT for more than 20 years as a Web and database consultant. Microsoft first awarded Cotter the Microsoft SQL Server MVP award in 2001. Cotter received his bachelor of applied science degree in mechanical engineering from the University of Toronto and subsequently studied economics at the University of Calgary and computer science at UC Berkeley. He is the author of a book on SQL Server transactional replication and is currently working on books on merge replication and Microsoft search technologies.