“Big data” is a term that’s been used, and probably overused, to describe compute environments with extreme requirements that are, to some extent, unmet by traditional IT infrastructures.
One definition of big data centers on big data analytics: large pools of randomly accessed files, or in some cases databases, that feed high-performance (often real-time) analytics, such as those involving financial transactions, Internet-facing applications, scientific analysis and event logging. These use cases typically involve very large numbers of small data objects that must be made available for this high-speed analysis.
There’s another definition of big data, one we refer to as “big data archive.” These reference archives need to keep very large numbers of (typically) large files available, usually supporting sequential processing workflows, such as technicians analyzing remote sensing data or video specialists working on different steps in motion-picture post-production.
These data sets have high retention requirements, since many of the files created represent a salable finished product or are subject to regulatory compliance and must be kept for many years.
These archive use cases have a performance component as well, exemplified by the death of a celebrity. In the case of Whitney Houston or Michael Jackson, for example, large numbers of video files needed to be pulled from years’ worth of archives to support news stories and features in a short time. In big data analytics, meanwhile, this performance challenge comes into play in providing the IOPS required to keep the real-time analysis engines supplied with data. For both use cases, consistency of performance can be the critical factor, since traditional storage infrastructures can have problems maintaining predictable performance as they grow.
While capacity, per se, isn’t the defining characteristic of a big data use case, it’s part of the reason there’s a big data discussion in the first place. Essentially, the dramatic growth of data sets has outstripped the ability of traditional storage infrastructures to provide the performance required to support them. And the sheer size of these data sets has made storing big data a costly undertaking. The storage infrastructures must scale affordably, often into the multiple-petabyte range.
Now let’s take a look at some big data technologies and products that attempt to address these issues.
If you can reduce the amount of data stored, everything else seems to get better, and this may be especially true in a big data environment. Compression and deduplication are two examples of this strategy, but applying these technologies to databases can be more complicated than with file data.
One company that has tackled this problem is RainStor. It has developed big data technology, also called RainStor, that provides this data reduction in a structured environment. It can deduplicate and store large sections of a database, providing up to 40-to-1 reduction in the process. It can then allow users to search this compressed database without “rehydrating” the data, using standard SQL query access. The company recently released an edition of its product for the Hadoop environment. It runs natively in the Hadoop cluster on Hadoop Distributed File System (HDFS) files, providing similar data reduction performance and management efficiencies using SQL queries or MapReduce analytics.
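To illustrate why a query can run against reduced data without rehydrating it, here is a minimal Python sketch of value-level deduplication on a single database column. It is not RainStor's method, which this article does not describe; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def dedupe_column(values):
    """Store each distinct value once; each row keeps only an integer reference."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> position in dictionary
    refs = []         # one small integer per original row
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        refs.append(index_of[value])
    return dictionary, refs


def count_where_equal(dictionary, refs, target):
    """Answer 'how many rows equal target?' using only the integer references.

    The original rows are never rebuilted into full values; the lookup
    touches the dictionary once and then compares integers.
    """
    if target not in dictionary:
        return 0
    code = dictionary.index(target)
    return sum(1 for ref in refs if ref == code)


if __name__ == "__main__":
    # Hypothetical column of exchange codes with heavy repetition.
    column = ["NYSE", "NASDAQ", "NYSE", "NYSE", "LSE", "NASDAQ"] * 1000
    dictionary, refs = dedupe_column(column)
    print(len(column), "rows stored as", len(dictionary), "distinct values")
    print("NYSE rows:", count_where_equal(dictionary, refs, "NYSE"))
```

The more repetition a column contains, the fewer distinct values must be kept, which is why structured data such as transaction logs can reduce well. Real products combine this kind of approach with compression and far more sophisticated indexing.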
In the area of database performance, GridIron has developed a block-based cache appliance, called TurboCharger, that leverages flash and DRAM to accelerate applications by up to 10 times in high-performance environments such as Oracle Real Application Clusters (RAC). Unlike traditional caching methods, which use file system metadata to make caching decisions, GridIron creates a “heuristics-driven map” of billions of data blocks on the back-end storage. This enables TurboCharger to run predictive analysis on the data space and place blocks into cache before they’re needed.
As a true cache, the TurboCharger can reduce storage system workloads by as much as 90% and improve write performance up to four times. The system installs on the network, transparent to applications on the front end or storage systems on the back end. It requires no changes to storage volumes and can be scaled out by adding more appliances as bandwidth demands increase.
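The idea of placing blocks into cache before they are requested can be sketched in a few lines of Python. This is an assumption-laden illustration, not GridIron's design: the article does not describe the heuristics TurboCharger uses, so the sketch substitutes the simplest possible predictor (sequential read-ahead) on top of a least-recently-used cache. The class name, parameters and `backend_read` callable are all hypothetical.

```python
from collections import OrderedDict


class PrefetchingBlockCache:
    """LRU block cache that reads ahead when it detects sequential access."""

    def __init__(self, backend_read, capacity=8, read_ahead=2):
        self.backend_read = backend_read  # callable: block number -> data
        self.capacity = capacity          # maximum number of cached blocks
        self.read_ahead = read_ahead      # blocks to prefetch on a sequential read
        self.cache = OrderedDict()        # block number -> data, oldest first
        self.last_block = None
        self.hits = 0
        self.misses = 0

    def _store(self, block, data):
        self.cache[block] = data
        self.cache.move_to_end(block)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block

    def read(self, block):
        if block in self.cache:
            self.hits += 1
            self.cache.move_to_end(block)
            data = self.cache[block]
        else:
            self.misses += 1
            data = self.backend_read(block)
            self._store(block, data)

        # Stand-in predictor: if this read directly follows the previous one,
        # assume the pattern continues and fetch the next blocks now.
        if self.last_block is not None and block == self.last_block + 1:
            for nxt in range(block + 1, block + 1 + self.read_ahead):
                if nxt not in self.cache:
                    self._store(nxt, self.backend_read(nxt))

        self.last_block = block
        return data


if __name__ == "__main__":
    cache = PrefetchingBlockCache(lambda b: f"block-{b}".encode())
    for block in range(5):
        cache.read(block)
    print("hits:", cache.hits, "misses:", cache.misses)
```

Once the sequential pattern is detected, later reads are served from the cache rather than from back-end storage, which is the general mechanism by which a predictive cache offloads the storage system. A production appliance would track far richer access patterns than "the next block."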
In big data archiving, the challenge can be managing the file system environment and scaling it to accommodate very large numbers of files. Quantum’s StorNext is a heterogeneous SAN file system that provides high-speed, shared access among Linux, Macintosh, Unix and Windows client servers on a SAN. In addition, a SAN gateway server can provide high-performance access to LAN clients.
Also part of StorNext is the Storage Manager, a policy-based archive engine that moves files among disk storage tiers and, if implemented, a tape archive. StorNext also supports asynchronous replication of directories between SAN-based clients for data protection as well as integrated deduplication to reduce storage and bandwidth usage.
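A policy-based archive engine of this general kind can be pictured as a rule table that maps file attributes to storage tiers. The Python sketch below is a generic illustration only and does not reflect StorNext's actual policy model or interfaces; the tier names, the idle-day thresholds and the `FileRecord` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    days_since_access: int
    tier: str = "primary_disk"


# Hypothetical policy: (minimum idle days, target tier), coldest rule first.
POLICY = [
    (365, "tape"),
    (30, "nearline_disk"),
    (0, "primary_disk"),
]


def target_tier(record, policy=POLICY):
    """Return the tier of the first rule whose idle threshold the file meets."""
    for min_idle_days, tier in policy:
        if record.days_since_access >= min_idle_days:
            return tier
    return record.tier  # no rule matched; leave the file where it is


def plan_moves(records, policy=POLICY):
    """List (path, from_tier, to_tier) for every file whose tier should change."""
    moves = []
    for record in records:
        destination = target_tier(record, policy)
        if destination != record.tier:
            moves.append((record.path, record.tier, destination))
    return moves


if __name__ == "__main__":
    files = [
        FileRecord("promo_cut.mov", days_since_access=2),
        FileRecord("raw_footage.mov", days_since_access=45),
        FileRecord("2009_concert.mov", days_since_access=800),
        FileRecord("tribute_clip.mov", days_since_access=10, tier="tape"),
    ]
    for move in plan_moves(files):
        print(move)
```

The last sample file shows the recall direction: a file on tape that has recently been accessed is planned for a move back to disk, which is the situation the archive use case faces when old footage is suddenly in demand.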
Big data opportunities
Generally speaking, existing storage infrastructures are not keeping up with the performance (IOPS and streaming), economical capacity and flexibility requirements presented by big data, both in the analytics and archive use cases. Given the scope of the problem, the big data technologies these customers settle on will usually include components from multiple vendors. This means users need VARs and system integrators that can represent a broad spectrum of technologies and integrate them into the most effective solutions. For these reasons, big data can present some big opportunities for VARs.
Eric Slack is a senior analyst with Storage Switzerland.
This was first published in April 2012.