Data Reduction in Primary Storage – Boon or a Burden?

Many new storage players have entered the market in recent years, and the competition is fierce. No vendor wants to fall behind in offering feature-rich products to attract customers. In addition, there is considerable FUD (fear, uncertainty and doubt) around managing storage in traditional ways, driven by the advent of many disruptive technologies, and this trend is likely to continue.

This blog focuses only on the data reduction features offered by primary storage vendors in general, along with their benefits and downsides. Data reduction in storage means storing less data on disk than the amount of user data generated. This increases storage efficiency in terms of capacity and hence reduces cost. Let’s look at this further and at its implications for customers.

Data reduction in storage can be achieved using several techniques. De-duplication and compression are the two most prominent; both have been in use since the days of backup storage and have gained significance in primary storage over the last few years. Many storage vendors offer data reduction features in their primary storage products, including EMC, NetApp, Microsoft, Oracle Sun, IBM Storwize, Dell Ocarina Networks, Pure Storage, Nexenta, Tegile Systems and many more.

It’s essential to understand what primary storage compression and de-duplication mean, what kinds of data patterns are suitable for these technologies, which de-duplication methods are available, and what their pros and cons are.

Compression is a data reduction technique that uses an algorithm to reduce the amount of physical disk space the data occupies. Compression can be done at the file system level or at the storage array level. Compression algorithms can be lossless or lossy. Lossless compression reduces bits by identifying and eliminating statistical redundancy, and no information is lost. The LZ family, such as LZ1 (LZ77) and LZ2 (LZ78), is the most popular among lossless methods. For example, LZ1 is the basis of GZIP, PKZIP, WINZIP, ALDC, LZS and PNG, among others, and LZ2 is the basis for LZW and DCLZ.

Lossy compression, by contrast, reduces bits by identifying non-essential information and removing it, on the assumption that some loss of information is acceptable. JPEG and MP3/MP4 are some of the popular lossy compression methods. Lossy compression is mostly found in video streaming and photographs. Lossy algorithms generally achieve higher compression ratios than lossless ones, but once the data is uncompressed, it’s impossible to get back the original content exactly.

The compression ratio quantifies the reduction in data produced by a compression algorithm. Typically it is the ratio of the uncompressed data size to the compressed data size. The compression ratio also reflects the complexity of a data stream and is sometimes used to approximate its algorithmic complexity.
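To make the compression ratio concrete, here is a minimal sketch using Python's standard zlib module (DEFLATE, an LZ1-based lossless method). The function name compression_ratio and the sample data are assumptions made purely for illustration; a real storage array would use its own algorithm and block sizes, so treat the numbers as indicative only.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return uncompressed size divided by compressed size (zlib/DEFLATE)."""
    compressed = zlib.compress(data, level)
    # Lossless: decompressing must give back exactly the original bytes.
    assert zlib.decompress(compressed) == data
    return len(data) / len(compressed)


# Hypothetical sample data: one highly redundant stream, one with no redundancy.
repetitive = b"storage block " * 4096
random_like = os.urandom(len(repetitive))

ratio_repetitive = compression_ratio(repetitive)
ratio_random = compression_ratio(random_like)

print(f"repetitive data: {ratio_repetitive:.1f}:1")
print(f"random data:     {ratio_random:.2f}:1")
```

The redundant stream shrinks dramatically, while the random stream stays at roughly 1:1 (it can even grow slightly). This is why the data pattern matters so much when judging whether compression will pay off for a given workload.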

In contrast, de-duplication reduces the size of the data by detecting repeating patterns, reducing each such pattern to a single instance and leaving pointers to that instance. De-duplication logic can be applied at the file level or the block level. De-duplication techniques are of three types, viz., source side de-duplication, inline de-duplication and post process de-duplication.

Source side de-duplication removes repeating patterns at the data source, before the data is transmitted to the storage. Its key benefits are reduced network bandwidth usage and efficient use of the storage thanks to the smaller data footprint.

Inline de-duplication removes repeating patterns on the fly as data is written to the storage device. It operates in the storage I/O path, so it is CPU intensive and can bring down overall storage performance. Its key benefit is that storage capacity is used efficiently from the moment data is written to disk.

Post process de-duplication removes repeating patterns only after data has been written to disk. Redundancy is eliminated either by a scheduled task, typically run during non-peak hours, or automatically once the data has grown by a certain amount. Its key benefit is performance, since it doesn’t intercept the storage I/O path; the downside is that enough storage capacity is needed to hold the data before it is reduced.

Like the compression ratio, the de-duplication ratio quantifies the reduction of data achieved by de-duplication. It is typically expressed as the ratio of protected capacity to the actual physical capacity stored on disk. The higher the de-dupe ratio, the better the data reduction efficiency.
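The "single instance plus pointers" idea can be sketched at block level as follows. This is only an illustration under stated assumptions: a fixed 4 KiB block size, SHA-256 fingerprints, an in-memory dictionary as the block store, and hypothetical helper names (dedupe_blocks, rebuild, dedupe_ratio). The ratio here is computed as logical bytes over unique stored bytes and ignores pointer metadata; it is not how any particular vendor implements de-duplication.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def dedupe_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}     # fingerprint -> the single stored instance of a block
    pointers = []  # one fingerprint per logical block, in write order
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # store only the first occurrence
        pointers.append(fingerprint)
    return store, pointers


def rebuild(store, pointers) -> bytes:
    """Follow the pointers to reconstruct the original logical data."""
    return b"".join(store[fp] for fp in pointers)


def dedupe_ratio(data: bytes, store) -> float:
    """Logical bytes written divided by unique bytes actually stored."""
    physical = sum(len(block) for block in store.values())
    return len(data) / physical


# Hypothetical workload: ten logical blocks, only three of them distinct.
data = b"A" * BLOCK_SIZE * 5 + b"B" * BLOCK_SIZE * 3 + b"C" * BLOCK_SIZE * 2
store, pointers = dedupe_blocks(data)
ratio = dedupe_ratio(data, store)

print(f"logical blocks: {len(pointers)}, unique blocks stored: {len(store)}")
print(f"de-dupe ratio: {ratio:.2f}:1")
```

Running this fingerprinting on every write is what makes inline de-duplication CPU intensive; a post process design would run the same loop later over data already on disk, which is why it needs the extra capacity up front.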

Different storage vendors use different types of compression and dedupe logic. Some use inline, others use post process. The data reduction achieved with these techniques also depends on other important factors such as the type of data stream, the data retention policy and the rate of data change. If implemented correctly, compression and dedupe significantly reduce the data footprint in tier one storage. With the recent rise of flash based storage, data reduction technologies have turned out to be a boon for making efficient use of premium priced flash capacity. But that does not come free; it comes with some performance impact. So, before choosing a specific vendor’s storage product, it’s important to decide whether performance or a rich feature set is the higher priority. There are many products in the market that are feature rich but have not really kept their promise when performance matters.

Some storage arrays offer dedupe and compression along with encryption. Because each of these features adds processing overhead, it’s important to run a proof of concept on a primary storage array before enabling dedupe and compression. If the data reduction feature is not going to be of much use for a specific workload, it can be disabled to optimize performance; otherwise it will turn out to be a burden rather than a real benefit. For a customer, what matters is not whether a primary storage array is feature rich, but how useful the customer finds a particular product for its specific needs.

Contributed by: Santosh Patnaik | Calsoft Inc

Santosh Patnaik

More than 14 years of solid IT experience in software development, customer support and quality assurance. Has worked extensively in Enterprise Data Storage across NAS/SAN/DAS. As a subject matter expert, focuses in particular on network storage, file systems and storage performance. Currently working as a QA architect at Calsoft.
