Introduction

This blog is about my personal passion: technology. Working in technology is fun, despite the odd working hours and stress. I have enjoyed working in the technology field and have learned a great deal about the value it generates for everyone. I don't claim to be a super expert on technology, but I can share some thoughts that might be interesting.

I would like to write a few posts on various tech topics, from data storage to video serving, and see where this blog goes. If you would like to share your thoughts, send your comments.

Happy blogging!
-ravi

Monday, April 7, 2008

De-duplication: what is it?

This is a data storage technology in which I see a lot of promise for the future. Of course, there is a lot of confusion about that promise, so be aware that its potential depends on the type of application.

De-duplication is a disruptive technology that reverses the “duplication” of data in traditional backup environments. The technology reduces the data to be backed up by an order of magnitude. Traditional backup solutions are burdened with a growing backup window and a growing amount of archived data. Compression techniques and retention policies have been implemented to address the data growth. However, these methods do little to reduce the burden.

The de-duplication process works on reversing this data growth. The technology divides the data into segments, and only the segments that have been modified are backed up to secondary storage. Redundant segments are identified by a commonality factor and are not part of the backup. A typical de-dupe application is expected to achieve around 200-500x data reduction in a backup environment. This technology is disruptive to the current backup environment and will play a major role in the coming 3-5 years.
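To make the segment idea concrete, here is a minimal sketch in Python. It is not how any particular product works. It assumes fixed-size segments, uses a SHA-256 fingerprint as a stand-in for the "commonality" check, and treats a plain dictionary as the secondary storage. The names `SEGMENT_SIZE`, `backup` and `restore` are made up for illustration.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed fixed segment size in bytes (illustrative)


def split_segments(data: bytes, size: int = SEGMENT_SIZE) -> list[bytes]:
    """Cut the data into fixed-size segments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(data: bytes, store: dict[str, bytes]) -> tuple[list[str], int]:
    """Back up data into store.

    Returns the list of fingerprints needed to rebuild the data and the
    number of bytes that actually had to be written to the store.
    """
    recipe = []
    new_bytes = 0
    for segment in split_segments(data):
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint not in store:  # segment not seen before: store it
            store[fingerprint] = segment
            new_bytes += len(segment)
        recipe.append(fingerprint)    # redundant segments cost only a reference
    return recipe, new_bytes


def restore(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data from its list of fingerprints."""
    return b"".join(store[fp] for fp in recipe)


# Two nightly backups where only the middle segment changed.
store: dict[str, bytes] = {}
monday = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
tuesday = b"A" * 4096 + b"X" * 4096 + b"C" * 4096
recipe_mon, written_mon = backup(monday, store)
recipe_tue, written_tue = backup(tuesday, store)
```

In this toy run the second backup writes only the one changed segment, while both backups can still be restored in full. Real products typically use smarter segmenting and indexing than this.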

De-duplication saves WAN bandwidth and slows the growth of secondary storage by reducing the overall data in the backup process. In effect, it reduces network bandwidth costs, secondary storage costs, support costs and installation costs. De-duplication can be performed at the source or at the target of the backup data. Source de-dupe reduces network bandwidth as well as secondary storage, whereas target de-dupe reduces only secondary storage. EMC’s Avamar and the Data Domain appliance series are examples of source-based and target-based de-duplication, respectively.
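The difference between source and target de-dupe can be shown with a small Python sketch. It is only an illustration under simple assumptions: fixed-size segments, SHA-256 fingerprints, a dictionary standing in for the target's storage, and no accounting for the small fingerprint exchange that a source-side client would need. The function names are hypothetical and not taken from any product.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed segment size in bytes (illustrative)


def fingerprints(data: bytes) -> dict[str, bytes]:
    """Map each segment's SHA-256 fingerprint to the segment itself."""
    segments = (data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE))
    return {hashlib.sha256(s).hexdigest(): s for s in segments}


def target_dedupe(data: bytes, store: dict[str, bytes]) -> tuple[int, int]:
    """All data crosses the network; the target drops duplicates.

    Returns (bytes_sent, bytes_stored).
    """
    sent = len(data)
    stored = 0
    for fp, seg in fingerprints(data).items():
        if fp not in store:
            store[fp] = seg
            stored += len(seg)
    return sent, stored


def source_dedupe(data: bytes, store: dict[str, bytes]) -> tuple[int, int]:
    """Only segments the target lacks cross the network.

    Returns (bytes_sent, bytes_stored). Fingerprint traffic is ignored here.
    """
    missing = {fp: seg for fp, seg in fingerprints(data).items() if fp not in store}
    store.update(missing)
    sent = sum(len(seg) for seg in missing.values())
    return sent, sent


first = b"A" * 4096 + b"B" * 4096
second = b"A" * 4096 + b"C" * 4096   # one segment changed since the first backup

target_store: dict[str, bytes] = {}
target_first = target_dedupe(first, target_store)
target_second = target_dedupe(second, target_store)

source_store: dict[str, bytes] = {}
source_first = source_dedupe(first, source_store)
source_second = source_dedupe(second, source_store)
```

Both approaches end up storing the same amount, but on the second backup the source-side version sends only the changed segment over the wire, while the target-side version sends everything.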

Major storage players are focusing on de-duplication. Many backup software vendors have started working on de-dupe solutions as part of their offerings. Archiving vendors, such as VTL vendors, are integrating de-dupe into their archiving solutions.

De-duplication is a technology that can go beyond backup and archiving environments. Data reduction is desired functionality in areas such as replication, and de-duplication in the replication market is an untapped opportunity. It can reduce the data sent to the remote site in remote replication and increase the performance of asynchronous mirroring applications. Because the technology is at an early stage, replication vendors haven’t fully embraced de-duplication.

Many de-duplication solutions reduce data by focusing on data changes over time, which can be referred to as temporal de-duplication. NetApp introduced a Single-Instance-Storage de-duplication solution that takes advantage of the commonality of data within storage, which can be referred to as spatial de-duplication. Spatial de-duplication is a relatively new concept and a potential area for vendors to tap new opportunities. It could reduce primary as well as secondary storage needs for certain applications.
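As a rough illustration of the spatial idea, the Python sketch below looks at a single point-in-time set of files and compares the logical size with the space needed if each distinct segment were stored only once. It is my own simplification, not NetApp's implementation. It assumes fixed-size segments and SHA-256 fingerprints, and the file names and the `spatial_savings` function are invented for the example.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed segment size in bytes (illustrative)


def spatial_savings(files: dict[str, bytes]) -> tuple[int, int]:
    """Return (logical_bytes, physical_bytes) for one snapshot of files.

    logical_bytes counts every segment in every file; physical_bytes counts
    each distinct segment once, no matter how many files contain it.
    """
    unique: dict[bytes, int] = {}
    logical = 0
    for content in files.values():
        for i in range(0, len(content), SEGMENT_SIZE):
            segment = content[i:i + SEGMENT_SIZE]
            logical += len(segment)
            unique.setdefault(hashlib.sha256(segment).digest(), len(segment))
    return logical, sum(unique.values())


# Three files on the same volume that share a common header segment;
# the third is an exact copy of the first.
files = {
    "a.doc": b"H" * 4096 + b"1" * 4096,
    "b.doc": b"H" * 4096 + b"2" * 4096,
    "copy_of_a.doc": b"H" * 4096 + b"1" * 4096,
}
logical, physical = spatial_savings(files)
```

No history of changes is involved here: the saving comes purely from data that is common across files at the same moment, which is why it could apply to primary storage as well.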

De-duplication is a game-changing technology for the coming 3-5 years. Even though it is gaining traction in backup environments, the technology has potential in many areas where data reduction is desired. Both temporal and spatial de-duplication have advantages in certain application environments.
