Tuesday, November 20, 2012

Data Deduplication

It eliminates redundant copies of data and offers bandwidth optimization and storage space savings.

Environments it suits:

  • File systems
  • Databases with low change rates
  • Virtualization
  • LAN / SAN
  • Archives
  • Backup (where data is highly redundant)

Deployment:

      Source (before):
        File system
        Virtualization
        Remote Office / Branch Office

      Target (after):
        Database
        LAN / SAN

In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single-instance (data) storage. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
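As a rough illustration of this chunk-and-reference idea, here is a minimal sketch in Python. The fixed 4 KB chunk size, the SHA-256 fingerprint, and the in-memory dictionary are assumptions made for the example only; real products use their own segmentation and index structures.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size for this sketch


class ChunkStore:
    """Keeps one copy of each unique chunk, keyed by its fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, data: bytes) -> list[str]:
        """Store data and return the list of references that describe it.
        Chunks already present are replaced by references only."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.chunks:   # unique chunk: store it once
                self.chunks[fingerprint] = chunk
            refs.append(fingerprint)             # a duplicate costs only a reference
        return refs

    def get(self, refs: list[str]) -> bytes:
        """Rebuild the original data from its references."""
        return b"".join(self.chunks[fp] for fp in refs)
```

Storing the same data twice adds nothing to the chunk dictionary; only the short list of references grows, which is where the space savings come from.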

For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.
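The arithmetic behind the quoted ratio, using the figures from the example:

```python
instances = 100        # identical 1 MB attachments in the mail store
attachment_mb = 1

logical_mb = instances * attachment_mb   # 100 MB written without deduplication
physical_mb = attachment_mb              # one stored copy; references are negligible
print(f"deduplication ratio ~ {logical_mb / physical_mb:.0f}:1")   # ~100:1
```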

Why deduplicate data?

Eliminating redundant data can significantly shrink storage requirements and improve bandwidth efficiency. Because primary storage has gotten cheaper over time, enterprises typically store many versions of the same information so that new work can reuse old work. Some operations, such as backup, store extremely redundant information. Deduplication lowers storage costs, since fewer disks are needed, and shortens backup and recovery times, since there can be far less data to transfer. In the context of backup and other nearline data, it is safe to assume there is a great deal of duplicate data: the same data keeps getting stored over and over again, consuming unnecessary storage space (disk or tape), electricity (to power and cool the disk or tape drives), and bandwidth (for replication), creating a chain of cost and resource inefficiencies within the organization.

How does data deduplication work?

Deduplication segments the incoming data stream, uniquely identifies the data segments, and then compares the segments to previously stored data. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again; instead, a reference to the existing segment is created. If the segment is unique, it is stored on disk.
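A minimal sketch of that decision path, assuming fixed-size segments and SHA-256 fingerprints (real systems typically use variable-size segmentation and a persistent index, so treat this purely as an illustration):

```python
import hashlib

SEGMENT_SIZE = 8192   # assumed segment size for this sketch
index = set()         # fingerprints of segments already stored on disk


def ingest(stream: bytes):
    """Yield one decision per incoming segment: store it, or only reference it."""
    for i in range(0, len(stream), SEGMENT_SIZE):
        segment = stream[i:i + SEGMENT_SIZE]
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint in index:
            yield ("reference", fingerprint)   # duplicate: nothing written to disk
        else:
            index.add(fingerprint)
            yield ("store", fingerprint)       # unique: written to disk


backup = b"weekly backup payload" * 5000
first = sum(1 for action, _ in ingest(backup) if action == "store")
second = sum(1 for action, _ in ingest(backup) if action == "store")
print(first, second)   # 13 0 -- the repeated backup stores nothing new, only references
```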

For example, a file or volume that is backed up every week creates a significant amount of duplicate data. Deduplication algorithms analyze the data and can store only the compressed, unique change elements of that file. With typical backup retention policies on normal enterprise data, this process can provide a 10 to 30 times or greater reduction in storage capacity requirements. This means that companies can store 10 TB to 30 TB of backup data on 1 TB of physical disk capacity, which has huge economic benefits.

In-line deduplication

This is the process where the deduplication hash calculations are performed on the target device as the data enters it in real time. If the device spots a block that it has already stored, it does not store the new block again; it simply references the existing block. The benefit of in-line deduplication over post-process deduplication is that it requires less storage, as data is never written in duplicate. On the negative side, it is frequently argued that because the hash calculations and lookups take time in the data path, ingestion can be slower, reducing the backup throughput of the device. However, certain vendors with in-line deduplication have demonstrated equipment with performance similar to their post-process counterparts.
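A rough sketch of that in-line write path, where the fingerprint is computed and looked up before anything hits the backing store (the class and its in-memory structures are hypothetical; real appliances keep the index in RAM or on fast media precisely because this lookup sits in the data path):

```python
import hashlib


class InlineDedupTarget:
    """Sketch of an in-line deduplicating target: the hash is computed and
    checked while the block is being ingested, before it is written."""

    def __init__(self):
        self.index = {}   # fingerprint -> position of the stored block
        self.store = []   # stand-in for the device's disk

    def write_block(self, block: bytes) -> int:
        fingerprint = hashlib.sha256(block).hexdigest()   # cost paid in the data path
        if fingerprint in self.index:
            return self.index[fingerprint]    # duplicate: no new block is written
        self.store.append(block)              # unique: the only real write
        self.index[fingerprint] = len(self.store) - 1
        return self.index[fingerprint]
```

Because the fingerprinting and lookup sit directly in the write path, their latency is what limits ingest throughput in an in-line design; a post-process device would append every block first and run the same comparison in the background later.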

Target deduplication is the process of removing duplicates of data in the secondary store. Generally, this will be a backup store such as a data repository or a virtual tape library. Deduplication can work at two levels of granularity, contrasted in the sketch below:

  • File Level Deduplication
  • Block Level Deduplication
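The difference between the two levels can be sketched as follows: file-level (single-instance) deduplication fingerprints whole files, so a one-byte change defeats it, while block-level deduplication fingerprints sub-file blocks, so only the changed block has to be stored again. The 4 KB block size and SHA-256 hashing are assumptions for the illustration.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size for the illustration


def file_fingerprint(data: bytes) -> str:
    """File-level deduplication keys on the whole file's fingerprint."""
    return hashlib.sha256(data).hexdigest()


def block_fingerprints(data: bytes) -> list[str]:
    """Block-level deduplication keys on each fixed-size block."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]


original = os.urandom(10 * BLOCK_SIZE)               # a 40 KB "file"
edited = bytes([original[0] ^ 0xFF]) + original[1:]  # same file with one byte flipped

# File level: the fingerprints differ, so the whole edited file is stored again.
print(file_fingerprint(original) == file_fingerprint(edited))   # False

# Block level: only the block containing the change is new.
known = set(block_fingerprints(original))
new_blocks = [fp for fp in block_fingerprints(edited) if fp not in known]
print(f"{len(new_blocks)} of 10 blocks must be stored again")    # 1 of 10
```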
