Science Behind Data Deduplication


Of all the storage technologies emerging today, data deduplication is attracting the most attention. Unlike compression, data deduplication identifies and eliminates redundant data, saving companies enormous amounts of capacity, which translates into significant dollar savings.

So, how does this technology work? Let's look under the hood and explore the science behind data deduplication.


Understanding Data Deduplication

Data deduplication, sometimes called "intelligent compression" or "single-instance storage," is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy.

For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand is reduced to just 1 MB.

When byte-level changes are made to a file, the modified file no longer hashes to the same value as the original, so a second full copy of the file is automatically created. That means if you change the title on a 10 MB PDF file, even though the change takes up only a couple of bytes, your file share winds up holding 20 MB worth of content.
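
To make the pointer idea concrete, here is a minimal Python sketch of file-level, single-instance storage. The names and in-memory dictionaries are purely illustrative, not any particular product's design, but the logic mirrors the email example above and shows why a small byte-level edit forces a second full copy at this granularity:

    import hashlib

    # Minimal sketch of file-level (single-instance) deduplication.
    # One copy of each unique file is kept, keyed by its hash; every
    # additional "save" of the same content just records a pointer.
    store = {}      # hash -> file contents (the single stored instance)
    pointers = {}   # filename -> hash (the lightweight reference)

    def save(name, data):
        digest = hashlib.sha1(data).hexdigest()
        if digest not in store:      # only unique content consumes space
            store[digest] = data
        pointers[name] = digest      # a duplicate costs just a pointer

    attachment = b"x" * 1_000_000    # a 1 MB attachment
    for i in range(100):             # 100 mailboxes hold the same file
        save("mailbox_%d/report.pdf" % i, attachment)
    print(len(store))                # 1 -- one stored instance, not 100

    # A byte-level change produces a different hash, so at file-level
    # granularity the modified file is stored again in full.
    save("mailbox_0/report.pdf", b"y" + attachment[1:])
    print(len(store))                # 2 -- two full copies now on disk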

Data deduplication technologies keep track of minor changes and can create versions on the fly by reapplying the differences. For example, if you have a single 10 MB presentation that needs to be sent out to 10 different customers, changing the names and addresses to personalize the presentations will not eat up 100 MB of space, even if you save a version for each customer.
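
The snippet below is a simplified illustration of that versioning idea, using Python's standard difflib rather than any vendor's engine: one base presentation is kept in full, and each customer's copy is stored as a small delta that can be reapplied on demand.

    import difflib

    # Sketch of version-by-difference: keep one base document and, for
    # each customer, store only the edits needed to derive their copy.
    def make_delta(base_lines, new_lines):
        ops = difflib.SequenceMatcher(None, base_lines, new_lines).get_opcodes()
        # keep payload only where the new version diverges from the base
        return [(tag, i1, i2, [] if tag == "equal" else new_lines[j1:j2])
                for tag, i1, i2, j1, j2 in ops]

    def apply_delta(base_lines, delta):
        # rebuild the personalized version on the fly from base + delta
        out = []
        for tag, i1, i2, payload in delta:
            out.extend(base_lines[i1:i2] if tag == "equal" else payload)
        return out

    base = ["Proposal\n", "Dear Customer,\n", "...10 MB of slides...\n"]
    custom = ["Proposal\n", "Dear Ms. Rivera,\n", "...10 MB of slides...\n"]
    delta = make_delta(base, custom)    # tiny: just the changed greeting
    assert apply_delta(base, delta) == custom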

When a unique segment is written to the file system, a hash is associated with it, and knowledge of it is kept in a repository. If a recognized segment is sent to the file share, the data deduplication software acknowledges it but doesn't write the segment to storage again; it simply records another reference to the existing copy.

So when a user tries to pull up her version of the file, the data deduplication software rebuilds the file from the stored segments on the server.
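
Putting the write and read paths together, here is a compact Python sketch of such a hash-indexed segment repository. The fixed 4 KB segment size and the names are assumptions made for illustration; production systems typically use variable-size, content-defined chunking so that an insertion near the start of a file doesn't shift every later segment boundary.

    import hashlib

    CHUNK = 4096            # fixed-size segments, for illustration only
    chunk_store = {}        # hash -> segment bytes (the repository)
    recipes = {}            # filename -> ordered list of segment hashes

    def write_file(name, data):
        # hash each segment; store it only if the repository is seeing
        # it for the first time, otherwise just note the reference
        hashes = []
        for i in range(0, len(data), CHUNK):
            segment = data[i:i + CHUNK]
            digest = hashlib.sha1(segment).hexdigest()
            if digest not in chunk_store:
                chunk_store[digest] = segment
            hashes.append(digest)
        recipes[name] = hashes

    def read_file(name):
        # rebuild the user's file on demand from its stored segments
        return b"".join(chunk_store[h] for h in recipes[name])

    original = b"".join(bytes([i]) * CHUNK for i in range(10))  # ten distinct segments
    write_file("deck_customer1.ppt", original)
    # personalize only the first segment; the other nine already exist
    write_file("deck_customer2.ppt", b"B" * CHUNK + original[CHUNK:])
    print(len(chunk_store))             # 11 segments stored, not 20
    assert read_file("deck_customer2.ppt") == b"B" * CHUNK + original[CHUNK:]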

Data deduplication can generally operate at the file, block, and even the bit level. File deduplication eliminates duplicate files (as in the examples above), but this is not a very efficient means of deduplication. Block and bit deduplication look within a file and save unique iterations of each block or bit. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This process generates a unique identifier for each piece, which is then stored in an index. If a file is updated, only the changed data is saved. That is, if only a few bytes of a document or presentation are changed, only the changed blocks or bytes are saved; the changes don't constitute an entirely new file. This behavior makes block and bit deduplication far more efficient.

Data deduplication is especially powerful when it's applied to backup, because most backup data sets have a great deal of redundancy. It's common to see a backup appliance with data deduplication technology holding 10 to 50 times more backup data than a conventional disk storage product. The advantage depends on the data being backed up, the backup methodology, and the length of time data is retained.


Benefits of Data Deduplication

Data deduplication offers several important benefits. Storage costs are reduced because less storage is needed, which means fewer disks and less frequent disk purchases. Less data also means smaller backups, which translates into shorter backup windows and faster recovery time objectives (RTOs). The smaller backups also allow for longer retention times on virtual tape libraries (VTLs) or archives.

Call Hardwyre today at 501.851.2880 to learn more about how data deduplication technology can help your company increase storage space, boost backup efficiency, and save time and money.