The Dedup Dilemma



The VMdamentals blog has moved! This blog post has been moved here. Please update your bookmarks!




7 Responses to “The Dedup Dilemma”

  1. Tom Says:

    I would REALLY appreciate your comments on esXpress de-dup, since it appears to be a challenge to get it working right, and I bought it with the idea of making offsite backups possible in the first place.

    Please also comment on bandwidth and other requirements for offsite backups with esXpress.

  2. erikzandboer Says:

    From what I’ve seen of esXpress dedupe, it works without issues. Remember that there are no longer delta and full backups, but PHDD-type backups. You must configure esXpress to run PHDD backups, and you have to configure a PHDD backup target to match. I would recommend posting your issues on the esXpress forum ( http://www.phdvirtual.com/forums?func=showcat&catid=13 ). I’m sure Pete or someone else from PHD will be able to resolve your issues in a snap!

    Stay tuned for a series of blogposts I have planned on esXpress 3.5!

  3. Tom Says:

    I already use esXpress 3.1.21… I know the support is good, etc.
    I also check the forums, and I see people having issues, so I’m waiting a while.
    My actual request was that you comment on bandwidth issues vis-à-vis offsite backups with the dedupe method.
    3.5 will do *either* PHDD *or* full/delta backups; I think it does not allow you to do both kinds.
    Thank you, Tom

  4. erikzandboer Says:

    This is not exactly the right place to comment on this (I will write more about it in the planned blogposts). Anyway, dedup in esXpress appears to be source-side dedup, which would mean only “new” blocks hit the WAN. Basically, the change rate of the data on the remote site would greatly influence bandwidth usage. You could possibly use sub-10 Mbit lines for daily backups, but I have not tested this enough yet. Soon to come! (A rough sketch of the source-dedup idea follows below the comments.)

  5. Tom Says:

    That is what I meant: please comment on it in your forthcoming blog entries.

    It will help a lot of people if you talk more about the change rate, how to determine what it might be, etc. Most SMBs *only* have <<10 Mbit lines.

    Thank you, Tom

  6. DarkFlib Says:

    You aren’t 100% correct with regard to the use of hashes. I’m sure there are some companies doing what you say, but many write the blocks to disk and then de-dup during idle periods. They only use the hashes to find candidates for the operation, and then do a full byte-wise comparison of the source and destination blocks.

    I also don’t see any reason why this couldn’t be done with inline/online de-dup: even if a read is required to compare, with the right hash algorithm collisions should be fairly rare, and so a block write after the comparison should be correspondingly rare. The net result (if we ignore the hashing operation) is that each write is replaced with a read, which on a RAID array is generally far faster than the write it replaces.

    I don’t know about you, but the uncertainty over the technology is what causes me to avoid it at the current time, although I do use some filesystem-level tools to do similar things (rsnapshot/fdup etc.).

    The biggest downside I see to both block-level de-dup (filesystem-level dedup doesn’t have this issue) and its cousin, thin provisioning (the filesystem equivalent, ‘sparse’ files, is also painful in this respect), is that you can never be sure just how much free space you actually have in the array; you can only guess based on past performance, and if something changes, those projections can be thrown right out the window.

  7. erikzandboer Says:

    DarkFlib,

    A full byte-wise comparison cannot always be done. Apart from being very intensive (read: slow), you cannot use this option when you do source-based dedup, and source dedup is the way to save bandwidth. You are referring to destination dedup, in which you end up sending all data over the network and deduping at the central storage (so you would need a lot of CPU there). You are correct that some vendors do “offline dedup” (the second sketch below the comments shows the hash-then-byte-compare approach you describe), but that requires a lot of extra storage at busy times, and of course you need to have a shop where there are idle times to begin with.

    “Fairly rare”, as you describe it, is not acceptable: one dedup error could be fatal for all your VM backups. That is just unacceptable. Fortunately, really smart mathematicians have designed very strong algorithms which make collisions VERY rare indeed. In fact SO rare that a collision is not expected to occur within a human lifetime (see the rough collision estimate below the comments). Even EMC’s Centera archivers use algorithms like these (not for block data, but for detecting identical entries), and they are SERIOUS about not losing your data!
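
For readers who want to see what “only new blocks hit the WAN” means in practice, below is a minimal, hypothetical sketch of source-side dedup. The block size, the in-memory hash set and the send_block callback are illustrative assumptions, not a description of how esXpress actually implements it.

```python
# Minimal sketch of source-side dedup: the backup source hashes each block and
# only ships blocks whose hash the target has not seen before, so WAN traffic
# scales with the change rate rather than the full disk size.
# BLOCK_SIZE, target_hashes and send_block are illustrative assumptions.

import hashlib

BLOCK_SIZE = 256 * 1024  # arbitrary block size for the sketch


def backup_source_dedup(disk_path, target_hashes, send_block):
    """Send only blocks whose hash the backup target does not already know."""
    sent = skipped = 0
    with open(disk_path, "rb") as disk:
        while True:
            block = disk.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest in target_hashes:
                skipped += 1                 # target already has it: only the hash crossed the WAN
            else:
                send_block(digest, block)    # new or changed block: full data crosses the WAN
                target_hashes.add(digest)
                sent += 1
    return sent, skipped
```

Under these assumptions the daily WAN traffic is roughly the change rate times the disk size, plus a small per-block hash overhead, which is why the change rate (and not the total VM size) decides whether a sub-10 Mbit line is enough.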
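
The “offline dedup” discussed in comments 6 and 7 boils down to a two-stage check on the destination storage: hashes only nominate candidate duplicates, and a full byte-wise comparison decides. A minimal sketch under assumed names (read_block and block_ids stand in for whatever the storage layer actually provides):

```python
# Minimal sketch of post-process (destination) dedup: blocks are already on
# disk; during idle time, equal hashes nominate candidate duplicates, and a
# byte-wise comparison confirms them before they are collapsed to one copy.
# read_block and block_ids are illustrative stand-ins for the storage layer.

import hashlib
from collections import defaultdict


def post_process_dedup(read_block, block_ids):
    """Return a mapping {duplicate block id -> surviving block id}."""
    candidates = defaultdict(list)
    for block_id in block_ids:
        digest = hashlib.sha256(read_block(block_id)).digest()
        candidates[digest].append(block_id)

    remap = {}
    for ids in candidates.values():
        if len(ids) < 2:
            continue                          # unique hash: nothing to collapse
        survivor = ids[0]
        survivor_data = read_block(survivor)
        for other in ids[1:]:
            # a hash match is only a candidate; the byte-wise compare settles it
            if read_block(other) == survivor_data:
                remap[other] = survivor
    return remap
```

Note that all data has already crossed the network and sits on the destination array before this runs, which is exactly the bandwidth and spare-capacity objection raised in comment 7.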
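
To put “VERY rare indeed” into numbers, here is a rough birthday-bound estimate. It assumes a 256-bit hash and a deliberately huge block count; actual products may use different hash sizes, so treat it as an order-of-magnitude illustration only.

```python
# Rough birthday-bound estimate of the chance of at least one hash collision,
# assuming a 256-bit hash (e.g. SHA-256). For n blocks with n much smaller
# than 2**128, P(collision) is approximately n * (n - 1) / 2**257.

from decimal import Decimal


def collision_probability(num_blocks, hash_bits=256):
    """Approximate probability of at least one collision among num_blocks hashes."""
    n = Decimal(num_blocks)
    space = Decimal(2) ** hash_bits
    return n * (n - 1) / (2 * space)


# Roughly one exabyte of unique 4 KB blocks:
print(collision_probability(250_000_000_000_000))   # on the order of 1e-49
```

Even at exabyte scale the estimate sits around 10^-49, which is what makes the “not within a human lifetime” claim defensible.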
