How to make a block-level incremental backup using LVM?

Fernando Lozano fernando at lozano.eti.br
Fri Dec 14 16:04:12 UTC 2012


Hi Alan,

>> Commercial tools promise this ability. How do they get the block-to-file
>> mapping to do the restore? I was looking for a way to do that so I could
>> do the same using LVM snapshots.
> you cannot go block to file. To start with when restoring the block may
> already have been reused for another file.

I suppose they use something like inotify (or their own virtual file 
system driver layered over a real file system, like NFS or a loop fs) to 
learn about changed blocks, then find to which file each block belongs 
and save this info in their backup catalog. If the changed block is 
filesystem (or md device, or LVM) metadata, they have to understand this 
and either log the change appropriately or ignore it, as it's not file data.

I can imagine something like this working, and even how to program it. 
And I'm a little scared about some backup tool monitoring my file 
accesses all the time. ;-)

So I won't find anything similar among open source tools, not even a 
kernel API to help me if I want to implement it myself?


> You can go file to block list, but thats only for some file systems and
> not really reliable except for an unmounted snapshot.
As long as the goal is to capture the data, I can't see why it couldn't 
be done in a reliable way. I'm not saying it would be trivial. But all 
file changes have to go through the kernel, even if they are kept in 
memory before going to disk, so it should be possible for a daemon 
to be notified about all changes and get the data. It's just a matter of 
having a kernel API. I suppose inotify would be it.
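
To make that concrete, a minimal inotify loop looks like the sketch 
below; nothing in it is specific to any backup product, it just shows 
that the notification side exists in the stock kernel. Note that inotify 
reports which files changed (not which blocks), and it doesn't recurse, 
so a real daemon would have to add a watch per directory:

/* Minimal inotify sketch: print created/modified/deleted names under
 * one directory.  Assumes Linux and <sys/inotify.h>. */
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s DIR\n", argv[0]); return 1; }

    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    int wd = inotify_add_watch(fd, argv[1],
                               IN_CLOSE_WRITE | IN_CREATE | IN_DELETE | IN_MOVED_TO);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));  /* blocks until events arrive */
        if (len <= 0) break;

        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len)
                printf("changed: %s/%s (mask 0x%x)\n",
                       argv[1], ev->name, (unsigned)ev->mask);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}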


>> But LVM snapshots are a "whole" disk. If I try to backup them using dd
>> or rsync, they are the same as a full backup. How to backup just the
>> snapshot changed blocks and later restore them (of course after
>> restoring the full volume, or to a mirror)?
> What the snapshot gives you is an atomic copy of the file system so you
> can do a full file system copy, or backup the snapshot without the stuff
> underneath changing. It's basically a way to get an unmounted, out of use
> copy cheaply that you can then use for stuff.
No question about this. I want to move further. Doing a dump or an rsync 
from a snapshot of a multi-TB filesystem takes just as long as doing it 
on the original volume. I want to devise a faster way to do this 
without sacrificing reliability.


> Correct - the only way to check any copy is valid is by comparing the 
> original to the copy. That in fact (plus clever magic) is how rsync 
> works, so in effect the way to check if an rsync copy is valid is to 
> try and rsync it again. Doing a set of sha or md5sums on the two sides 
> and comparing the output now and then ought to provide a further check. 
More time spent on what's already too slow. There could be an rsync- or 
drbd-like tool that calculates, stores and sends hashes on the fly, so 
the remote copy could be checked on its own.
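
In the meantime that part can be approximated by hand: hash the snapshot 
device in fixed-size chunks on each side and diff the two listings. A 
rough sketch, assuming OpenSSL's libcrypto for SHA-256 and an arbitrary 
4 MiB chunk size:

/* Print "offset  sha256" for every 4 MiB chunk of a device or file.
 * Run it against the snapshot and against the remote copy, then diff
 * the two outputs to see which offsets disagree, without shipping the
 * data again.  Link with -lcrypto.  (Reads from a block device
 * normally return full chunks until EOF.) */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <openssl/sha.h>

#define CHUNK (4 * 1024 * 1024)

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s DEVICE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char *buf = malloc(CHUNK);
    unsigned char md[SHA256_DIGEST_LENGTH];
    unsigned long long offset = 0;
    ssize_t n;

    while ((n = read(fd, buf, CHUNK)) > 0) {
        SHA256(buf, (size_t)n, md);
        printf("%llu ", offset);
        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            printf("%02x", md[i]);
        printf("\n");
        offset += (unsigned long long)n;
    }
    free(buf);
    close(fd);
    return 0;
}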


>> There has to be a better way to restore a few TB of backup consisting of
>> lots of small files. :-(
> Is the issue backing up or restoring ?
The main issue is backing up every day, even many times a day. But for 
me there's no value in a speedy backup I cannot restore reliably, and 
not just from the computer's standpoint: someone has to find which 
backup sets are needed to do the restore, and they need to be able to 
check those backup sets before or during the restore.


>   If it is backing up then it may be
> possible to work out which blocks are different between two snapshots and
> transfer just those.

How? Can anyone on the list provide hints?
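
The only brute-force approach I can think of, with nothing LVM-specific 
in it: keep the previous snapshot around, take a new one, and compare 
the two block devices chunk by chunk; the chunks that differ (plus their 
offsets) are the incremental backup. It still reads both devices end to 
end, so it saves transfer and backup size, not local I/O. A rough sketch 
of the comparison:

/* Compare two block devices (e.g. yesterday's and today's LVM
 * snapshots) in 1 MiB chunks and report the offsets that differ.
 * Those chunks plus their offsets are the incremental backup;
 * restoring means writing them back at the same offsets over a copy
 * of the older image.  Rough sketch, minimal error handling. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s OLD_DEV NEW_DEV\n", argv[0]); return 1; }

    int fd_old = open(argv[1], O_RDONLY);
    int fd_new = open(argv[2], O_RDONLY);
    if (fd_old < 0 || fd_new < 0) { perror("open"); return 1; }

    char *a = malloc(CHUNK), *b = malloc(CHUNK);
    unsigned long long offset = 0, changed = 0;
    ssize_t na, nb;

    for (;;) {
        na = read(fd_old, a, CHUNK);
        nb = read(fd_new, b, CHUNK);
        if (na <= 0 || nb <= 0) break;          /* end of the shorter device */
        if (na != nb || memcmp(a, b, (size_t)na) != 0) {
            printf("changed chunk at offset %llu\n", offset);
            changed++;
            /* a real tool would write the offset plus the chunk from
             * NEW_DEV to the backup stream here */
        }
        offset += (unsigned long long)na;
    }
    fprintf(stderr, "%llu changed chunks of %d bytes\n", changed, CHUNK);
    free(a); free(b);
    close(fd_old); close(fd_new);
    return 0;
}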


> I don't know the innards of the LVM layer well
> enough to know if there is a clever way to do that. I'm also not sure it
> would help if the blocks are scattered about as it would still be a lot
> of seeking.
That "clever way" seems to be what commercial tools promise, but they 
don't tell me what they use: which kernel API, their own driver, or if 
they work only this or that network storage... :-( I don't trust 
anything I can't understand how it works. All "magical" solutions I 
found previoulsy proved to be no solution at all.

I'm seeing that the file tree walk takes too long just to find that 
most files weren't changed, even relying on last modification times, so 
if I could get a list of blocks to back up it should be faster (fewer 
disk seeks).

It shouldn't be too hard to implement a daemon using inotify and some 
queueing strategy to deal with changed file blocks, add metadata, then 
compress and send elsewhere. On the same machine, if I read a changed 
block I should get its correct data, even if it wasn't synced to disk 
yet. But I can't find anyone who did this as open source, so maybe 
there's some problem I haven't seen yet. And it would take me too long 
to implement and debug alone. Any developers out there looking for 
alpha testers for their new, revolutionary backup tool? ;-)
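
For what it's worth, the queueing side doesn't look like the hard part. 
A skeleton of what I have in mind, assuming the watcher (say, the 
inotify loop above) feeds changed file names into a queue and a worker 
thread drains it; the actual compression and shipping are stubbed out:

/* Skeleton of the daemon's queueing side: the watcher thread calls
 * enqueue() with changed file names, the worker thread drains the
 * queue and would compress and ship each file.  POSIX threads only;
 * the real work is a placeholder printf. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct node { char *path; struct node *next; };

static struct node *head, *tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

void enqueue(const char *path)            /* called from the watcher thread */
{
    struct node *n = malloc(sizeof *n);
    n->path = strdup(path);
    n->next = NULL;
    pthread_mutex_lock(&lock);
    if (tail) tail->next = n; else head = n;
    tail = n;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

void *worker(void *arg)                   /* drains the queue */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!head)
            pthread_cond_wait(&nonempty, &lock);
        struct node *n = head;
        head = n->next;
        if (!head) tail = NULL;
        pthread_mutex_unlock(&lock);

        /* placeholder: read the file, add metadata, compress, send */
        printf("would back up: %s\n", n->path);
        free(n->path);
        free(n);
    }
    return NULL;
}

int main(void)                            /* tiny self-test */
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    enqueue("/etc/hosts");
    enqueue("/etc/fstab");
    sleep(1);                             /* let the worker drain the queue */
    return 0;
}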


[]s, Fernando Lozano


