ftpsync - synchronize files

ftpsync synchronizes a local directory to a directory on an FTP server.

More precisely, ftpsync does bi-directional syncing between client and server with simple conflict resolution. Syncing is also not limited to one client and one server: ftpsync supports multiple nodes syncing to the same server and/or syncing to multiple servers.

This text explains some ideas behind the script and how it's configured and used.

Requirements

The following things are required for ftpsync:

Client side
You need gawk version 3.1.0 or above, connect version 1.0.2 or above, and the gawk file extension (awk.file.so) version 1.0.0 or above.

Server side
ftpsync can work with any FTP server that supports the usual FTP commands as given in RFC 959 and additionally the SIZE and MDTM commands.

The server side requirements are pretty common these days. You can expect your server to fulfill them.

On the client side, first run "gawk --version" to check whether gawk-3.1.x is already installed. If not, download it from www.gnu.org and install it.

The other two prerequisites can be found on www.awk-scripting.de. They are simple to install. After you have compiled awk.file.so, copy it to /usr/local/lib to install it. If you decide on another location you'll have to edit the ftpsync script. connect is compiled and installed under /usr/local/bin by a "make install".

One way synching

Let's look at the basic "theory of operation". There is a directory on the local machine and another on a remote server. In the first run all local files are copied to the remote directory. In each later run only the files that have been modified since the previous run should be copied to the remote system. That is, the syncer copies only those files that have to be copied to update the remote end.

Obviously our syncer has to keep track of each file's size and modification date after it has been copied to the remote server. If the syncer finds in a later run that a file's size or modification date has changed, the file was modified and has to be copied again.

If we store our current status information in a text file, with one line per file in the format "filename <tab> size <blank> mtime", the following function readsyncinfo reads the previous (!) file information into an associative array.

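# Read the stored file information, one "filename <tab> size <blank> mtime"
# line per file, into the associative array list.
function readsyncinfo(filename, list,   line, x) {
	while ((getline line < filename) > 0) {
		split(line, x, /\t+/);
		list[x[1]] = x[2];
		}

	close(filename);
	return (0);
	}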
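The status file also has to be written after each run. The following is a minimal sketch of such a counterpart, assuming the same line format; the name writesyncinfo is hypothetical and not taken from the ftpsync script.

```awk
# Hypothetical counterpart to readsyncinfo: write one line per file
# in the format "filename <tab> size <blank> mtime".
function writesyncinfo(filename, list,   file) {
	# Truncate the file first so that an empty list yields an empty file.
	printf "" > filename;
	for (file in list)
		printf "%s\t%s\n", file, list[file] > filename;

	close(filename);
	return (0);
	}
```

Because the filename and the "size mtime" value are separated by a tab, filenames containing blanks survive a write/read round trip.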

Computing the current file information means getting the file's information from its directory entry. The following code reads this information for all files in the current directory into the currentnode array.

dirlist[""] = sbuf[""] = "";
n = scandir(".", dirlist);
for (i=1; i<=n; i++) {
	# Skip entries that cannot be stat'ed or are not regular files.
	if (stat(dirlist[i], sbuf) != 0)
		continue;
	else if (sbuf["type"] != "file")
		continue;

	if (excludefile(dirlist[i]))
		continue;

	# Store size and mtime as one string so they can be compared at once.
	currentnode[dirlist[i]] = sbuf["size"] " " sbuf["mtime"];
	}

It's perhaps worth noting here that we do not have to compare the size and modification time values individually. We can directly compare a file's currentnode value against the previous file information we obtained with the readsyncinfo function.

Suppose we used "readsyncinfo(..., prevnode)" (ignore the filename parameter for the moment) to read the stored file information. Then we can easily compute each file's sync status:

for (file in prevnode) {
	if (! (file in currentnode))
		status[file] = "deleted";
	else if (prevnode[file] == currentnode[file])
		status[file] = "unchanged";
	else
		status[file] = "changed";
	}

The comparison "prevnode[file] == currentnode[file]" decides whether file needs to be updated or not. If file's size and/or modification time is different, file's values in the prevnode and currentnode arrays differ and therefore its status is set to changed.

The above code also determines whether a file was deleted. In this case we have the file's entry in our prevnode array but not in currentnode. And, to be complete, we also have to do

for (file in currentnode) {
	if (! (file in prevnode))
		status[file] = "changed";
	}

to assign the changed status to new files.
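The two loops above can be combined into one function and tried out on a few made-up entries. This is only a sketch: the function name computestatus and the sample values are assumptions for illustration.

```awk
# Compare the stored file information (prev) against the current one (cur)
# and fill status[] with "deleted", "unchanged" or "changed".
function computestatus(prev, cur, status,   file) {
	for (file in prev) {
		if (! (file in cur))
			status[file] = "deleted";
		else if (prev[file] == cur[file])
			status[file] = "unchanged";
		else
			status[file] = "changed";
		}

	# Files that are new since the last run count as changed.
	for (file in cur) {
		if (! (file in prev))
			status[file] = "changed";
		}
	}

BEGIN {
	# Made-up "size mtime" values.
	prev["a.txt"] = "100 1000";  cur["a.txt"] = "100 1000";
	prev["b.txt"] = "200 1000";  cur["b.txt"] = "250 2000";
	prev["c.txt"] = "300 1000";
	cur["d.txt"] = "400 3000";

	computestatus(prev, cur, status);
	print status["a.txt"], status["b.txt"], status["c.txt"], status["d.txt"];
	# prints: unchanged changed deleted changed
	}
```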

OK, let's pause for a moment and think about it. What we have now is something for mirroring. We can compute what happened to our files and whether we have to update or delete them on our FTP server. This is interesting but it's not syncing.

Two way synchronization

For true synchronization we have to consider the other end. Files might also be modified or deleted there. In this case we have to update our local files by either getting the updated file from the server or deleting our local copy.

Basically we do the same thing for the remote server that we did for our local files. That is, we keep track of the files' sizes and modification times and we retrieve the current file information from the server to compute the remote files' status. Only the way we get the current file information is different, since we cannot simply stat() the files.

Since I didn't want to deal with the FTP server's LIST format I implemented this using NLST, SIZE and MDTM:

function readserverinfo(ftpd, dir, list,   file, line, data, dirlen) {
	delete list;

	#
	# Retrieve the list of all files and directories ...
	#

	portcmd = doport(ftpd, "");

	cfputc(ftpd, "NLST", ".", 150);
	portcmd |& getline line;

	dirlen = 0;
	if (dir != "") {
		dir = dir "/";
		dirlen = length(dir);
		}

	while (portcmd |& getline file) {
		file = noctrl(file);
		if (dirlen > 0  &&  substr(file, 1, dirlen) == dir)
			file = substr(file, dirlen + 1);

		if (excludefile(file))
			continue;

		list[file] = "";
		}

	close (portcmd);
	cfputc(ftpd, "", "", 226);


	#
	# ... and collect SIZE and MDTM for each file.
	#

	for (file in list) {
		if ((line = cfputc(ftpd, "SIZE", file, -213))+0 != 213) {
			delete list[file];
			continue;
			}
		else {
			sub(/^[^ \t]+[ \t]+/, "", line);
			data = line;
			}

		if ((line = cfputc(ftpd, "MDTM", file, -213))+0 != 213) {
			delete list[file];
			continue;
			}
		else {
			sub(/^[^ \t]+[ \t]+/, "", line);
			data = data " " line;
			}

		list[file] = data;
		}

	return (0);
	}

Equipped with this information we can compute the status (changed, unchanged or deleted) for each of the remote files.
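The value stored for each remote file is built from the server's replies with the reply code stripped off. The sketch below isolates that step; the function name replyvalue and the two sample replies are made up for illustration and merely follow the usual "213 value" reply form.

```awk
# Strip the leading reply code from an FTP reply line and return the rest.
function replyvalue(line) {
	sub(/^[^ \t]+[ \t]+/, "", line);
	return (line);
	}

BEGIN {
	# Illustrative replies, not captured from a real server.
	size = replyvalue("213 4096");
	mtime = replyvalue("213 20030115093000");

	# Same "size <blank> mtime" layout as used for the local files.
	print size " " mtime;
	# prints: 4096 20030115093000
	}
```

Note that the remote modification time is a timestamp string while the local one comes from stat(). This does not matter here because local and remote values are only ever compared against their own previous values, never against each other.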

Synchronising

Now that we have all the information we need, we can define what to do depending on the local and remote file status.


local/remote     unchanged   changed     deleted   doesn't exist
unchanged        nothing     get         remove    put
changed          put         duplicate   put       put
deleted          remove      get         ignore    ignore
doesn't exist    get         get         ignore    ignore

To explain this table: the rows show the local and the columns the remote file status; the values inside the table are the actions as seen by the node running ftpsync. E.g. get means that the file is retrieved from the FTP server. The two uses of the remove action are not exact: the table doesn't say whether the file has to be deleted locally or on the server. The file is removed on whichever side still has it: locally if it was deleted on the server, on the server if it was deleted locally.

The "doesn't exist" status means that a certain file does not exist on one side, neither as a file in the current directory nor in the previous status file. This happens if a file was created on one side and the directories have not been synchronized since. The usual action is then to copy the file to the other side.

It's perhaps more difficult if the file does not exist on one side and is on the delete list of the other side. Under usual circumstances this cannot happen, but the synchronizer has to deal with it. Well, it's simple: we would have to delete a file that is already deleted on one end and does not exist on the other, so the right action is to ignore it. The other ignores also refer to situations where nothing has to be done but (in contrast to the unchanged/unchanged nothing) it's not clear how the system entered this state.
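The action table lends itself to a two-dimensional associative array indexed by local and remote status. The following is a sketch under that assumption, not the way the ftpsync script necessarily stores it; the function name initactions and the key "none" (standing for "doesn't exist") are chosen for illustration.

```awk
# Fill table[] with the actions from the table above.
# Keys are local status and remote status; "none" stands for "doesn't exist".
function initactions(table,   rows, cols, acts, i, j) {
	split("unchanged changed deleted none", cols, " ");
	rows[1] = "nothing get remove put";
	rows[2] = "put duplicate put put";
	rows[3] = "remove get ignore ignore";
	rows[4] = "get get ignore ignore";

	for (i = 1; i <= 4; i++) {
		split(rows[i], acts, " ");
		for (j = 1; j <= 4; j++)
			table[cols[i], cols[j]] = acts[j];
		}
	}

BEGIN {
	initactions(action);
	print action["changed", "changed"];      # duplicate
	print action["unchanged", "deleted"];    # remove (the local copy)
	print action["none", "changed"];         # get
	}
```

Keeping the actions in a table rather than in nested if statements makes it easy to swap in a different table for another operation mode.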

Version conflicts

More interesting is the duplicate action. In this case we have two changed copies, one local and one on the server. What now? How can this conflict be resolved, which copy wins? The answer is that both win. If a duplicate situation is recognized the server's file is retrieved but the server's node name is appended to the filename to show that this file is the server copy. The server receives the local file but again the name is modified. This time the peer's name is appended to the filename. In other words: both sides keep their copy and receive the other end's version with a different filename. It's then up to the user to decide which of the versions is better. These conflict resolution files are not versioned, they are overwritten on the next conflict situation.
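As a small illustration of the naming scheme, the sketch below builds the two conflict filenames. The helper name conflictname and the "." separator are assumptions; the text above only says that the node name is appended. The sketch also assumes that the name appended on the server side is the local node's name.

```awk
# Hypothetical helper: the name under which the other side's copy is stored.
# The "." separator is an assumption.
function conflictname(file, node) {
	return (file "." node);
	}

BEGIN {
	nodename = "pc";
	peer = "server";
	file = "notes.txt";

	# Locally: keep notes.txt and fetch the server's copy under the peer's name.
	print "get " file " as " conflictname(file, peer);
	# On the server: keep notes.txt and store our copy under our node name.
	print "put " file " as " conflictname(file, nodename);
	}
```

Since every node appends its own unique name, conflict copies from different nodes do not overwrite each other on the server.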

Symmetry

The action table above is symmetric. This means that neither side is preferred. Basically both sides could run the synchronizer, swapping the client and server roles. The conflict resolver is also symmetric; more than this, it's "multi-symmetric". If you have a given number of nodes synchronizing with the same server, each node has its own conflict resolution which does not interfere with another node's resolution. The only additional requirement is that each node has its own unique name.

Variants

There are some possible modifications to the action table above. The unchanged/deleted/remove entry (abbreviated udR) could be changed to udP (put instead of remove) and ccD (changed/changed/duplicate) could become ccP. With these two changes the synchronizer becomes a simple backup program: "backup" because files that need to be stored on the server are uploaded to it (files that were deleted or changed on the server are refreshed), and "simple" because there is no file versioning.


local/remote     unchanged   changed   deleted   doesn't exist
unchanged        nothing     get       put       put
changed          put         put       put       put
deleted          remove      get       ignore    ignore
doesn't exist    get         get       ignore    ignore

Changing the symmetric entries duR and ccD to duG and ccG would make the system running ftpsync the FTP server's simple backup system. I call these two modes "master" and "slave" mode.
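Both variants differ from the sync table in only two entries, so they can be derived from it instead of being stored separately. The sketch below assumes the table is held in an associative array; the function names synctable and applymode and the key "none" (for "doesn't exist") are hypothetical.

```awk
# The sync action table; "none" stands for "doesn't exist".
function synctable(table,   rows, cols, acts, i, j) {
	split("unchanged changed deleted none", cols, " ");
	rows[1] = "nothing get remove put";
	rows[2] = "put duplicate put put";
	rows[3] = "remove get ignore ignore";
	rows[4] = "get get ignore ignore";

	for (i = 1; i <= 4; i++) {
		split(rows[i], acts, " ");
		for (j = 1; j <= 4; j++)
			table[cols[i], cols[j]] = acts[j];
		}
	}

# Turn the sync table into the "master" or "slave" variant.
function applymode(table, mode) {
	if (mode == "master") {
		table["unchanged", "deleted"] = "put";    # udR -> udP
		table["changed", "changed"] = "put";      # ccD -> ccP
		}
	else if (mode == "slave") {
		table["deleted", "unchanged"] = "get";    # duR -> duG
		table["changed", "changed"] = "get";      # ccD -> ccG
		}
	}

BEGIN {
	synctable(t);
	applymode(t, "master");
	print t["unchanged", "deleted"], t["changed", "changed"];
	# prints: put put
	}
```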

The synchronizer can also be downgraded to a mirror (FTP server to local) program by changing the action table to the following.


local/remote     unchanged   changed   deleted   doesn't exist
unchanged        nothing     get       remove    ignore
changed          get         get       remove    ignore
deleted          get         get       ignore    ignore
doesn't exist    get         get       ignore    ignore

I call this the "mirror" mode. If we apply the symmetric changes to the action table we get the "original" (mirror local to FTP server) mode.

Another thing that could be considered is how file removals are done. Instead of deleting them they could be moved into a .deleted folder, with or without versioning.

Configuration and usage

ftpsync needs a configuration file to synchronize a directory. This file is usually named .sync.conf and located in the directory that should be synced. The file follows the typical UN*X style: comments starting with a "#" are allowed, and so are empty lines. All other lines have the form "key value" with whitespace between key and value.

The configuration parameters are:


nodename
The name of the local host running ftpsync. This doesn't have to be the host's DNS hostname; it can be anything as long as it's unique among all nodes syncing to the same server location. You should use only letters, digits and dashes (minus signs) here. "name" and "node" are possible aliases for "nodename".

peer
The FTP server's name. Again you don't have to enter the server's DNS name here (although you can). Choose any name you like as long as it contains only letters, digits and dashes. "peername" and "remote" are aliases for "peer".

server
Either the peer's fully qualified domain name or its IP address.

login
The login on the FTP server.

password
The password belonging to login.

Optional parameters

dir
The directory on the FTP server that you want to synchronize to. If unset, the login's home directory is used.

includedots
Can be "yes" or "no". If set to "yes", files beginning with a dot are also subject to synchronization. Files beginning with ".sync." or ".sync_" are still excluded.

allowblanks
Can be "yes" or "no". If set to "yes", files that have blanks in their names are also synchronized.

mode
Defines ftpsync's default operation mode. It can be one of "sync", "master", "slave", "original" or "mirror". The default value is "sync".

symsync
Can be "yes" or "no". If set to "yes", the file states are copied and swapped to the server. With "symsync" set to "yes", client and server can swap roles in later ftpsync runs.
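The includedots and allowblanks rules can be read as a filename filter. The following sketch shows one possible reading; the function name isexcluded and its parameters are hypothetical (the script's own excludefile takes only the filename), and treating every value other than "yes" as "no" is an assumption.

```awk
# Return 1 if file should be left out of the synchronization, else 0.
function isexcluded(file, includedots, allowblanks) {
	# ftpsync's own files are always excluded.
	if (file ~ /^\.sync[._]/)
		return (1);

	# Dot files are only synced if includedots is "yes".
	if (file ~ /^\./  &&  includedots != "yes")
		return (1);

	# Names with blanks are only synced if allowblanks is "yes".
	if (file ~ / /  &&  allowblanks != "yes")
		return (1);

	return (0);
	}
```

With includedots set to "yes", a file such as .profile is synced while .sync.conf is still skipped.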

An example for a configuration file is

#
# .sync.conf - ftpsync configuration file.
#

nodename	pc
remote		server

server		192.168.0.4
login		my-ftp-account
password	my-secret-password

dir		sync
includedots	yes
allowblanks	no
mode		sync
symsync		yes
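A file in this format is easy to read with a few lines of awk. The sketch below is not the parser from the ftpsync script; the function name readconf is hypothetical, and it assumes that comments occupy whole lines and that the value is everything after the first run of whitespace.

```awk
# Read a "key value" configuration file into conf[]. Returns the number
# of settings read, or -1 if the file cannot be read.
function readconf(filename, conf,   line, key, n, rc) {
	n = 0;
	while ((rc = (getline line < filename)) > 0) {
		# Trim surrounding whitespace, skip empty lines and comments.
		sub(/^[ \t]+/, "", line);
		sub(/[ \t\r]+$/, "", line);
		if (line == ""  ||  substr(line, 1, 1) == "#")
			continue;

		# Split into key and value at the first run of whitespace.
		key = line;
		sub(/[ \t].*$/, "", key);
		sub(/^[^ \t]+[ \t]*/, "", line);

		# Map the documented aliases to their canonical key.
		if (key == "name"  ||  key == "node")
			key = "nodename";
		else if (key == "peername"  ||  key == "remote")
			key = "peer";

		conf[key] = line;
		n++;
		}

	close(filename);
	return (rc < 0 ? -1 : n);
	}
```

Reading the example file above with readconf(".sync.conf", conf) would leave "server" in conf["peer"], because "remote" is an alias for "peer".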

Invocation

ftpsync [options] directory [server]
synchronizes directory with the FTP server configured in directory/.sync.conf. If the optional server argument is given the file directory/.sync-server.conf is used instead.

ftpsync supports the following command line options:

-a
usually ftpsync prints only a line for each file that was modified on either side. With the -a option a line for every file is printed, even the unmodified files.

-d
debug-mode: this option makes ftpsync print the whole FTP conversation to its stderr.

-l
list-only: ftpsync will only list what would be done without doing it.

-p password
sets the FTP server's password on the command line. The -p option overrides the password given in the configuration file.

-s mode
changes the sync mode to mode, which can be one of sync, master, slave, original or mirror, to use a different action matrix (see above). This automatically turns list-only mode on. Add the -y option to actually run a sync in the given mode.

-y
ftpsync switches automatically to the list-only mode if the sync mode is given with the -s option on the command line. The option -y must be added to confirm this.

What is not working

ftpsync does not recurse into subdirectories. The reason for this (beyond the additional complexity, which could be implemented) is that ftpsync is not yet able to determine whether something on the FTP server is a regular file or a directory. The way ftpsync decides whether something is a normal file is simple: if the SIZE and MDTM calls succeed on a remote name it is a file, otherwise it's not. This assumes that either SIZE or MDTM fails for non-regular files, which seems to be true for the normal FTP server (but this assumption might be the reason why ftpsync does not work in your particular setup).

Now let's look at the case when SIZE and/or MDTM fails. Is the remote object then a directory, a symbolic link, a device file or something else? This can't be determined with the current implementation. OK, I know there are clever FTP clients that are able to parse the LIST format of a dozen different FTP server types, but I don't think that this is the right way to deal with this question. ftpsync should instead be rewritten to work with the (hopefully) upcoming MLST/MLSD standard commands, because with these the file type can be requested (together with other information) in a defined machine-readable format. But notice that this rewrite implies that your server implements and understands these commands, which may not be true.

Originally I wrote ftpsync to synchronize files between two wiki servers, one running on my Linux computer at home and the other on my Internet server. Surprisingly, it doesn't work out of the box. Not because ftpsync is not fully working; it's a problem of permissions and file ownership.

If I sync my local files to the remote server they are owned by me (or rather: my remote FTP account). If the HTTP server then runs with a different user id, which is absolutely common, the HTTP server can't write the files. Or consider that the wiki script on the HTTP server creates a new file. When I log in with my regular FTP account I will not be able to overwrite or delete this file because of missing permissions. The only way to really solve this is either to log in to the FTP server under the HTTP server's account or to change the HTTP server's user id to my FTP account. But I don't expect that the average provider will configure either of these solutions.