mirror of
https://github.com/RsyncProject/rsync.git
synced 2026-06-08 14:15:46 -04:00
Remove obsolete design notes
rsync3.txt and rsyncsh.txt are Martin Pool's 2001 design proposals
("notes towards a new version of rsync", an interactive rsync shell),
neither of which reflects the current implementation. doc/profile.txt is
stale profiling notes. None are referenced by the build, tests, or docs.
This commit is contained in:
@@ -1,42 +0,0 @@
|
||||
Notes on rsync profiling
|
||||
|
||||
strlcpy is hot:
|
||||
|
||||
0.00 0.00 1/7735635 push_dir [68]
|
||||
0.00 0.00 1/7735635 pop_dir [71]
|
||||
0.00 0.00 1/7735635 send_file_list [15]
|
||||
0.01 0.00 18857/7735635 send_files [4]
|
||||
0.04 0.00 129260/7735635 send_file_entry [18]
|
||||
0.04 0.00 129260/7735635 make_file [20]
|
||||
0.04 0.00 141666/7735635 send_directory <cycle 1> [36]
|
||||
2.29 0.00 7316589/7735635 f_name [13]
|
||||
[14] 11.7 2.42 0.00 7735635 strlcpy [14]
|
||||
|
||||
|
||||
Here's the top few functions:
|
||||
|
||||
46.23 9.57 9.57 13160929 0.00 0.00 mdfour64
|
||||
14.78 12.63 3.06 13160929 0.00 0.00 copy64
|
||||
11.69 15.05 2.42 7735635 0.00 0.00 strlcpy
|
||||
10.05 17.13 2.08 41438 0.05 0.38 sum_update
|
||||
4.11 17.98 0.85 13159996 0.00 0.00 mdfour_update
|
||||
1.50 18.29 0.31 file_compare
|
||||
1.45 18.59 0.30 129261 0.00 0.01 send_file_entry
|
||||
1.23 18.84 0.26 2557585 0.00 0.00 f_name
|
||||
1.11 19.07 0.23 1483750 0.00 0.00 u_strcmp
|
||||
1.11 19.30 0.23 118129 0.00 0.00 writefd_unbuffered
|
||||
0.92 19.50 0.19 1085011 0.00 0.00 writefd
|
||||
0.43 19.59 0.09 156987 0.00 0.00 read_timeout
|
||||
0.43 19.68 0.09 129261 0.00 0.00 clean_fname
|
||||
0.39 19.75 0.08 32887 0.00 0.38 matched
|
||||
0.34 19.82 0.07 1 70.00 16293.92 send_files
|
||||
0.29 19.89 0.06 129260 0.00 0.00 make_file
|
||||
0.29 19.95 0.06 75430 0.00 0.00 read_unbuffered
|
||||
|
||||
|
||||
|
||||
mdfour could perhaps be made faster:
|
||||
|
||||
/* NOTE: This code makes no attempt to be fast! */
|
||||
|
||||
There might be an optimized version somewhere that we can borrow.
|
||||
467
rsync3.txt
467
rsync3.txt
@@ -1,467 +0,0 @@
|
||||
-*- indented-text -*-
|
||||
|
||||
Notes towards a new version of rsync
|
||||
Martin Pool <mbp@samba.org>, September 2001.
|
||||
|
||||
|
||||
Good things about the current implementation:
|
||||
|
||||
- Widely known and adopted.
|
||||
|
||||
- Fast/efficient, especially for moderately small sets of files over
|
||||
slow links (transoceanic or modem.)
|
||||
|
||||
- Fairly reliable.
|
||||
|
||||
- The choice of running over a plain TCP socket or tunneling over
|
||||
ssh.
|
||||
|
||||
- rsync operations are idempotent: you can always run the same
|
||||
command twice to make sure it worked properly without any fear.
|
||||
(Are there any exceptions?)
|
||||
|
||||
- Small changes to files cause small deltas.
|
||||
|
||||
- There is a way to evolve the protocol to some extent.
|
||||
|
||||
- rdiff and rsync --write-batch allow generation of standalone patch
|
||||
sets. rsync+ is pretty cheesy, though. xdelta seems cleaner.
|
||||
|
||||
- Process triangle is creative, but seems to provoke OS bugs.
|
||||
|
||||
- "Morning-after property": you don't need to know anything on the
|
||||
local machine about the state of the remote machine, or about
|
||||
transfers that have been done in the past.
|
||||
|
||||
- You can easily push or pull simply by switching the order of
|
||||
files.
|
||||
|
||||
- The "modules" system has some neat features compared to
|
||||
e.g. Apache's per-directory configuration. In particular, because
|
||||
you can set a userid and chroot directory, there is strong
|
||||
protection between different modules. I haven't seen any calls
|
||||
for a more flexible system.
|
||||
|
||||
|
||||
Bad things about the current implementation:
|
||||
|
||||
- Persistent and hard-to-diagnose hang bugs remain
|
||||
|
||||
- Protocol is sketchily documented, tied to this implementation, and
|
||||
hard to modify/extend
|
||||
|
||||
- Both the program and the protocol assume a single non-interactive
|
||||
one-way transfer
|
||||
|
||||
- A list of all files are held in memory for the entire transfer,
|
||||
which cripples scalability to large file trees
|
||||
|
||||
- Opening a new socket for every operation causes problems,
|
||||
especially when running over SSH with password authentication.
|
||||
|
||||
- Renamed files are not handled: the old file is removed, and the
|
||||
new file created from scratch.
|
||||
|
||||
- The versioning approach assumes that future versions of the
|
||||
program know about all previous versions, and will do the right
|
||||
thing.
|
||||
|
||||
- People always get confused about ':' vs '::'
|
||||
|
||||
- Error messages can be cryptic.
|
||||
|
||||
- Default behaviour is not intuitive: in too many cases rsync will
|
||||
happily do nothing. Perhaps -a should be the default?
|
||||
|
||||
- People get confused by trailing slashes, though it's hard to think
|
||||
of another reasonable way to make this necessary distinction
|
||||
between a directory and its contents.
|
||||
|
||||
|
||||
Protocol philosophy:
|
||||
|
||||
*The* big difference between protocols like HTTP, FTP, and NFS is
|
||||
that their fundamental operations are "read this file", "delete
|
||||
this file", and "make this directory", whereas rsync is "make this
|
||||
directory like this one".
|
||||
|
||||
|
||||
Questionable features:
|
||||
|
||||
These are neat, but not necessarily clean or worth preserving.
|
||||
|
||||
- The remote rsync can be wrapped by some other program, such as in
|
||||
tridge's rsync-mail scripts. The general feature of sending and
|
||||
retrieving mail over rsync is good, but this is perhaps not the
|
||||
right way to implement it.
|
||||
|
||||
|
||||
Desirable features:
|
||||
|
||||
These don't really require architectural changes; they're just
|
||||
something to keep in mind.
|
||||
|
||||
- Synchronize ACLs and extended attributes
|
||||
|
||||
- Anonymous servers should be efficient
|
||||
|
||||
- Code should be portable to non-UNIX systems
|
||||
|
||||
- Should be possible to document the protocol in RFC form
|
||||
|
||||
- --dry-run option
|
||||
|
||||
- IPv6 support. Pretty straightforward.
|
||||
|
||||
- Allow the basis and destination files to be different. For
|
||||
example, you could use this when you have a CD-ROM and want to
|
||||
download an updated image onto a hard drive.
|
||||
|
||||
- Efficiently interrupt and restart a transfer. We can write a
|
||||
checkpoint file that says where we're up to in the filesystem.
|
||||
Alternatively, as long as transfers are idempotent, we can just
|
||||
restart the whole thing. [NFSv4]
|
||||
|
||||
- Scripting support.
|
||||
|
||||
- Propagate atimes and do not modify them. This is very ugly on
|
||||
Unix. It might be better to try to add O_NOATIME to kernels, and
|
||||
call that.
|
||||
|
||||
- Unicode. Probably just use UTF-8 for everything.
|
||||
|
||||
- Open authentication system. Can we use PAM? Is SASL an adequate
|
||||
mapping of PAM to the network, or useful in some other way?
|
||||
|
||||
- Resume interrupted transfers without the --partial flag. We need
|
||||
to leave the temporary file behind, and then know to use it. This
|
||||
leaves a risk of large temporary files accumulating, which is not
|
||||
good. Perhaps it should be off by default.
|
||||
|
||||
- tcpwrappers support. Should be trivial; can already be done
|
||||
through tcpd or inetd.
|
||||
|
||||
- Socks support built in. It's not clear this is any better than
|
||||
just linking against the socks library, though.
|
||||
|
||||
- When run over SSH, invoke with predictable command-line arguments,
|
||||
so that people can restrict what commands sshd will run. (Is this
|
||||
really required?)
|
||||
|
||||
- Comparison mode: give a list of which files are new, gone, or
|
||||
different. Set return code depending on whether anything has
|
||||
changed.
|
||||
|
||||
- Internationalized messages (gettext?)
|
||||
|
||||
- Optionally use real regexps rather than globs?
|
||||
|
||||
- Show overall progress. Pretty hard to do, especially if we insist
|
||||
on not scanning the directory tree up front.
|
||||
|
||||
|
||||
Regression testing:
|
||||
|
||||
- Support automatic testing.
|
||||
|
||||
- Have hard internal timeouts against hangs.
|
||||
|
||||
- Be deterministic.
|
||||
|
||||
- Measure performance.
|
||||
|
||||
|
||||
Hard links:
|
||||
|
||||
At the moment, we can recreate hard links, but it's a bit
|
||||
inefficient: it depends on holding a list of all files in the tree.
|
||||
Every time we see a file with a linkcount >1, we need to search for
|
||||
another known name that has the same (fsid,inum) tuple. We could do
|
||||
that more efficiently by keeping a list of only files with
|
||||
linkcount>1, and removing files from that list as all their names
|
||||
become known.
|
||||
|
||||
|
||||
Command-line options:
|
||||
|
||||
We have rather a lot at the moment. We might get more if the tool
|
||||
becomes more flexible. Do we need a .rc or configuration file?
|
||||
That wouldn't really fit with its pattern of use: cp and tar don't
|
||||
have them, though ssh does.
|
||||
|
||||
|
||||
Scripting issues:
|
||||
|
||||
- Perhaps support multiple scripting languages: candidates include
|
||||
Perl, Python, Tcl, Scheme (guile?), sh, ...
|
||||
|
||||
- Simply running a subprocess and looking at its stdout/exit code
|
||||
might be sufficient, though it could also be pretty slow if it's
|
||||
called often.
|
||||
|
||||
- There are security issues about running remote code, at least if
|
||||
it's not running in the users own account. So we can either
|
||||
disallow it, or use some kind of sandbox system.
|
||||
|
||||
- Python is a good language, but the syntax is not so good for
|
||||
giving small fragments on the command line.
|
||||
|
||||
- Tcl is broken Lisp.
|
||||
|
||||
- Lots of sysadmins know Perl, though Perl can give some bizarre or
|
||||
confusing errors. The built in stat operators and regexps might
|
||||
be useful.
|
||||
|
||||
- Sadly probably not enough people know Scheme.
|
||||
|
||||
- sh is hard to embed.
|
||||
|
||||
|
||||
Scripting hooks:
|
||||
|
||||
- Whether to transfer a file
|
||||
|
||||
- What basis file to use
|
||||
|
||||
- Logging
|
||||
|
||||
- Whether to allow transfers (for public servers)
|
||||
|
||||
- Authentication
|
||||
|
||||
- Locking
|
||||
|
||||
- Cache
|
||||
|
||||
- Generating backup path/name.
|
||||
|
||||
- Post-processing of backups, e.g. to do compression.
|
||||
|
||||
- After transfer, before replacement: so that we can spit out a diff
|
||||
of what was changed, or kick off some kind of reconciliation
|
||||
process.
|
||||
|
||||
|
||||
VFS:
|
||||
|
||||
Rather than talking straight to the filesystem, rsyncd talks through
|
||||
an internal API. Samba has one. Is it useful?
|
||||
|
||||
- Could be a tidy way to implement cached signatures.
|
||||
|
||||
- Keep files compressed on disk?
|
||||
|
||||
|
||||
Interactive interface:
|
||||
|
||||
- Something like ncFTP, or integration into GNOME-vfs. Probably
|
||||
hold a single socket connection open.
|
||||
|
||||
- Can either call us as a separate process, or as a library.
|
||||
|
||||
- The standalone process needs to produce output in a form easily
|
||||
digestible by a calling program, like the --emacs feature some
|
||||
have. Same goes for output: rpm outputs a series of hash symbols,
|
||||
which are easier for a GUI to handle than "\r30% complete"
|
||||
strings.
|
||||
|
||||
- Yow! emacs support. (You could probably build that already, of
|
||||
course.) I'd like to be able to write a simple script on a remote
|
||||
machine that rsyncs it to my workstation, edits it there, then
|
||||
pushes it back up.
|
||||
|
||||
|
||||
Pie-in-the-sky features:
|
||||
|
||||
These might have a severe impact on the protocol, and are not
|
||||
clearly in our core requirements. It looks like in many of them
|
||||
having scripting hooks will allow us
|
||||
|
||||
- Transport over UDP multicast. The hard part is handling multiple
|
||||
destinations which have different basis files. We can look at
|
||||
multicast-TFTP for inspiration.
|
||||
|
||||
- Conflict resolution. Possibly general scripting support will be
|
||||
sufficient.
|
||||
|
||||
- Integrate with locking. It's hard to see a good general solution,
|
||||
because Unix systems have several locking mechanisms, and grabbing
|
||||
the lock from programs that don't expect it could cause deadlocks,
|
||||
timeouts, or other problems. Scripting support might help.
|
||||
|
||||
- Replicate in place, rather than to a temporary file. This is
|
||||
dangerous in the case of interruption, and it also means that the
|
||||
delta can't refer to blocks that have already been overwritten.
|
||||
On the other hand we could semi-trivially do this at first by
|
||||
simply generating a delta with no copy instructions.
|
||||
|
||||
- Replicate block devices. Most of the difficulties here are to do
|
||||
with replication in place, though on some systems we will also
|
||||
have to do I/O on block boundaries.
|
||||
|
||||
- Peer to peer features. Flavour of the year. Can we think about
|
||||
ways for clients to smoothly and voluntarily become servers for
|
||||
content they receive?
|
||||
|
||||
- Imagine a situation where the destination has a much faster link
|
||||
to the cloud than the source. In this case, Mojo Nation downloads
|
||||
interleaved blocks from several slower servers. The general
|
||||
situation might be a way for a master rsync process to farm out
|
||||
tasks to several subjobs. In this particular case they'd need
|
||||
different sockets. This might be related to multicast.
|
||||
|
||||
|
||||
Unlikely features:
|
||||
|
||||
- Allow remote source and destination. If this can be cleanly
|
||||
designed into the protocol, perhaps with the remote machine acting
|
||||
as a kind of echo, then it's good. It's uncommon enough that we
|
||||
don't want to shape the whole protocol around it, though.
|
||||
|
||||
In fact, in a triangle of machines there are two possibilities:
|
||||
all traffic passes from remote1 to remote2 through local, or local
|
||||
just sets up the transfer and then remote1 talks to remote2. FTP
|
||||
supports the second but it's not clearly good. There are some
|
||||
security problems with being able to instruct one machine to open
|
||||
a connection to another.
|
||||
|
||||
|
||||
In favour of evolving the protocol:
|
||||
|
||||
- Keeping compatibility with existing rsync servers will help with
|
||||
adoption and testing.
|
||||
|
||||
- We should at the very least be able to fall back to the new
|
||||
protocol.
|
||||
|
||||
- Error handling is not so good.
|
||||
|
||||
|
||||
In favour of using a new protocol:
|
||||
|
||||
- Maintaining compatibility might soak up development time that
|
||||
would better go into improving a new protocol.
|
||||
|
||||
- If we start from scratch, it can be documented as we go, and we
|
||||
can avoid design decisions that make the protocol complex or
|
||||
implementation-bound.
|
||||
|
||||
|
||||
Error handling:
|
||||
|
||||
- Errors should come back reliably, and be clearly associated with
|
||||
the particular file that caused the problem.
|
||||
|
||||
- Some errors ought to cause the whole transfer to abort; some are
|
||||
just warnings. If any errors have occurred, then rsync ought to
|
||||
return an error.
|
||||
|
||||
|
||||
Concurrency:
|
||||
|
||||
- We want to keep the CPU, filesystem, and network as full as
|
||||
possible as much of the time as possible.
|
||||
|
||||
- We can do nonblocking network IO, but not so for disk.
|
||||
|
||||
- It makes sense to on the destination be generating signatures and
|
||||
applying patches at the same time.
|
||||
|
||||
- Can structure this with nonblocking, threads, separate processes,
|
||||
etc.
|
||||
|
||||
|
||||
Uses:
|
||||
|
||||
- Mirroring software distributions:
|
||||
|
||||
- Synchronizing laptop and desktop
|
||||
|
||||
- NFS filesystem migration/replication. See
|
||||
http://www.ietf.org/proceedings/00jul/00july-133.htm#P24510_1276764
|
||||
|
||||
- Sync with PDA
|
||||
|
||||
- Network backup systems
|
||||
|
||||
- CVS filemover
|
||||
|
||||
|
||||
Conflict resolution:
|
||||
|
||||
- Requires application-specific knowledge. We want to provide
|
||||
policy, rather than mechanism.
|
||||
|
||||
- Possibly allowing two-way migration across a single connection
|
||||
would be useful.
|
||||
|
||||
|
||||
Moved files:
|
||||
|
||||
- There's no trivial way to detect renamed files, especially if they
|
||||
move between directories.
|
||||
|
||||
- If we had a picture of the remote directory from last time on
|
||||
either machine, then the inode numbers might give us a hint about
|
||||
files which may have been renamed.
|
||||
|
||||
- Files that are renamed and not modified can be detected by
|
||||
examining the directory listing, looking for files with the same
|
||||
size/date as the origin.
|
||||
|
||||
|
||||
Filesystem migration:
|
||||
|
||||
NFSv4 probably wants to migrate file locks, but that's not really
|
||||
our problem.
|
||||
|
||||
|
||||
Atomic updates:
|
||||
|
||||
The NFSv4 working group wants atomic migration. Most of the
|
||||
responsibility for this lies on the NFS server or OS.
|
||||
|
||||
If migrating a whole tree, then we could do a nearly-atomic rename
|
||||
at the end. This ties in to having separate basis and destination
|
||||
files.
|
||||
|
||||
There's no way in Unix to replace a whole set of files atomically.
|
||||
However, if we get them all onto the destination machine and then do
|
||||
the updates quickly it would greatly reduce the window.
|
||||
|
||||
|
||||
Scalability:
|
||||
|
||||
We should aim to work well on machines in use in a year or two.
|
||||
That probably means transfers of many millions of files in one
|
||||
batch, and gigabytes or terabytes of data.
|
||||
|
||||
For argument's sake: at the low end, we want to sync ten files for a
|
||||
total of 10kb across a 1kB/s link. At the high end, we want to sync
|
||||
1e9 files for 1TB of data across a 1GB/s link.
|
||||
|
||||
On the whole CPU usage is not normally a limiting factor, if only
|
||||
because running over SSH burns a lot of cycles on encryption.
|
||||
|
||||
Perhaps have resource throttling without relying on rlimit.
|
||||
|
||||
|
||||
Streaming:
|
||||
|
||||
A big attraction of rsync is that there are few round-trip delays:
|
||||
basically only one to get started, and then everything is
|
||||
pipelined. This is a problem with FTP, and NFS (at least up to
|
||||
v3). NFSv4 can pipeline operations, but building on that is
|
||||
probably a bit complicated.
|
||||
|
||||
|
||||
Related work:
|
||||
|
||||
- mirror.pl
|
||||
|
||||
- ProFTPd
|
||||
|
||||
- Apache
|
||||
|
||||
- BitTorrent -- p2p mirroring
|
||||
http://bitconjurer.org/BitTorrent/
|
||||
26
rsyncsh.txt
26
rsyncsh.txt
@@ -1,26 +0,0 @@
|
||||
rsyncsh
|
||||
Copyright (C) 2001 by Martin Pool
|
||||
|
||||
This is a quick hack to build an interactive shell around rsync, the
|
||||
same way we have the ftp, lftp and ncftp programs for the FTP
|
||||
protocol. The key application for this is connecting to a public
|
||||
rsync server, such as rsync.kernel.org, change down through and list
|
||||
directories, and finally pull down the file you want.
|
||||
|
||||
rsync is somewhat ill-at-ease as an interactive operation, since every
|
||||
network connection is used to carry out exactly one operation. rsync
|
||||
kind of "forks across the network" passing the options and filenames
|
||||
to operate upon, and the connection is closed when the transfer is
|
||||
complete. (This might be fixed in the future, either by adapting the
|
||||
current protocol to allow chained operations over a single socket, or
|
||||
by writing a new protocol that better supports interactive use.)
|
||||
|
||||
So, rsyncsh runs a new rsync command and opens a new socket for every
|
||||
(network-based) command you type.
|
||||
|
||||
This has two consequences. Firstly, there is more command latency
|
||||
than is really desirable. More seriously, if the connection cannot be
|
||||
done automatically, because for example it uses SSH with a password,
|
||||
then you will need to enter the password every time. We might even
|
||||
fix this in the future, though, by having a way to automatically feed
|
||||
the password to SSH if it's entered once.
|
||||
Reference in New Issue
Block a user