* Rename all symbols that are intended to be private to `rs__`

* Rename those that don't match either prefix.

* We have a few functions to do with reading a netint, stashing
  it somewhere, then moving into a different state.  Is it worth
  writing generic functions for that, or would it be too confusing?
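
  A minimal sketch of what one such generic helper could look like; the
  names here (rs__read_netint_then, rs__suck_netint and the statefn
  field) are illustrative, not the current internals:

      /* Hypothetical sketch only: read an n-byte network integer,
       * stash it, then switch the job to the next state. */
      typedef rs_result rs__statefn_t(rs_job_t *job);

      static rs_result rs__read_netint_then(rs_job_t *job, int len,
                                            rs_long_t *where,
                                            rs__statefn_t *next_state)
      {
          rs_result result = rs__suck_netint(job, len, where);  /* assumed reader */
          if (result != RS_DONE)
              return result;            /* typically RS_BLOCKED: need more input */
          job->statefn = next_state;    /* value stashed; move to the next state */
          return RS_RUNNING;
      }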

* Duplicate block handling.  Currently duplicate blocks are included in
  the signature, but we only put the first duplicate block in the
  hashtable, so the delta only includes references to the first block.
  This can result in sub-optimal copy commands, breaking single large
  copies with duplicate blocks into multiple copies referencing the
  earlier copy of the block.  However, this could also make patching use
  the disk cache more effectively.  This solution is probably fine,
  particularly given how small copy instructions are, but there might be
  solutions for improving copy commands for long runs of duplicate blocks.
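
  A toy illustration of that behaviour (not librsync's actual hashtable
  or entry layout): when loading the signature, a block identical to one
  already present is simply not inserted, so every lookup resolves to
  the first occurrence.

      /* Illustrative only: a simplistic sum table, first occurrence wins. */
      #include <string.h>

      #define STRONG_LEN 8

      struct block_entry {
          unsigned weak;                    /* rolling checksum */
          unsigned char strong[STRONG_LEN]; /* truncated strong checksum */
          int index;                        /* block position in the old file */
      };

      struct sum_table {
          struct block_entry entries[1024];
          int count;
      };

      static void insert_block(struct sum_table *t, const struct block_entry *e)
      {
          for (int i = 0; i < t->count; i++) {
              if (t->entries[i].weak == e->weak
                  && memcmp(t->entries[i].strong, e->strong, STRONG_LEN) == 0)
                  return;                   /* duplicate: keep only the first */
          }
          if (t->count < 1024)
              t->entries[t->count++] = *e;
      }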

* Optimisations and code cleanups:

  scoop.c: Scoop needs a major refactor.  Perhaps the API needs to
  change?

  rsync.h: rs_buffers_s and rs_buffers_t should be one typedef?

  * Just how useful is rs_job_drive anyway?

  mdfour.c: This code has a different API to the RSA code in libmd
  and is coupled with librsync in unhealthy ways (trace?).  Recommend
  changing to the RSA API.

* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name.  I think people expect that it will be
  something that behaves like a FILE* or C++ stream, and it really
  does not.  Also, the structure does not behave as an object: it's
  really just a shorthand for passing values in to the encoding
  routines, and so does not have a lot of identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much we
  really consumed and produced.

  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that need
  them, which are relatively deeply embedded.  I guess we could just
  stick them into the job structure, which is becoming a kind of
  catch-all "environment" for poor C programmers.
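
  A sketch of how a caller might drive the alternative signature above;
  this is the proposed convention, not the current API, and job,
  in_file and out_file are assumed to already exist:

      /* in_len/out_len are value-result: buffer sizes on entry, bytes
       * consumed/produced on return. */
      char in_buf[4096], out_buf[4096];
      size_t in_have = 0;
      rs_result result;

      do {
          /* Top up the input buffer behind whatever was not consumed. */
          in_have += fread(in_buf + in_have, 1, sizeof in_buf - in_have, in_file);
          int in_is_ending = feof(in_file);

          size_t in_len = in_have;
          size_t out_len = sizeof out_buf;
          result = rs_job_iter(job, in_buf, &in_len, in_is_ending,
                               out_buf, &out_len);

          fwrite(out_buf, 1, out_len, out_file);

          /* Keep any input the job did not consume for the next call. */
          memmove(in_buf, in_buf + in_len, in_have - in_len);
          in_have -= in_len;
      } while (result == RS_BLOCKED);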

* Plot lengths of each function

* Some kind of statistics on delta each day

* Include a version in the signature and difference fields

* Remember to update them if we ever ship a buggy version (nah!) so
  that other parties can know not to trust the encoded data.

  In fact, we can vary on several different variables:

  * what signature format are we using

  * what command protocol are we using

  * what search algorithm are we using?

  * what implementation version are we?

  Some are more likely to change than others.  We need a chart
  showing which source files depend on which variable.
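
  One way to carry several of those variables at once is a magic/version
  word at the front of each encoded stream; a rough sketch, with
  invented constants rather than the real magic numbers:

      /* Sketch only: layout and constants are illustrative. */
      #include <stdint.h>

      #define SIG_MAGIC_BASE 0x72730000u    /* base value + format number */

      struct sig_header {
          uint32_t magic;       /* signature format + implementation version */
          uint32_t block_len;   /* bytes per block */
          uint32_t strong_len;  /* bytes of strong sum kept per block */
      };

      /* Nonzero if this reader understands the encoded format. */
      static int sig_format_supported(uint32_t magic)
      {
          return magic >= SIG_MAGIC_BASE && magic - SIG_MAGIC_BASE <= 2;
      }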

* Encoding implementation

  * Join up signature commands

* Self-referential copy commands

  Suppose we have a file with repeating blocks.  The gdiff format
  allows for COPY commands to extend into the *output* file so that
  they can easily point this out.  By doing this, they get
  compression as well as differencing.

  It'd be pretty simple to implement this, I think: as we produce
  output, we'd also generate checksums (using the search block
  size), and add them to the sum set.  Then matches will fall out
  automatically, although we might have to specially allow for
  short blocks.

  However, I don't see many files which have repeated 1kB chunks,
  so I don't know if it would be worthwhile.
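
  A rough sketch of the implementation idea above, with invented helper
  names standing in for the real machinery: whenever a full block's
  worth of output has been emitted, checksum it and add it to the same
  sum set the matcher searches, so later output can be copied from
  earlier output.

      #include <string.h>

      /* Sketch only: these helpers are stand-ins, not real functions. */
      struct sumset;
      void emit_output(const unsigned char *data, size_t len);
      unsigned weak_sum(const unsigned char *p, size_t n);
      const unsigned char *strong_sum(const unsigned char *p, size_t n);
      void sumset_add(struct sumset *s, unsigned weak,
                      const unsigned char *strong, long block_index);

      static void emit_and_index(struct sumset *sums, size_t block_len,
                                 const unsigned char *data, size_t len,
                                 unsigned char *block_buf, size_t *buffered,
                                 long *next_block)
      {
          emit_output(data, len);               /* normal delta output */

          /* Accumulate output into whole blocks and index each one. */
          while (len > 0) {
              size_t take = block_len - *buffered;
              if (take > len)
                  take = len;
              memcpy(block_buf + *buffered, data, take);
              *buffered += take;
              data += take;
              len -= take;

              if (*buffered == block_len) {
                  sumset_add(sums, weak_sum(block_buf, block_len),
                             strong_sum(block_buf, block_len), (*next_block)++);
                  *buffered = 0;
              }
          }
      }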

* Suppose the new file just has data added to the end.  At the
  moment, we'll match everything but the last block of the old
  file.  It won't match, because at the moment the search block
  size is only reduced at the end of the *new* file.  This is a
  little inefficient, because ideally we'd know to look for the
  last block using the shortened length.

  This is a little hard to implement, though perhaps not
  impossible.  The current rolling search algorithm can only look
  for one block size at any time.  Can we do better?  Can we look
  for all block lengths that could match anything?

  Remember also that at the moment we don't send the block length
  in the signature; it's implied by the length of the new block
  that it matches.  This is kind of cute, and importantly helps
  reduce the length of the signature.
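
  One cheap partial fix for the data-appended case, sketched below with
  invented names, would be to keep the old file's short trailing block
  as a special case and compare it directly against the tail of the new
  data once the rolling search reaches the end; it does not answer the
  general multi-length question.

      #include <string.h>

      /* Sketch only: a special case for the old file's short last block. */
      struct tail_block {
          const unsigned char *data;  /* final, shorter-than-block_len piece */
          size_t len;                 /* its length (< block_len) */
          long index;                 /* block number in the old file */
      };

      /* Returns the matching block index, or -1 if the tails differ. */
      static long match_tail(const struct tail_block *old_tail,
                             const unsigned char *new_data, size_t new_len)
      {
          if (old_tail->len == 0 || new_len < old_tail->len)
              return -1;
          if (memcmp(new_data + new_len - old_tail->len,
                     old_tail->data, old_tail->len) != 0)
              return -1;
          return old_tail->index;
      }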

* State-machine searching

  Building a state machine from a regular expression is a brilliant
  idea.  (I think *The Practice of Programming* walks through the
  construction of this at a fairly simple level.)

  In particular, we can search for any of a large number of
  alternatives in a very efficient way, with much less effort than
  it would take to search for each of them the hard way.  Remember
  also the string-searching algorithms and how much time they can take.

  I wonder if we can use similar principles here rather than the
  current simple rolling-sum mechanism?  Could it let us match
  variable-length signatures?

* Support gzip compression of the difference stream.  Does this
  belong here, or should it be in the client, with librsync just
  providing an interface that lets it cleanly plug in?

  I think if we're going to just do plain gzip, rather than
  rsync-gzip, then it might as well be external.

* rsync-gzip: preload with the omitted text so as to get better
  compression.  Abo thinks this gets significantly better
  compression.  On the other hand, we would have to import and maintain
  our own zlib fork, at least until we can persuade the upstream to
  take the necessary patch.  Can that be done?

  abo: It does get better compression, but at a price.  I actually
  think that getting the code to a point where a feature like
  this can be easily added or removed is more important than the
  feature itself.  Having generic pre- and post-processing layers
  for hit/miss data would be useful.  I would not like to see it
  added at all if it tangled and complicated the code.

  It also doesn't require a modified zlib... pysync uses the
  standard zlib to do it by compressing the data, then throwing
  it away.  I don't know how much benefit the rsync modifications
  to zlib actually bring, but if I was implementing it I would
  stick to a stock zlib until it proved significantly better to
  use the modified one.
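
  One stock-zlib way to preload the compressor with text the receiver
  already holds is the documented deflateSetDictionary() /
  inflateSetDictionary() pair.  The sketch below is an illustration of
  the idea only, not how pysync or rsync actually wire it up:

      #include <string.h>
      #include <zlib.h>

      /* Compress `lit` (miss data) after priming the window with
       * `omitted` (hit data the receiver already has).  The receiver
       * must call inflateSetDictionary() with the same bytes. */
      static int compress_literal(const unsigned char *omitted, size_t omitted_len,
                                  const unsigned char *lit, size_t lit_len,
                                  unsigned char *out, size_t out_size,
                                  size_t *out_len)
      {
          z_stream z;
          memset(&z, 0, sizeof z);
          if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
              return -1;

          /* zlib keeps at most 32 KiB of history; pass only the tail. */
          if (omitted_len > 32768) {
              omitted += omitted_len - 32768;
              omitted_len = 32768;
          }
          deflateSetDictionary(&z, omitted, (uInt)omitted_len);

          z.next_in = (Bytef *)lit;
          z.avail_in = (uInt)lit_len;
          z.next_out = out;
          z.avail_out = (uInt)out_size;
          int rc = deflate(&z, Z_FINISH);
          *out_len = out_size - z.avail_out;
          deflateEnd(&z);
          return rc == Z_STREAM_END ? 0 : -1;
      }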

* Will the GNU Lesser GPL work?  Specifically, will it be a problem
  in distributing this with Mozilla or Apache?

* Do we really need to require that signatures arrive after the
  data they describe?  Does it make sense in HTTP to resume an
  interrupted transfer?

  I hope we can do this.  If we can't, however, then we should
  relax this constraint and allow signatures to arrive before the
  data they describe.  (Really?  Do we care?)

* Allow variable-length checksums in the signature; the signature
  will have to describe the length of the sums, and we must compare
  them taking this into account.
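
  A minimal sketch of such a length-aware comparison, assuming the
  stored length comes from the signature header:

      #include <string.h>

      /* Sketch only: compare a freshly computed strong sum against one
       * stored in the signature, which may be truncated to strong_len. */
      static int strong_sums_equal(const unsigned char *stored,
                                   const unsigned char *computed,
                                   size_t strong_len)
      {
          /* Only strong_len bytes were kept, so only those can be compared. */
          return memcmp(stored, computed, strong_len) == 0;
      }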

* Just more testing in general.

* Test broken pipes and that IO errors are handled properly.

* Test files >2GB, >4GB.  Presumably these must be done in streams
  so that the disk requirements to run the test suite are not too
  ridiculous.  I wonder if it will take too long to run these
  tests?  Probably, but perhaps we can afford to run just one
  carefully-chosen test.
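
  One way to keep the disk requirements down, independent of the
  existing test harness: generate the large input deterministically and
  stream it straight into the tool under test through a pipe.  A sketch:

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* Sketch only: stream ~5 GiB of deterministic pseudo-random data
       * to stdout so no multi-gigabyte file ever touches the disk. */
      int main(void)
      {
          uint64_t state = 0x243F6A8885A308D3ull;   /* fixed seed: reproducible */
          unsigned char buf[1 << 16];
          uint64_t total = 5ull << 30;              /* 5 GiB */

          for (uint64_t written = 0; written < total; written += sizeof buf) {
              for (size_t i = 0; i < sizeof buf; i += 8) {
                  state = state * 6364136223846793005ull + 1442695040888963407ull;
                  memcpy(buf + i, &state, 8);
              }
              if (fwrite(buf, 1, sizeof buf, stdout) != sizeof buf)
                  return 1;
          }
          return 0;
      }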

* Fuzz instruction streams. <https://code.google.com/p/american-fuzzy-lop/>?

* Generate random data; do random mutations.

* Try different block lengths.

* Tests should fail if they can't find their inputs, or have zero
  inputs: at present they tend to succeed by default.

* Test varying strong-sum inputs: default, short, long.

* If this code were to read differences or sums from random machines
  on the network, then it would be a security boundary.  Make sure that
  corrupt input data can't make the program crash or misbehave.

* How do we handle the large signatures required to support large
  files?  In particular, how do we choose an appropriate block size
  when the length is unknown?  Perhaps we should allow a way for
  the signature to scale up as it grows.
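
  For the case where the length is known up front, one plausible
  heuristic (the constants here are only illustrative) is to scale the
  block size with the square root of the file size, so the signature
  grows sub-linearly:

      #include <math.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Sketch only: block length proportional to sqrt(file size),
       * clamped and rounded to a convenient multiple. */
      static size_t choose_block_len(uint64_t file_size)
      {
          if (file_size == 0)
              return 2048;                          /* unknown: fixed default */
          double b = sqrt((double)file_size);
          if (b < 256)
              b = 256;
          if (b > 65536)
              b = 65536;
          return ((size_t)b + 255) & ~(size_t)255;  /* round up to 256 bytes */
      }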

* Perhaps make extracted signatures still be wrapped in commands.
  What would this lead to?

  * We'd know how much signature data we expect to read, rather than
    requiring it to be terminated by the caller.