FreePastry release notes
Release 1.4, 3/15/05.
FreePastry is a modular, open source implementation of the Pastry p2p structured overlay network.
Contributors
Peter Druschel, Eric Engineer, Romer Gil, Andreas Haeberlen, Jeff Hoye,
Y. Charlie Hu, Sitaram Iyer, Andrew Ladd, Alan Mislove, Animesh Nandi,
Ansley Post, Charlie Reis, Dan Sandler, Atul Singh, and RongMei Zhang
contributed to the FreePastry code. The code is based on algorithms and
protocols described in the following papers:
- A. Rowstron and P. Druschel, "Pastry: Scalable, distributed object
location and routing for large-scale peer-to-peer systems", IFIP/ACM
International Conference on Distributed Systems Platforms (Middleware),
Heidelberg, Germany, pages 329-350, November 2001.
- M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron, "Proximity neighbor
selection in tree-based structured peer-to-peer overlays", Microsoft
Technical Report MSR-TR-2003-52, 2003.
- M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron, "SCRIBE: A
large-scale and decentralized application-level multicast infrastructure",
IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on
Network Support for Multicast Communications, 2002.
- A. Rowstron and P. Druschel, "Storage management and caching in PAST, a
large-scale, persistent peer-to-peer storage utility", 18th ACM Symposium
on Operating Systems Principles (SOSP'01), Lake Louise, Alberta, Canada,
October 2001. (Corrected version; an erratum for the original is available.)
- F. Dabek, P. Druschel, B. Zhao, J. Kubiatowicz, and I. Stoica, "Towards
a Common API for Structured Peer-to-Peer Overlays", 2nd International
Workshop on Peer-to-Peer Systems (IPTPS'03), Berkeley, CA, February 2003.
- M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and
A. Singh, "SplitStream: High-bandwidth multicast in a cooperative
environment", 19th ACM Symposium on Operating Systems Principles (SOSP'03),
Lake George, New York, October 2003.
Requirements
The software requires a Java runtime, version 1.4.2. It was developed
using Sun's SDK, version 1.4.2_03 or later.
Changes since release 1.3.2
- Supports ePOST, which is now in production use.
- Single threaded:
We have adopted a single-threaded model to improve performance. Calls into Pastry are still properly synchronized. In fact, if you run multiple nodes within the same JVM, they will execute on the same thread. The scheduler is implemented in rice.selector.
- Transport Layer/Routing:
- Removed rice.pastry.RMI, rice.pastry.Wire -- Use rice.pastry.Socket
- Simplified transport layer. The Socket transport layer uses TCP for all messaging except liveness.
- Improved liveness checking, better support for churn. The Socket transport layer uses UDP only for liveness checks, and they are sent using random exponential backoff.
- Improved support for PlanetLab and the Internet. Gracefully handles temporary and permanent routing anomalies by using source routes. The source routes are selected from nodes within the leaf set.
- Improved routing performance. We now use per-hop acknowledgments to aggressively route around nodes that have stalled or become congested.
- Support for a fixed file-descriptor limit on sockets. This lets a node cap the number of concurrently open sockets, which is useful if you do not have sufficient privilege to raise the maximum number of open file descriptors for a process (see ulimit -n).
- Support for fine-grained prioritization of message delivery. We added
int getPriority()
to the message interface. The transport will deliver higher-priority messages before lower-priority ones. (A sketch appears after this change list.)
- Support for multiple bootstrap nodes. The socket transport layer can take a list of IP addresses to bootstrap from. It tries them in random order and returns the first one it can connect to. (See
DistPastryNodeFactory.getNodeHandle(InetSocketAddress[])
and the sketch after this change list.)
- Support for NodeId reuse. Previous versions randomized the last 32 bits of a NodeId. The new version uses an epoch to determine if a node has rebooted since you last talked to it.
- Modules:
- p2p.commonapi -- Added priority to messages.
- p2p.past -- Added a lease-based version of PAST in p2p.past.gc, which extends the PAST interface.
- p2p.scribe -- Minor changes.
- p2p.replication -- Modified to use Bloom filters instead of key lists for replication.
- p2p.splitstream -- Minor scalability improvements based on the PlanetLab deployment.
- (NEW) p2p.util -- Various utilities for the p2p packages, including cryptography and XML support.
- (NEW) p2p.multiring -- An implementation of the multiring protocol described in the IPTPS paper, complete with optional RingCertificates.
- (NEW) p2p.aggregation -- Improves DHT efficiency by bundling several small
objects into a larger aggregate. It is used by Glacier, but can also be
combined with other DHTs such as PAST.
- (NEW) p2p.glacier -- Protects against data loss during large-scale
correlated failures. Distributed storage systems can suffer data loss when
a large fraction of the storage nodes fail simultaneously, e.g. during a
worm attack. Glacier protects against this by spreading redundant data
fragments throughout the system; it can reconstruct the data with high
probability even after disastrous failures that may affect 60% of the
nodes or more. Please see our NSDI paper on Glacier for a more detailed
description.
- persistence -- Many improvements and bug fixes. The interface is basically the same, but the implementation should now be *MUCH* more robust. Also added metadata support to the rice.persistence package classes.
- (NEW) selector -- Allows Pastry to run on a single thread.
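The two transport-related additions above (message priorities and multiple
bootstrap nodes) are illustrated by the minimal sketches below. Only
getPriority() and DistPastryNodeFactory.getNodeHandle(InetSocketAddress[])
are named in these notes; every other class name, constant, and the
priority-value convention used here are assumptions to be checked against
the release's Javadoc.

import rice.p2p.commonapi.Message;

// A minimal sketch of an application message that declares its own delivery
// priority through the getPriority() method added in this release. The value
// returned here and the mapping from numeric values to scheduling order are
// assumptions; consult the rice.p2p.commonapi.Message Javadoc.
public class UrgentPing implements Message {
  private final long sentAt = System.currentTimeMillis();

  public int getPriority() {
    return 0; // assumed to mean "deliver before default-priority traffic"
  }

  public String toString() {
    return "UrgentPing sent at " + sentAt;
  }
}

And a sketch of booting a node from a list of candidate bootstrap addresses:

import java.net.InetSocketAddress;
import rice.pastry.NodeHandle;
import rice.pastry.PastryNode;
import rice.pastry.dist.DistPastryNodeFactory;
import rice.pastry.standard.RandomNodeIdFactory;

public class MultiBootstrap {
  public static void main(String[] args) {
    int port = 5009; // default contact port

    // Factory construction mirrors the bundled dist test applications; treat
    // the exact getFactory() signature and the PROTOCOL_SOCKET constant as
    // assumptions.
    DistPastryNodeFactory factory = DistPastryNodeFactory.getFactory(
        new RandomNodeIdFactory(), DistPastryNodeFactory.PROTOCOL_SOCKET, port);

    // Candidate bootstraps; the transport tries them in random order and
    // returns a handle to the first node it can reach.
    InetSocketAddress[] bootstraps = new InetSocketAddress[] {
        new InetSocketAddress("pokey", port),
        new InetSocketAddress("gamma", port)
    };
    NodeHandle bootHandle = factory.getNodeHandle(bootstraps);

    // A null handle means no bootstrap was reachable, so the new node
    // starts its own overlay.
    PastryNode node = factory.newNode(bootHandle);
    System.out.println("Started " + node);
  }
}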
Changes since release 1.3.1
- Overhaul of the wire package. The new version performs better and
eliminates some known synchronization issues that could cause deadlock.
Furthermore, the package now throws the more sensible NodeIsDeadException
if an attempt is made to send a message after the node has been killed in
simulation. Killing of nodes remains for testing purposes, and is not supported on all platforms.
- New version of the Replication Manager in rice.p2p.replication.
It is based on the common API instead of the Pastry API. There is an
alternate, simpler interface to the replication manager in the
rice.p2p.replication.manager package.
- PAST has been migrated to the new replication manager (p2p.replication).
Changes since release 1.3
- New version of Scribe implemented on the common API. The new version
has been redesigned to improve performance, reliability, and ease of
use. The previous version of Scribe is still included in the release.
- New version of PAST provides a method to obtain handles to all
replicas, and various bug fixes. The old version of PAST that was built
on top of the Pastry API is now deprecated; the version built on the
common API should now be used.
- The discovery protocol for automatic location of a nearby node, given
any bootstrap node, is now implemented. This is described in the Pastry
proximity paper.
- A prototype implementation of SplitStream, a high-bandwidth
multicast system, is now released. This implementation does not yet
include all of the optimizations described in the paper; therefore,
overheads may be higher than those reported in the paper. See below
for more details about using SplitStream.
Changes since release 1.2
- FreePastry now supports the common API, as described in the
IPTPS'03 paper listed above. Newly developed applications should
use this API, and only import the p2p.commonapi package. The previous,
native FreePastry API continues to be supported for backward
compatibility.
- A more general implementation of the PAST
archival storage system was added in this release. The release adds
support for replication and caching of data. The implementation
provides a generic distributed hash table (DHT) facility, and allows
control over the semantics of tuple insertion for a given,
application-specific value type. The previous version of PAST has been
marked as deprecated and may not be included in future releases.
Applications that use Past should migrate to the new version.
- A version of the replication manager, which provides
application-independent management of replicas, is included.
Applications that need to replicate data on the set of n nodes
closest to a given key can use the replication manager to
perform this task.
Changes since release 1.1
- A simple implementation of the PAST
archival storage system was added in this release. The implementation
does not currently perform the storage balancing algorithms described in
the SOSP paper, nor does it perform data replication or caching. Support
for replication and caching will be included in the next release.
- An anycast primitive was added to the implementation of Scribe, a
group communication infrastructure. Also, several methods and a new
interface were added to give applications more control over the
construction and maintenance of Scribe trees.
- Some initial performance work was done. As a result, large
simulations run about 50% faster and use substantially less memory.
Notes
Release 1.4 has the following limitations.
- More performance tuning needs to be done.
- Two "transport protocols" are provided with this release,
"Socket", and "Direct".
- "Direct" emulates a network, allowing many Pastry nodes to
execute in one Java VM without a real network. This is very useful for
application development and testing.
- "Socket" uses an event-based implementation based on sockets, and
uses the non-blocking NIO support in Java 1.4.2. It uses TCP for all
communication except liveness checks which are UDP. This version is much
more robust than prior distributions and has been tested extensively on
planetlab.
Support for simulated killing of nodes is for testing purposes only.
On Unix systems, Java's socket implementation uses file descriptors,
which can be exhausted if too many nodes are running in a single
process. We impose a soft limit on the number of file descriptors, but
this limit can still be exceeded in some scenarios, for example by
sending messages to more nodes than you have file descriptors without
yielding the thread to allow I/O. If you need to run more than one node
inside a single process, consider increasing the number of file
descriptors per process (bash: ulimit -n), or lowering the available
sockets per node
(rice.pastry.socket.SocketCollectionManager.MAX_OPEN_SOCKETS).
- Security support does not exist in this release. Therefore,
the software should only be run in trusted environments. Future
releases will include security.
(Background: To start a Pastry node, the IP address (and
port number, unless the default port is used) of a "bootstrap" or
"contact" node must be provided. If no such node is provided, and no
other Pastry node runs on the local machine, then FreePastry creates a
new overlay network with itself as the only node. Any node that is
already part of the Pastry overlay can serve as the bootstrap node.)
- The Scribe implementation included in this release does not yet
support the tree optimization techniques described in Sections IV-E and
IV-F of the Scribe paper.
Installation
To use the binary distribution, download the pastry jar file and set
the Java classpath to include the path of the jar file. This can be done
using the "-cp" command line argument, or by setting the CLASSPATH
variable in your shell environment. For some applications you may need the
3rd party libraries included with the distribution. These are available
in the source distributions. Simply unpack the distribution and include the
jars in the lib/ directory in your classpath.
The source distribution now uses Ant for the build process. You will
need Ant installed (available from http://ant.apache.org/) on your
system. Expand the archive (FreePastry-1.4-source.tgz or FreePastry-1.4-source.zip)
into a directory. Execute "ant" in the top-level
directory (you may have to increase the maximum memory available to Ant by
setting the environment variable ANT_OPTS=-Xmx128m), then change
to the "classes" directory to run FreePastry.
You may have to provide a Java security policy file with sufficient
permissions to allow FreePastry to contact other nodes. The simplest way
to do this is to install a ".java.policy" file with the following
content into your home directory:
grant {
permission java.security.AllPermission;
};
Running FreePastry
1. To run a HelloWorld example:
java [-cp pastry.jar] rice.pastry.testing.DistHelloWorld
[-msgs m] [-nodes n] [-port p] [-bootstrap bshost[:bsport]] [-protocol [socket]]
[-verbose|-silent|-verbosity v] [-help]
Without -bootstrap bshost[:bsport], only localhost:p is used for bootstrap.
Default verbosity is 5, -verbose is 10, and -silent is -1 (error msgs only).
(replace "pastry.jar" by "FreePastry-<version>.jar", of course)
Some interesting configurations:
a. java rice.pastry.testing.DistHelloWorld
Starts a standalone Pastry network, and sends two messages
essentially to itself. Waits for anyone to connect to it,
so terminate with ^C.
b. java rice.pastry.testing.DistHelloWorld -nodes 2
One node starts a Pastry network, and sends two messages to
random destination addresses. At some point another node
joins in, synchronizes leaf sets and route sets with the first, and
sends two messages to random destinations. These may be
delivered to either node with equal probability. Note how
the sender node gets an "enroute" upcall from Pastry before
forwarding the message.
c. java rice.pastry.testing.DistHelloWorld -nodes 2 -verbose
Also prints some interesting transport-level messages.
d. pokey$ java rice.pastry.testing.DistHelloWorld
gamma$ java rice.pastry.testing.DistHelloWorld -bootstrap pokey
Two machines coordinate to form a Pastry network.
e. pokey$ java rice.pastry.testing.DistHelloWorld
gamma$ java rice.pastry.testing.DistHelloWorld -bootstrap pokey
wait a few seconds, and interrupt with <ctrl-C>
gamma$ java rice.pastry.testing.DistHelloWorld -bootstrap pokey
The second client restarts with a new NodeID, and joins the
Pastry network. One of them sends messages to the now-dead
node, finds it down, and may or may not remove it
from the leaf sets. (repeat a few times to observe both
possibilities, i.e., leaf sets of size 3 or 5). If the
latter, then leaf set maintenance kicks in within a minute
on one of the nodes, and removes the stale entries.
f. pokey$ java rice.pastry.testing.DistHelloWorld
gamma$ java rice.pastry.testing.DistHelloWorld -bootstrap pokey -nodes 2
The client on gamma instantiates two virtual nodes, which
are independent in identity and functionality. Note how the
second virtual node bootstraps from the first (rather than
from pokey). Try starting say 10 or 30 virtual nodes, killing
with a <ctrl-C>, starting another bunch, etc.
2. To run the same HelloWorld application on an emulated network:
java [-cp pastry.jar] rice.pastry.testing.HelloWorld [-msgs m] [-nodes n] [-verbose|-silent|-verbosity v] [-simultaneous_joins] [-simultaneous_msgs] [-help]
Some interesting configurations:
a. java rice.pastry.testing.HelloWorld
Creates three nodes, and sends a total of three messages from
randomly chosen nodes to random destination addresses
(which are delivered to the node with the numerically
closest address).
b. java rice.pastry.testing.HelloWorld -simultaneous_joins -simultaneous_msgs
Joins all three nodes at once, then issues three messages,
and then delivers them.
3. To run a regression test that constructs 500 nodes connected by an
emulated network:
java [-cp pastry.jar] rice.pastry.testing.DirectPastryRegrTest
4. To run a simple performance test based on an emulated network with
successively larger numbers of nodes:
java [-cp pastry.jar] rice.pastry.testing.DirectPastryPingTest
Writing applications on top of FreePastry
Applications that wish to use the native Pastry API must extend the
class rice.pastry.client.PastryAppl. This class implements the Pastry
API. Each application consists minimally of an application class that
extends rice.pastry.client.PastryAppl, and a driver class that
implements main(), creates and initializes one or more nodes, etc.
Example applications and drivers can be found in rice.pastry.testing;
the Hello World suite (HelloWorldApp.java, HelloWorld.java,
DistHelloWorld.java) may be a good starting point.
Another sample Pastry application is rice.scribe.
Application writers are strongly encouraged to base newly written
applications on the new common API. Such applications should import the
package rice.p2p.commonapi.
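As a starting point, here is a minimal sketch of a common-API application.
The three callbacks are those defined by the common API ("Towards a Common
API for Structured Peer-to-Peer Overlays"); the registerApplication() and
route() calls used for registration and sending are assumptions to be
checked against the rice.p2p.commonapi Javadoc.

import rice.p2p.commonapi.Application;
import rice.p2p.commonapi.Endpoint;
import rice.p2p.commonapi.Id;
import rice.p2p.commonapi.Message;
import rice.p2p.commonapi.Node;
import rice.p2p.commonapi.NodeHandle;
import rice.p2p.commonapi.RouteMessage;

public class EchoApp implements Application {
  private final Endpoint endpoint;

  public EchoApp(Node node) {
    // Registering under an instance name gives this application its own
    // endpoint, keeping its traffic separate from other applications
    // (registration call is an assumption for this release).
    this.endpoint = node.registerApplication(this, "echo-instance");
  }

  // Route a message toward the node whose Id is numerically closest to key.
  public void send(Id key, Message msg) {
    endpoint.route(key, msg, null);
  }

  // Called when a routed message reaches its destination at this node.
  public void deliver(Id id, Message message) {
    System.out.println("Delivered " + message + " for key " + id);
  }

  // Called on every intermediate hop; returning true lets routing continue.
  public boolean forward(RouteMessage message) {
    return true;
  }

  // Called when a node joins or leaves this node's neighbor set.
  public void update(NodeHandle handle, boolean joined) {
    System.out.println(handle + (joined ? " joined" : " left"));
  }
}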
Running Scribe
1. To run a simple distributed test:
java [-cp pastry.jar] rice.p2p.scribe.testing.ScribeRegrTest [-nodes n] [-port p] [-bootstrap bshost[:bsport]] [-protocol (direct|socket)] [-help]
Ports p and bsport refer to contact port numbers (default = 5009).
Without -bootstrap bshost[:bsport], only localhost:p is used for bootstrap.
(replace "pastry.jar" by "FreePastry-<version>.jar", of course)
Running PAST
1. To run a simple distributed test:
java [-cp pastry.jar] rice.p2p.past.testing.PastRegrTest [-nodes n] [-protocol (direct|socket)]
This creates a network of n nodes (10 by default), and then
runs the Past regression test over these nodes.
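For application writers, a minimal sketch of storing and retrieving data
through the common-API PAST interface (rice.p2p.past) follows. It assumes
an already-constructed Past instance and an application-defined PastContent
object; constructing these (e.g. via PastImpl and a rice.persistence
storage manager) is omitted, and exact signatures should be checked against
the Javadoc.

import rice.Continuation;
import rice.p2p.commonapi.Id;
import rice.p2p.past.Past;
import rice.p2p.past.PastContent;

public class PastUsage {
  public static void store(Past past, final PastContent content) {
    // insert() replicates the object on the nodes closest to its key.
    past.insert(content, new Continuation() {
      public void receiveResult(Object result) {
        System.out.println("Insert completed for " + content);
      }
      public void receiveException(Exception e) {
        System.err.println("Insert failed: " + e);
      }
    });
  }

  public static void fetch(Past past, Id key) {
    // lookup() retrieves the object stored under the given key.
    past.lookup(key, new Continuation() {
      public void receiveResult(Object result) {
        System.out.println("Looked up: " + result);
      }
      public void receiveException(Exception e) {
        System.err.println("Lookup failed: " + e);
      }
    });
  }
}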
Running SplitStream
The FreePastry implementation of SplitStream follows the system
described in the SOSP '03 paper.
The SplitStream.java class provides the interface that applications use
to create SplitStream instances. Each SplitStream forest
is represented by a channel object (Channel.java), where a channel
object encapsulates multiple stripe trees. Each stripe tree in a
SplitStream forest is represented by a class (Stripe.java), which
handles data reception and subscription failures.
Applications can configure the maximum capacity of each channel in terms
of the number of children it is willing to accept, and can control the
total outgoing capacity they are willing to provide by changing the value
in ScribeSplitStreamPolicy.java.
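The following sketch illustrates this object model. The concrete class and
method names (SplitStreamImpl, attachChannel, getStripes, subscribe, and
the SplitStreamClient callbacks) are assumptions drawn from the class names
above; verify them against the release's Javadoc before relying on this.

import rice.p2p.commonapi.Id;
import rice.p2p.commonapi.Node;
import rice.p2p.splitstream.Channel;
import rice.p2p.splitstream.ChannelId;
import rice.p2p.splitstream.SplitStream;
import rice.p2p.splitstream.SplitStreamClient;
import rice.p2p.splitstream.SplitStreamImpl;
import rice.p2p.splitstream.Stripe;

public class SplitStreamSketch implements SplitStreamClient {

  public void join(Node node, Id channelKey) {
    SplitStream splitStream = new SplitStreamImpl(node, "my-splitstream");

    // A channel encapsulates all stripe trees of one SplitStream forest.
    Channel channel = splitStream.attachChannel(new ChannelId(channelKey));

    // Subscribe to every stripe to receive the full content.
    Stripe[] stripes = channel.getStripes();
    for (int i = 0; i < stripes.length; i++) {
      stripes[i].subscribe(this);
    }
  }

  // Called when data arrives on a stripe we are subscribed to.
  public void deliver(Stripe stripe, byte[] data) {
    System.out.println("Received " + data.length + " bytes on " + stripe);
  }

  // Called when subscribing to a stripe fails, e.g. no parent had spare capacity.
  public void joinFailed(Stripe stripe) {
    System.err.println("Could not join " + stripe);
  }
}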
1. To run a simple distributed test:
java [-cp pastry.jar] rice.p2p.splitstream.testing.SplitStreamRegrTest [-nodes n] [-protocol (direct|socket)]
This creates a network of n nodes (10 by default), and then
runs the SplitStream regression test over these nodes.