As of release 1.1, Hyperledger Sawtooth supports dynamic consensus through its consensus API and SDKs. These tools, which were covered in a previous blog post, are the building blocks that make it easy to implement different consensus algorithms as consensus engines for Sawtooth. We chose to implement the Raft algorithm as our first consensus engine, which we describe in another blog post. While our Raft implementation is an excellent proof of concept, it is not Byzantine-fault-tolerant, which makes it unsuitable for consortium-style networks with adversarial trust characteristics.
To fill this gap, we chose the Practical Byzantine Fault Tolerance (PBFT) consensus algorithm. We started work on the Sawtooth PBFT consensus engine in the summer of 2018 and continue to develop and improve on it as we work towards its first stable release. This blog post summarizes the PBFT algorithm and describes how it works in Sawtooth.
What Is PBFT?
PBFT dates back to a 1999 paper written by Miguel Castro and Barbara Liskov at MIT. Unlike other algorithms at the time, PBFT was the first Byzantine fault tolerant algorithm designed to work in practical, asynchronous environments. PBFT is thoughtfully defined, well established, and widely understood, which makes it an excellent choice for Hyperledger Sawtooth.
PBFT is similar to Raft in some general ways:
- It is leader-based and non-forking (unlike lottery-style algorithms)
- It does not support open-enrollment, but nodes can be added and removed by an administrator
- It requires full peering (all nodes must be connected to all other nodes)
PBFT provides Byzantine fault tolerance, whereas Raft only supports crash fault tolerance. Byzantine fault tolerance means that liveness and safety are guaranteed even when some portion of the network is faulty or malicious. As long as a minimum percentage of nodes in the PBFT network are connected, working properly, and behaving honestly, the network will always make progress and will not allow any of the nodes to manipulate the network.
How Does PBFT Work?
The original PBFT paper has a detailed and rigorous explanation of the consensus algorithm. What follows is a summary of the algorithm’s key points in the context of Hyperledger Sawtooth. The original definition is broadly applicable to any kind of replicated system; by keeping this information blockchain-specific, we can more easily describe the functionality of the Sawtooth PBFT consensus engine.
A PBFT network consists of a series of nodes that are ordered from 0 to n-1, where n is the number of nodes in the network. As mentioned earlier, there is a maximum number of “bad” nodes that the PBFT network can tolerate. As long as this number of bad nodes—referred to as the constant f—is not exceeded, the network will work properly. For PBFT, the constant f is equal to one third of the nodes in the network. No more than a third of the network (rounded down) can be “out of order” or dishonest at any given time for the algorithm to work. The values of n and f are very important; you’ll see them later as we discuss how the algorithm operates.
Figure 1 — n and f in the PBFT algorithm
As the network progresses, the nodes move through a series of “views”. A view is a period of time that a given node is the primary (leader) of the network. In simple terms, each node takes turns being the primary in a never-ending cycle, starting with the first node. For a four-node network, node 0 is the primary at view 0, node 1 is the primary at view 1, and so on. When the network gets to view 4, it will “wrap back around” so that node 0 is the primary again.
In more technical terms, the primary (p) for each view is determined based on the view number (v) and the ordering of the nodes. The formula for determining the primary for any view on a given network is p = v mod n. For instance, on a four-node network at view 7, the formula p = 7 mod 4 means that node 3 will be the primary (7 mod 4 = 3).
In addition to moving through a series of views, the network moves through a series of “sequence numbers.” In the context of a Sawtooth blockchain, a sequence number is equivalent to a block number; thus, saying that a node is on sequence number 10 is the same as saying that the node is performing consensus on block 10 in the chain.
Each node maintains a few key pieces of information as part of its state:
- The list of nodes that belong to the network
- Its current view number
- Its current sequence number (the block it is working on)
- The phase of the algorithm it is currently in (see “Normal-Case Operation”)
- A log of the blocks it has received
- A log of all valid messages it has received from the other nodes
To commit a block and make progress, the nodes in a PBFT network go through three phases:
Figure 2 shows these phases for a simple four-node network. In this example, node 0 is the primary and node 3 is a faulty node (so it does not send any messages). Because there are four nodes in the network (n = 4), the value of f for the network is 4-13=1. This means the example network can tolerate only one faulty node.
To kick things off, the primary for the current view will create a block and publish it to the network; each of the nodes will receive this block and perform some preliminary verification to make sure that the block is valid.
After the primary has published a block to the network, it broadcasts a pre-prepare message to all of the nodes. Pre-prepare messages contain four key pieces of information: the ID of the block the primary just published, the block’s number, the primary’s view number, and the primary’s ID. When a node receives a pre-prepare message from the primary, it will validate the message and add the message to its internal log. Message validation includes verifying the digital signature of the message, checking that the message’s view number matches the node’s current view number, and ensuring that the message is from the primary for the current view.
The pre-prepare message serves as a way for the primary node to publicly endorse a given block and for the network to agree about which block to perform consensus on for this sequence number. To ensure that only one block is considered at a time, nodes do not allow more than one pre-prepare message at a given view and sequence number.
Once a node has received a block and a pre-prepare message for the block, and both the block and message have been added to the node’s log, the node will move on to the preparing phase. In the preparing phase, the node will broadcast a prepare message to the rest of the network (including itself). Prepare messages, like pre-prepare messages, contain the ID and number of the block they are for, as well as the node’s view number and ID.
In order to move onto the next phase, the node must wait until it has received 2f + 1 prepare messages that have the same block ID, block number, and view number, and are from different nodes. By waiting for 2f + 1 matching prepare messages, the node can be sure that all properly functioning nodes (those that are non-faulty and non-malicious) are in agreement at this stage. Once the node has accepted the required 2f + 1 matching prepare messages and added them to its log, it is ready to move onto the committing phase.
When a node enters the committing phase, it broadcasts a commit message to the whole network (including itself). Like the other message types, commit messages contain the ID and number of the block they are for, along with the node’s view number and ID. As with the preparing phase, a node cannot complete the committing phase until it has received 2f + 1 matching commit messages from different nodes. Again, this guarantees that all non-faulty nodes in the network have agreed to commit this block, which means that the node can safely commit the block knowing that it will not need to be reverted. With the required 2f + 1 commit messages accepted and in its log, the node can safely commit the block.
Once the primary node has finished the committing phase and has committed the block, it will start the whole process over again by creating a block, publishing it, and broadcasting a pre-prepare message for it.
In order to be Byzantine fault tolerant, a consensus algorithm must prevent nodes from improperly altering the network (to guarantee safety) or indefinitely halting progress (to ensure liveness). PBFT guarantees safety by requiring all non-faulty nodes to agree in order to move beyond the preparing and committing phases. To guarantee liveness, though, there must be a mechanism to determine if the leader is behaving improperly (such as producing invalid messages or simply not doing anything). PBFT provides the liveness guarantee with view changes.
When a node has determined that the primary of view v is faulty (perhaps because the primary sent an invalid message or did not produce a valid block in time), it will broadcast a view change message for view v + 1 to the network. If the primary is indeed faulty, all non-faulty nodes will broadcast view change messages. When the primary for the new view (v + 1) receives 2f + 1 view change messages from different nodes, it will broadcast a new view message for view v + 1 to all the nodes. When the other nodes receive the new view message, they will switch to the new view, and the new primary will start publishing blocks and sending pre-prepare messages.
View changes guarantee that the network can move on to a new primary if the current one is faulty. This PBFT feature allows the network to continue to make progress and not be stalled by a bad primary node.
Want to Learn More?
This blog post only scratches the surface of the PBFT consensus algorithm. Stay tuned to the Hyperledger blog for more information on PBFT, including a future post about our extensions and additional features for Sawtooth PBFT.
About the Author
Logan Seeley is a Software Engineer at Bitwise IO. He has been involved in a variety of Hyperledger Sawtooth projects, including the development of the consensus API, Sawtooth Raft, and Sawtooth PBFT.