The Contrail BGP implementation was designed from scratch to run on modern server environments. The main goals were to take advantage of multicore CPUs, large (>4G) memory footprints and modern software development techniques. Contrail's initial target application (network virtualization) does not in itself require an implementation that supports a high rate of routing updates or a large number of routes, but given the opportunity to write an implementation from scratch, the engineering team wanted to tackle the problem of concurrency from the beginning.
The BGP protocol can scale horizontally by adding BGP speakers which are then federated as a Route Reflector mesh. That approach is very effective when scaling up in one of the parallelizable dimensions, the number of peers; less so when it comes to scaling in the other dimension, the number of routes. While it is possible to implement a load-balancer-like mechanism that proxies the BGP session of a peer and distributes the route load across multiple route reflection meshes, the engineering team felt that this approach would expose too much complexity in situations where the scaling requirements are not particularly demanding.
As a result we adopted the more traditional approach of designing a multi-threaded implementation in a single process, with the goal of parallelizing both peer processing and route processing. In BGP, routes with different prefixes can be processed independently in most circumstances, except when implementing functionality such as route aggregation or next-hop resolution.
Having decided to use multi-threading we also knew we needed to find an alternative to mutexes as the primary mechanism to express data concurrency policies. Without a design that puts a framework around concurrency and localizes locking to specific modules of the code, multi-threading can become a very significant burden on feature development and code stability.
The concurrency design we ended up selecting is based on two concepts: pipeline stages and dataset parallelism. In a CPU design, a pipeline stage is independent from the stages that precede and follow it, except for a well defined set of registers for which there is a locking mechanism in place. An example of dataset parallelism is an NPU that processes packets in parallel, or a Hadoop job.
The implementation does not capture these concepts literally but achieves what we consider to be a reasonable approximation. The Task class allows the programmer to specify that a particular callback should be executed when determined by a given task identifier and instance. The application can then specify scheduling policies using these parameters. As an example, processing configuration changes is mutually exclusive with any other Task; code that runs under the bgp::Config task can modify any data structure without explicitly acquiring a mutex. The bulk of BGP processing, however, is performed by tasks with a specific task identifier and instance, which allows them to run concurrently with each other.
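The scheduler interface itself is not reproduced in this document; the following is a minimal sketch of the idea, assuming a hypothetical Scheduler class with named task identifiers and pairwise exclusion rules (the names SetExclusion, CanRunConcurrently and bgp::PeerProcessing are illustrative, not the actual API):

#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <utility>

class Scheduler {
public:
    // Maps a task name such as "bgp::Config" to a small integer identifier.
    int GetTaskId(const std::string &name) {
        std::map<std::string, int>::iterator it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(ids_.size());
        ids_[name] = id;
        return id;
    }

    // Declare that tasks with these two identifiers must never run in parallel.
    void SetExclusion(int task_a, int task_b) {
        exclusions_.insert(std::make_pair(std::min(task_a, task_b),
                                          std::max(task_a, task_b)));
    }

    // A real scheduler would consult this before dispatching a runnable task.
    bool CanRunConcurrently(int task_a, int task_b) const {
        return exclusions_.count(std::make_pair(std::min(task_a, task_b),
                                                std::max(task_a, task_b))) == 0;
    }

private:
    std::map<std::string, int> ids_;
    std::set<std::pair<int, int> > exclusions_;
};

int main() {
    Scheduler scheduler;
    int config = scheduler.GetTaskId("bgp::Config");
    int peer_io = scheduler.GetTaskId("bgp::PeerProcessing");
    // bgp::Config is mutually exclusive with peer processing, so code running
    // under it can touch shared data structures without explicit locking.
    scheduler.SetExclusion(config, peer_io);
    return scheduler.CanRunConcurrently(config, peer_io) ? 1 : 0;
}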
BGP can be divided into the following components: input processing (message decoding and validation), routing table operations, and update generation and encoding.
For input processing, concurrency can be accomplished by processing each peer individually. Routing table processing is parallelized by sharding the table with as many shards as the number of CPUs; a hash function is applied to the route prefix to select the shard (or db partition). Update encoding is performed on a peer group basis; a peer is assigned to a peer group if it has the same policy when it comes to the attributes that it generates. Artificially, we add cpu-affinity as a policy attribute in order to divide peers across groups.
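As a rough illustration of the grouping just described (the types and attribute fields below are hypothetical, not the implementation's), peers whose generated attributes and assigned CPU affinity match end up in the same encoding group:

#include <map>
#include <string>
#include <vector>

// Illustrative only: peers that generate identical outbound attributes, plus an
// artificial cpu-affinity attribute, share an update-encoding group.
struct EncodingPolicy {
    std::string local_as;     // attributes the peer's updates are built with
    bool remove_private_as;
    int cpu_affinity;         // artificial attribute used to split large groups

    bool operator<(const EncodingPolicy &rhs) const {
        if (local_as != rhs.local_as) return local_as < rhs.local_as;
        if (remove_private_as != rhs.remove_private_as) return remove_private_as < rhs.remove_private_as;
        return cpu_affinity < rhs.cpu_affinity;
    }
};

int main() {
    // One group per distinct policy; updates are encoded once per group.
    std::map<EncodingPolicy, std::vector<std::string> > groups;
    groups[EncodingPolicy{"64512", true, 0}].push_back("peer-1");
    groups[EncodingPolicy{"64512", true, 0}].push_back("peer-2");  // same group as peer-1
    groups[EncodingPolicy{"64512", true, 1}].push_back("peer-3");  // different affinity, new group
    return groups.size() == 2 ? 0 : 1;
}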
The goal for the BGP implementation is to have testing at the unit level as well as at the module and process level.
Unit testing drives code organization. Classes should be designed such that their interface provides a service which can be tested as a unit. The best way to accomplish this is to follow the TDD principle of writing the test of the interface before the implementation.
The bgp implementation is instantiated via a single server object (in the normal case, a local variable in main()). From the server object there is a tree of all the other objects that compose the implementation. This allows the test code to run multiple BGP servers within the same process. Along with dependency injection, it also allows the test code to replace any class or set of classes when performing module or process level testing.
Dependency injection is accomplished via a factory mechanism. The bgp factory configuration defines the default implementations for each interface. Test code can override the defaults by overriding the registration, as in the following example:
BgpObjectFactory::Register<BgpSessionManager>(boost::factory<BgpSessionManagerMock *>());
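The factory itself is not shown here; the following is a minimal sketch, under the assumption of a simple Register/Create-style template (ObjectFactory and its methods are hypothetical stand-ins for the real BgpObjectFactory), of how such an override can take effect:

#include <functional>
#include <memory>

class BgpSessionManager {
public:
    virtual ~BgpSessionManager() {}
};

class BgpSessionManagerMock : public BgpSessionManager {};

// Minimal per-interface factory: tests replace the default creator.
template <typename Base>
class ObjectFactory {
public:
    typedef std::function<Base *()> Creator;

    static void Register(Creator creator) { creator_ = creator; }

    static Base *Create() {
        return creator_ ? creator_() : new Base();
    }

private:
    static Creator creator_;
};

template <typename Base>
typename ObjectFactory<Base>::Creator ObjectFactory<Base>::creator_;

int main() {
    // Default: the production implementation.
    std::unique_ptr<BgpSessionManager> real(ObjectFactory<BgpSessionManager>::Create());

    // Override for tests: subsequent Create() calls return the mock.
    ObjectFactory<BgpSessionManager>::Register(
        [] { return new BgpSessionManagerMock(); });
    std::unique_ptr<BgpSessionManager> mock(ObjectFactory<BgpSessionManager>::Create());
    return 0;
}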
Unit tests are executed using the Google C++ Testing Framework. Mock objects are built either manually, when trivial, or by using the corresponding mocking framework (Google Mock).
The BGP test suite can be executed via the build target src/bgp:test. Code coverage can be generated by passing the flag --optimization=coverage to the build tool (scons).
BGP protocol messages are decoded and validated before being placed in the routing table operations queue. We use a mini-DSL, implemented as C++ templates, in order to decode and encode BGP messages.
This DSL is generic and can be used to implement encoding and decoding for different TLV-style protocols.
The BGP grammar is declared in bgp_proto.cc. Each node in the grammar declares either a "Sequence" of elements to parse or a "Choice" between multiple elements. Simple elements store and retrieve the message data to/from an in-memory data structure that is then used internally. For example, the following grammar node stores the autonomous-system parameter in an OPEN message in the "as_num" field of an object of type "BgpProto::OpenMessage":
class BgpOpenAsNum : public ProtoElement<BgpOpenAsNum> {
public:
    static const int kSize = 2;
    typedef Accessor<BgpProto::OpenMessage, uint32_t,
                     &BgpProto::OpenMessage::as_num> Setter;
};
Complex elements (Sequences and Choices) can then include simple elements. The example element above is used in the OPEN message definition with the following:
class BgpOpenMessage : public ProtoSequence<BgpOpenMessage> {
public:
    typedef mpl::list<BgpOpenVersion,
                      BgpOpenAsNum,
                      BgpHoldTime,
                      BgpIdentifier,
                      BgpOpenOptParam> Sequence;
    typedef BgpProto::OpenMessage ContextType;
};
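To make the role of the Setter concrete, the sketch below shows one way a fixed-size, big-endian element could store its value into a caller-supplied context structure. It is a self-contained approximation (FixedField, OpenMessageContext and Parse are invented for illustration); the actual template machinery in bgp_proto is more general and chains elements through mpl::list:

#include <cstddef>
#include <cstdint>

struct OpenMessageContext {
    uint32_t as_num;
    uint16_t hold_time;
};

// Reads kSize bytes in network byte order and writes them via a member pointer,
// refusing to read beyond the supplied buffer.
template <typename Ctx, typename FieldType, FieldType Ctx::*Member, int kSize>
struct FixedField {
    static int Parse(const uint8_t *data, size_t size, Ctx *context) {
        if (size < static_cast<size_t>(kSize)) return -1;  // never read past the message boundary
        FieldType value = 0;
        for (int i = 0; i < kSize; ++i) {
            value = static_cast<FieldType>((value << 8) | data[i]);
        }
        context->*Member = value;
        return kSize;  // bytes consumed
    }
};

typedef FixedField<OpenMessageContext, uint32_t, &OpenMessageContext::as_num, 2> OpenAsNum;

int main() {
    const uint8_t wire[] = { 0xfd, 0xe8 };  // AS 65000 encoded in 2 bytes
    OpenMessageContext ctx = {};
    return (OpenAsNum::Parse(wire, sizeof(wire), &ctx) == 2 && ctx.as_num == 65000) ? 0 : 1;
}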
The objective behind this approach is to eliminate trivial mistakes such as reading beyond the message boundary. When TLV decoding code is written by hand, it is necessary to add simple validation checks throughout the parsing code for every single component. With an automated approach it is possible to ensure consistency.
An important requirement for the DSL is to allow the programmer to define the in-memory data structure used to store the resulting data. The concern is that the DSL should not force the programmer to copy data manually between a data structure defined by a library or code generator and a data structure that is convenient for the code that processes the message.
Routing tables are in-memory data structures organized such that one module of code can manage the contents by performing add/change/delete operations while multiple modules can be "listeners" to table events. This is often referred to as the "observer pattern". Routing protocol implementations commonly use this approach; it is used by GateD and its derivatives, for instance. Contrail implements this pattern via the db library, which is used by several different modules such as the configuration management code in both the control-node and compute-node agent.
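As a sketch of this listener ("observer") mechanism, assuming simplified types rather than the actual db library API, the idea is a single writer notifying a list of registered callbacks:

#include <functional>
#include <string>
#include <vector>

struct RouteEntry {
    std::string prefix;
    bool deleted;
};

// Simplified sketch of the observer pattern used for table change notification;
// the real db library adds partitions, change-list batching and task contexts.
class Table {
public:
    typedef std::function<void(const RouteEntry &)> Listener;

    // Modules such as update generation or VRF import/export register here.
    void RegisterListener(Listener listener) { listeners_.push_back(listener); }

    // The single manager of the table performs add/change/delete and then
    // notifies every listener in sequence.
    void Notify(const RouteEntry &entry) {
        for (const Listener &listener : listeners_) listener(entry);
    }

private:
    std::vector<Listener> listeners_;
};

int main() {
    Table table;
    table.RegisterListener([](const RouteEntry &entry) {
        // e.g. update generation reacting to a route change
    });
    table.Notify(RouteEntry{"10.1.1.0/24", false});
    return 0;
}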
The Contrail "db" library supports the notion of a DBTable that has multiple partitions (or shards) that correspond to multiple datasets. For BGP routing tables, the partition is determined by hashing the NLRI prefix. The hash function is such that it ignores the RD component of L3VPN routes. This guarantees that an L3VPN route is in the same partition as the corresponding IP route, allowing VRF import and export processing to be done in the context of the same partition (dataset).
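The sketch below illustrates that property with hypothetical helpers (StripRouteDistinguisher and PartitionIndex are not the real functions): hashing an L3VPN prefix after stripping its route distinguisher yields the same partition as the embedded IP prefix.

#include <cstddef>
#include <functional>
#include <string>

// Strips a leading "ASN:nn:" style route distinguisher if present.
std::string StripRouteDistinguisher(const std::string &prefix) {
    size_t second_colon = prefix.find(':', prefix.find(':') + 1);
    return second_colon == std::string::npos ? prefix : prefix.substr(second_colon + 1);
}

size_t PartitionIndex(const std::string &prefix, size_t partition_count) {
    return std::hash<std::string>()(StripRouteDistinguisher(prefix)) % partition_count;
}

int main() {
    // The L3VPN route and the corresponding IP route land in the same partition,
    // so VRF import/export can run in that partition's context.
    bool same = PartitionIndex("64512:7:10.1.1.0/24", 16) == PartitionIndex("10.1.1.0/24", 16);
    return same ? 0 : 1;
}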
Inbound messages are placed into a workqueue associated with a database partition. The workqueue deals with concurrency between the input processing and routing table stages. The table's Input method is called whenever there is work to be processed; it converts the incoming message into a table Add/Delete/Change operation.
The "db" library will then notify any listeners of the particular table and present them with a change set. The change notification calls all listeners of a particular table in sequence. This is where operations such as "Update generation" and VRF import/export are performed.
For each routing table, the implementation creates a RibOut object for each different "Export Policy", where the concept of export policy is the set of factors that influence how updates are calculated. CPU affinity can also be part of the export policy in order to force additional parallelization of update dequeuing.
Whenever there is a change in a routing table entry, the BgpExport::Export method is called for each RibOut. This call occurs under the context of the "Routing Table Operations" task and the respective table partition. The Export method calculates the desired attributes for the route and manipulates the update queue associated with the RibOut.
The update manipulation code ensures that there is a single update enqueued for an NLRI at any given time, removing any previously enqueued updates for a prefix that is changing. Changes that are redundant are also suppressed. This is achieved by comparing the desired state for the route to the state last advertised to the peer. Being able to suppress pending updates can reduce the number of messages sent to a slow receiver.
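A simplified, single-queue sketch of that behavior (the UpdateQueue type and its string-valued attributes are invented for illustration): at most one pending update per prefix, and updates that match the last advertised state are dropped.

#include <map>
#include <string>

class UpdateQueue {
public:
    // Returns false when the update is suppressed as redundant.
    bool Enqueue(const std::string &prefix, const std::string &attributes) {
        std::map<std::string, std::string>::iterator it = last_advertised_.find(prefix);
        if (it != last_advertised_.end() && it->second == attributes) {
            pending_.erase(prefix);      // desired state already on the wire
            return false;
        }
        pending_[prefix] = attributes;   // replaces any previously queued update
        return true;
    }

    void MarkSent(const std::string &prefix) {
        last_advertised_[prefix] = pending_[prefix];
        pending_.erase(prefix);
    }

private:
    std::map<std::string, std::string> pending_;          // at most one entry per prefix
    std::map<std::string, std::string> last_advertised_;  // kept per peer in reality
};

int main() {
    UpdateQueue queue;
    queue.Enqueue("10.1.1.0/24", "nexthop=A");
    queue.Enqueue("10.1.1.0/24", "nexthop=B");  // overwrites the pending nexthop=A update
    queue.MarkSent("10.1.1.0/24");
    return queue.Enqueue("10.1.1.0/24", "nexthop=B") ? 1 : 0;  // suppressed: already advertised
}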
BGP peers are assigned to a SchedulingGroup. This is a group of BGP peers which advertise the same RibOuts. A scheduling group can be processed in parallel with other scheduling groups. Concurrency is managed by the RibUpdateMonitor class. The update queue has multiple producers (one per routing table partition) and a single consumer: the scheduling group task that dequeues updates.
Once an update is dequeued the corresponding message is formatted and written to the TCP socket of each peer. When a TCP socket is full, the respective peer is taken out of the current send set and a marker is added to the queue. Once the socket becomes available for writing again the send processing resumes from the last enqueued marker.
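A highly simplified sketch of the marker idea follows (PeerSendState and its methods are hypothetical): a peer whose socket fills up records its position in the stream of formatted updates and, once writable again, catches up from that position rather than from the head of the queue.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

class PeerSendState {
public:
    void Append(const std::string &update) { log_.push_back(update); }

    // Socket full: drop the peer from the send set and remember where it stopped.
    void Block(const std::string &peer, size_t position) { markers_[peer] = position; }

    // Socket writable again: replay everything from the peer's marker onwards.
    std::vector<std::string> Resume(const std::string &peer) {
        size_t start = markers_[peer];
        markers_.erase(peer);
        return std::vector<std::string>(log_.begin() + start, log_.end());
    }

private:
    std::vector<std::string> log_;            // updates already formatted for the RibOut
    std::map<std::string, size_t> markers_;   // blocked peer -> resume position
};

int main() {
    PeerSendState state;
    state.Append("update-1");
    state.Block("peer-a", 1);   // peer-a wrote update-1, then its socket filled up
    state.Append("update-2");
    state.Append("update-3");
    return state.Resume("peer-a").size() == 2 ? 0 : 1;  // resumes with update-2, update-3
}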
When used in the control-node process, BGP derives its internal configuration from the network configuration distributed by the IFMAP server. This functionality is performed by the BgpConfigManager class.
Unit tests often use an XML configuration parser to set up the desired state of the BGP server. The BgpConfigParser class translates an XML file into a set of IFMAP identities and metadata as expected by the configuration manager; tests can also replace the default config manager.
The control-node process (which includes BGP) runs a web server on port 8083. BGP diagnostics information is available via this web interface.