System Design: Event-Driven Architecture

8 min readNov 20, 2022

The event-driven architecture is an asynchronous architecture style suitable for highly scalable and high-performance distributed systems. It uses an event-based model rather than a request-based model, and the difference is described in the next section.

Request-Based vs. Event-Based Models

A request-based model uses a request orchestrator to synchronously receive requests and direct them to the request processors. The request processors handle the requests and either retrieves or updates information in a database.

An event-based model consists of event processing components that react to events instead of requests. Events are certain situations that occur when some condition is met. For example, when enough data is produced by a data producer to batch it up for processing and storage.

Generally, the request-based model is better for well-structured requests that require certainty in response. The event-based model is better for flexible requests that require high responsiveness and complex processing. In many cases, request-based and event-based architecture are used in conjunction in what’s known as hybrid architecture.

Topology

There are two major topologies of event-driven architecture: the mediator topology and the broker topology. The architecture characteristics and trade-offs are different between these two topologies, and the best choice depends on the specific situation.

Broker Topology

The broker topology is the simpler of the two. It’s most useful when the event processing flow is relatively straightforward, and there is no need for centralized event orchestration and coordination. It uses a message queue to direct messages from the initiating event to the distributed event processor components in a chain-like broadcasting manner.

There are four major components in the broker topology:

Initiating Event: the event that starts the entire events flow
Event Broker: the channel for queueing up events for processing
Event Processor: accepts the initiating event and performs a specific task associated with processing this event
Processing Event: the event produced by the event processor to signal to the rest of the system that the initiating event has been processed. This event can then be picked up by other event brokers as their initiating event

Because the entire architecture uses decoupled event brokers that handle events for their specific domain in an asynchronous manner, and events are published in a fire-and-forget and broadcasting manner, the common approach is to use topics in a Pub-Sub style messaging model.

The best practice when publishing events in to broadcast them to the entire system, even if some event processors do not care about some of the events. This is to ensure that messages are not lost, and allows the system to be more extensible in adding new event processors, because these new event processors would receive all events by default.

In the broker topology, all of the event processors are decoupled and independently of each other. This allows each event processor to scale independently based on the varying load conditions for the types of events that it processes. This decoupled nature does make error handling more difficult, because event processors are not aware of each other, so if a failure occurs in one event processor, other event processors would not be aware of the problem and may make mistakes as well (e.g. decrement the count of a certain product in the inventory after a sale, but the order confirmation event processor failed so the sale did not complete).

Mediator Topology

The mediator topology addresses the drawback of the broker topology described above. The core concept of this topology is that events are centrally managed and delivered to the event processors.

There are five major components in the mediator topology:

Initiating Event: the event that starts the entire events flow
Event Queue: accepts incoming initiating events and queues them up to be processed by the event mediator
Event Mediator: generates processing events to be sent to the dedicated event channels in a point-to-point messaging fashion
Event Channel: accepts incoming processing events and queues them up to be processed by the event processor
Event Processor: performs a specific task associated with processing this event

The major differences from the broker topology are:

Initiating events are initially processed by a centralized event mediator rather than by individual event processors
Event processors do not broadcast the outgoing events to all other event processors
The event mediator maintains event state and can perform error handling, whereas in the broker topology, event processors create events in a fire-and-forget manner
Events tend to be commands (e.g. place-order, send-email), whereas events in the broker topology tend to be statuses (e.g. order-placed, email-sent)
Events are always expected to be processed

Choosing an event mediator depends on the complexity of the conditional handling and error handling with respect to processing the initiating events. In some cases, the architecture consists of a simple event mediator that sends some events directly to the event processors, and sends other events that require further mediating to a complex event mediator. It is also common to have multiple event mediators scoped by specific business domains.

Although the mediator topology offers better error handling than the broker topology, one drawback is that it relies on a centralized event mediator, which sometimes becomes the bottleneck in the entire system and makes scaling the entire system more difficult. Another drawback is that performance is hindered by the event mediator.

Key Properties

Asynchronous Communications

The event-driven architecture uses asynchronous communication for all requests and replies in the system. This means that when a user makes a request to the system (e.g. post a comment on a website), the request does not have to be fully processed in order to respond back to the user. The user is given an acknowledgement as soon as the request is received, and the system will asynchronously process the request. From the user’s perspective, the request is complete as soon as the acknowledgement is received. This method of communication increases the responsiveness of the system without needing to increase the performance. Even if processing the request takes a long time, the user still sees a very responsive application. The drawback, however, is the complexity of error handling, which is discussed next.

Error Handling

In an event-based architecture, events are passed from the event producer to the event consumer asynchronously. If the event consumer experiences an error while processing the event, it needs to provide feedback to the system, and move on to process subsequent events. If the event consumer spends time handling the error, it would impact the overall responsiveness of the system.

Errors are usually propagated to a workflow processor that handles these errors. It may choose to programmatically make changes to the original message data using some kind of anomaly detection and repair algorithm, and send the message back into the original event queue. If the workflow processor cannot determine what is wrong, then it could send the message to a dedicated dashboard that tracks these errors and presents them to humans to fix.

One problem with this error handling mechanism is that events that are resubmitted into the queue would be processed out of order. In some applications this may not matter. If the order does matter (e.g. financial transactions), then a more complex error handling mechanism is needed to make sure events within a certain context (e.g. for the same bank account) are processed in order, which means events in the same context would be paused until the error is resolved.

Data Loss Prevention

Sometimes, messages may get dropped and never reach the intended destination. Messages may be dropped in many different places, such as:

After an event processor publishes an event, the event is dropped before reaching the event channel
An event is dropped when an event processor de-queues it from the event channel but crashes while processing the event
An event processor is unable to persist the message to the database

Issue 1 can be resolved by using persistent message queues and synchronous send. Persistent message queues support guaranteed delivery, meaning that the message is not only stored in memory in the message broker, but is also persisted to disk. That way, if the message broker goes down, messages are not lost, and they can be retrieved when the message broker comes back up. Synchronous send means that event producers are blocked from sending more messages until the previous one has been persisted.

Issue 2 can be resolved by using the client acknowledge mode, which means that messages are not removed from the event channel until the event consumer acknowledges that it has been processed successfully.

Issue 3 can be resolved by using ACID transactions in database commits. This means that once a database commit happens, the data is guaranteed to have persisted in the database. Since the database commit is the last participant in the entire process, we are safe to remove the message from the persistent message queue, and this mechanism is called last participant support.

Request-Reply

The nature of asynchronous communication is that requests do not receive a reply. However, sometimes a reply is desirable in an event-driven architecture, such as a confirmation that some process is complete. In that case, request-reply messaging, or pseudo-synchronous communication can be used. This is achieved by having two queues in the event channel: a request queue and a reply queue. The initial request is sent to the request queue. The event consumer of the request queue processes the request and sends a reply message to the reply queue. The event producer of the request then receives the reply and understands that the requested action has been complete.

Trade-Offs

Advantages

Event-driven architecture has very high performance due to its asynchronous communications and highly parallel processing.
It achieves high scalability because event processors can be easily scaled independently of each other.
The highly decoupled and asynchronous nature makes this architecture highly fault tolerant. This is because events do not need to be processed immediately, so if an event processor goes down, it can continue to process the queue after it comes back up.
This architecture is highly evolutionary, meaning new features can be easily added through existing or new event processors.

Disadvantages

This architecture can be very complex and expensive to implement.
Testability is relatively low because the event flows are not deterministic and the events can be very dynamic. The event flow could potentially become very complex, which makes this architecture difficult to test and maintain.