Designing a library for inter-service communication in SOA

Anton Mishchuk · Published in Nerd For Tech · Aug 6, 2021

My previous article was about the additional complexity that appears when one splits a monolithic system into parts. The basic idea is that, regardless of the underlying communication infrastructure, an SOA system has an additional “communication complexity”, which is basically the extra knowledge about the interfaces of the services. If, for example, service A communicates with services B and C, an engineer responsible for A must have partial knowledge about B and C: she knows their interfaces, she is informed about changes, maybe she even has a grasp of some implementation details. Each pair of communicating services contributes to the overall complexity of the system. In a big software system with a dense communication network this leads to two big problems:

  • Lots of complexity resides in the communication layer and it increases quickly when adding new services.
  • Engineers “can’t see the forest for the trees” — they operate with service-level models only and can’t see a big picture of the system as a whole.

Obviously, isolation of services has lots of advantages; if it didn’t, SOA would make no sense at all. The main benefit is the encapsulation of a part of the system into a separate unit with a thin interface to the rest of the system. Thus, a group of engineers may focus on the component’s internals, with minimal interference with the rest of the system. They have an understandable scope of work, can deliver faster, deploy independently, and so on. Being isolated from the system, engineers have flexibility in design decisions, not only in terms of organizing code but also in terms of the “ubiquitous language” they create in the service’s “bounded context”. The main problem with this service-specific vocabulary is that it might evolve too independently of the other vocabularies in the system. After a couple of years of such evolution, one may notice that similar abstractions have different names and behavior in different services, that there are data duplications between services, and that “translations” between these ubiquitous languages have become more sophisticated. Moreover, a common “system language”, which existed in some form in the initial monolith, gradually disappears from the code and moves to the documentation or just to the minds of a couple of experienced engineers and product owners.

Therefore, there are two facets of the solution to this “communication complexity” problem. The first one is about designing a communication layer in a way that reduces and simplifies “translations” between the “ubiquitous languages” of different parts. The second one is about a common ubiquitous language, a system-level ubiquitous language or, simply, a system vocabulary.

As discussed in the previous article, having a common, well-designed approach to communication inside a system may help with both aspects. And this article is about a library (package, gem, whatever) that serves two purposes:

  • Defines a system-level model: all the abstractions (data structures and commands) that describe the system’s functionality as a whole.
  • Encapsulates the communication layer providing a universal interface to the system.

A bit of context

It is much easier to explain the idea using real-world examples. I was part of a great team that designed and developed a system that automates an insurance agency. The rest of the article is about the challenge we had with our SOA system, which was growing too fast from the very beginning. It started as a monolithic Rails application that helped insurance agents manage customers: requesting insurance quotes for them and issuing insurance policies. Then a separate Distribution project appeared, which was used for managing customers’ data and for integrations with big enterprises for specific distribution channels. Then the first application was split into two: the Agency Management Platform (AMP) and integrations with insurance carriers (Quoting). Over time, as new distribution sources appeared or new third-party services were integrated, the system grew to around 10 different services.

But since it happened quite quickly, within 1–2 years, not all things went smoothly.

First, data duplication. Lots of information was in some way duplicated among different parts of the system. For example, basic abstractions like “person”, “property”, “lead”, “quote”, and “insurance” existed in almost every service. And each service presented, stored, and managed the corresponding data in its own “bounded context”. Such duplication wasn’t a big issue for the system’s functionality, but it definitely brought confusion when one tried to understand the details. The problem became more evident when we started adding a business intelligence layer to the system. It was hard to properly aggregate data from different services. We even introduced a concept of “gid” (global id) for some entities in order to simplify the process of linking dispersed data.

Second, non-standardized APIs between the services. Most API interfaces were designed ad hoc for each pair of communicating services. There were no client libraries; each service had to handle JSON data on its own. Only documentation, periodic discussions, and detailed testing helped to avoid mistakes.

And, finally, gradually decreasing engineering awareness. With the growing system and the increasing number of engineers, specialization was a natural process. An engineer was aware of the 2–3 services he or she actively worked on and had quite a vague idea about the other parts of the system.

But there was good news too. All services in the system were written in one programming language — Ruby. The services communicated mostly synchronously via REST API calls, with a very small fraction of asynchronous events. And the whole engineering team was relatively small at that time (about 10 engineers), so at least human-to-human communication went smoothly.

Plans for huge refactoring

The whole refactoring epic consisted of three big parts. First, we had to reorganize data; then the library itself had to be designed and implemented; and, finally, it had to be integrated into all the services. Each phase required lots of effort. The first task (reorganizing data) affected the data layer of almost every service in the system. And the last one (integrating the library) required significant changes, or even reimplementation, of some services, since a large part of the domain and application layers of every single service had to be changed.

The library itself is relatively simple from the technical point of view, but since it implements an interface to every part of the system, you can imagine how many meetings with all stakeholders it caused.

The good news is that you don’t need to refactor the whole system at once. It is possible (and recommended) to start from a small service in your system: define data that the service is responsible for, refactor its interface a bit, implement related “models” and “actions” in the library, and use the library in all the other services that need to communicate with the service. It should be an evolutionary process for the existing parts of the system, but, if you are going to build a new service, it’s better to use the new approach from the very beginning.

But let’s review each step in detail.

Put your data in order

The data related to each particular business entity must be stored and managed in the corresponding service. The service has complete responsibility for the entity’s data and therefore becomes the single source of truth about the actual state of the entity (the “master service” for the entity). Other services (“consumers” of the entity) initialize it and make changes only via the responsible master service. In some cases, a consumer may keep local copies of some of the entity’s data (like a cache) or even extend the entity with additional attributes that are relevant only in its bounded context. But all these modifications should be encapsulated inside the consumer service and shouldn’t be exposed to the system. Any changes that are visible to the rest of the consumers must be done via the master service.

Such a separation of data might be quite a complex problem and involves all stakeholders, both engineers and product owners. To approach the problem, one should keep in mind that in a software system there are always two kinds of data. First are “external data” — data that represent the “world” around the system. Good examples in our system are “person” and “property”. People (and their properties) exist irrespective of the system, so the best the system can do with such data is to make sure they correspond to the reality of the outside world. And “internal data” are all the other data, generated by the system itself. It doesn’t mean that these entities have nothing in common with the “real” world. They do: abstractions like “lead” or “insurance request” definitely exist in the business domain of the system. But these data are the products of the system and the business that operates the system.

Let’s take a detailed look at the “person” — “candidate” — “lead” trio in our system. The business logic is the following. A customer (person) may be interested in buying a new insurance policy and submits a form on a company page. Two entities appear in the system: “person” and “candidate”. Then the system checks if it is possible to find a better (cheaper) policy for the customer and, if so, the “candidate” transforms into a “lead” and goes to another service in the system. Thus a customer as a human being is represented as a “person” in the system, and this is “external data”. At the same time, “candidate” and “lead” are entities created by the system.

Actually, dealing with “internal data” in our system was not a big problem. We had a very clear separation of responsibility between services, and therefore these data were stored and managed in the corresponding service from the start. The real problem was related to the “external data”: person, property, etc. These entities were represented (in different ways) in almost every service in the system. And the solution was quite straightforward — we just moved all these “external” abstractions into one single service — “origin”. The “origin” service is not just simple storage of customers, properties, and other entities. In fact, it is one big model of the outside world for the whole system. All the entities, the relationships between them, and the interfaces for accessing and modifying them are in one single place.

After the data migration is completed, one has a good foundation for defining the basic data structures in the system. Then they will be placed into the library to be observable and accessible inside any service of the system. These data structures, their names, and the names of their attributes will become the “nouns” of the system and cover the whole set of concepts, entities, states, etc.

It is most likely that after the reorganization of data the interfaces of services will also require revision. And, most probably, they will become simpler. Having each entity in its own place, you may refer to it using an id or uuid. So, if previously you had to pass lots of entity attributes between services, now you can just pass their ids, as in the sketch below. A service can then fetch all the necessary data by itself. There is, of course, a performance-simplicity tradeoff, but we are on the simplicity side in this article.
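For example, a payload that used to carry nested entity attributes shrinks to a couple of identifiers (a minimal sketch; the field names and values here are hypothetical):

# Before: the consumer sends a copy of the entities' attributes.
payload = {
  person:   { email: 'jane@example.com', first_name: 'Jane', last_name: 'Doe' },
  property: { street: '1 Main St', city: 'Springfield' }
}

# After: the consumer sends only ids; the receiving service
# fetches the person and the property from their master service.
payload = { person_id: 123, property_id: 456 }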

Remember that we are going to encapsulate the whole communication inside the Ruby library, and, in the end, we will communicate using plain Ruby objects and their methods. Of course, under the hood it will be (mostly) REST API calls, which have different semantics (URIs, HTTP verbs, status codes). These nuances should be taken into consideration, but, first of all, one has to focus on Ruby and human language when choosing names for objects and methods in the library. REST semantics have lower priority; they are just a part of the infrastructure layer that will be hidden inside the library. That is basically it. Having properly organized data and corresponding interfaces, we are ready to create a library that wraps all that in simple and beautiful Ruby objects.

Meet Palantir

Let’s recap the basic idea behind the library. There are two main purposes: define system-level models and encapsulate inter-service communication.

The structure of the library is illustrated below:

[Image: high-level structure of the library]

“Models” and “validators” serve the first purpose, while the rest of the abstractions relate to the communication layer.

Models

Models are data structures with simple validations. All the model classes inherit from Palantir::BaseModel.

module Palantir
  class BaseModel
    include ActiveModel::Validations
    include ActiveModel::Serializers::JSON

    attr_accessor :id
    # ...
  end
end

So, the only extra functionality is validations and serialization. And each model is actually a set of attributes with corresponding validations.

module Palantir
  class Person < Palantir::BaseModel
    ATTRIBUTES = %i[email first_name last_name ...].freeze
    attr_accessor(*ATTRIBUTES)

    validates :email, format: Validations::Patterns::EMAIL
    validates :first_name, presence: true, on: :create
    validates :last_name, presence: true, on: :create
  end
end

The interface of a model consists of the “build” class method (inherited from Palantir::BaseModel), which builds the model from the attributes, and several instance methods for validation and serialization.

The idea of keeping a tiny validation layer in Palantir is twofold. First, the validation statements provide additional information about the possible values of the attributes; in other words, they define the type more strictly. Second, these validations are applied before each request and, if there are errors, the request will not be sent and an error will be returned.

The ActiveModel::Serializers::JSON module just simplifies the JSON serialization routines.
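To illustrate, here is a minimal usage sketch (the attribute values are made up; valid?, errors, and as_json come from the included ActiveModel modules):

person = Palantir::Person.build(email: 'jane@example.com', first_name: 'Jane')

person.valid?(:create)      # => false, last_name is missing
person.errors.full_messages # => ["Last name can't be blank"]

person.last_name = 'Doe'
person.valid?(:create)      # => true
person.as_json              # => a plain hash, ready to be sent over the wire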

Actions

“Action” is a common word for each communication act in the system. Its meaning differs a bit from “request”. A request is more about the communication between two particular services. But since Palantir conceals the communicating parties, it’s better to use the word “action” and think about communication as an action (a business action) in the system.

There are both synchronous and asynchronous actions in the system, and we introduced a strict separation between these two types of communication. Synchronous requests are the basic way of communication: the business logic is built on top of them, and any “significant” changes to a business entity are done synchronously. Asynchronous messages just inform the system about “minor” changes. These messages also require some processing in one or more services, but they don’t affect system behavior much. For sure, the distinction between significant and insignificant operations should be made on the business level, but a good rule of thumb is that the system should remain functional with synchronous messages alone, while asynchronous ones handle non-critical aspects. Or, more technically, one may ask two questions: do we need to be sure of successful message processing, and what happens if the message is lost? If you don’t care about the result of the communication, and system consistency will not be broken if the message disappears somewhere on its way, then an asynchronous request is the right choice.

There are two types of synchronous actions in the system: “question” and “command”. The idea behind the separation is to keep the semantics of the GET (questions) and POST/PUT/DELETE (commands) HTTP requests. A question expresses the need for some data — a service “asks” the system about the data. Commands, in turn, tell the system to do something and/or change data in the system. In terms of data flow in the system, questions are about data flowing from the system to a service, while commands send new data to the system.

And “statements” represent asynchronous communication.

The interface of “actions” mimics a common approach to RPC (remote procedure call): each action has corresponding “request” and “response” objects, which are composite objects containing the entities needed for performing the communication. Each action is a class with just one “call” method, which accepts a “request” object and returns a “response” object. Here is an example for the Palantir::CreateProperty command.

# the request initializer yields the new object to the block
request = Palantir::CreateProperty::Request.new do |request|
  request.property = new_property
  request.property_address = new_property_address
end

response = Palantir::CreateProperty.(request)
response.class # => Palantir::CreateProperty::Response

Note the naming convention for commands: a verb with an object (noun) — the minimal semantic information needed to understand the command’s purpose.

The request object contains the logic for validation and serialization of data:

module Palantir
  class CreateProperty < BaseOrganizer
    class Request < BaseCommandRequest
      SERIALIZER = PropertySerializer
      RELATIONS = {
        property: { required: true },
        address: { required: true }
      }.freeze
      # ...
    end
  end
end

The command itself defines all the steps needed to perform the action and also knows where to send the data:

module Palantir
  class CreateProperty < BaseOrganizer
    organize Common::SetDefaults,
             Common::Validate,
             Common::Serialize,
             DoRequest,
             Common::Deserialize,
             Common::CheckStatus,
             Common::BuildProperty,
             Common::BuildAddress

    def self.endpoint
      'properties'
    end

    def self.location
      Locator.origin_location
    end

    def self.call(request)
      # ...
      super(request: request, organizer: self).response
    end
    # ...
  end
end

We use the great interactor library for defining the intermediate steps, and one can see what actually happens under the hood. First, some default values may be populated on the action’s objects. Then (client-side) validations happen and, if something is wrong with the data, an error response is returned immediately. If everything is fine, the actual HTTP request is performed (Common::Serialize -> DoRequest -> Common::Deserialize). Common::CheckStatus verifies the status of the response and may stop the pipeline with an error response. And, finally, the two last steps build the Property and Address objects.
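To give a feeling of what such a step looks like, here is a plausible sketch of the validation step built with the interactor gem (the context keys and the failure shape are assumptions, not the library’s actual code):

module Palantir
  module Common
    class Validate
      include Interactor

      # Halts the organizer pipeline if the request is invalid,
      # so no HTTP call is made for bad data.
      def call
        request = context.request
        return if request.valid?

        context.fail!(errors: request.errors.full_messages)
      end
    end
  end
end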

The response class defines the data one must expect in the command output. In the given example it is quite simple:

module Palantir
  class CreateProperty
    class Response < Palantir::BaseResponse
      attr_accessor :property, :address
    end
  end
end

So one may be sure that the response object will have the “property” and “address” objects, which are built after the command has succeeded:

response.property # => Palantir::Property
response.address  # => Palantir::Address
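Questions follow the same request/response conventions. A hypothetical GetProperty question might look like this (the name and fields are illustrative, not from the real library):

request = Palantir::GetProperty::Request.new do |request|
  request.property_id = property_id
end

response = Palantir::GetProperty.(request)
response.property # => Palantir::Property, fetched from the "origin" service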

That is basically it for the main abstractions in the library: “models”, “actions”, and “requests/responses”. There are a couple of utility components (RestApiClient, MqClient, Locator) that implement low-level communication and service-discovery logic, but these are not worth covering in this article.

The library inside a service

When you integrate the library into a particular service, a question appears: may one use the library models directly in the service’s domain/application layers, or should they be wrapped in other objects? For simple use cases, it is ok to use the library classes directly. In more complex cases, inheritance or composition may be used, as sketched below. But, in general, the library is a part of the system codebase, not a third-party one, so it should be considered a “native” part of the service code.
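Both options are straightforward (a sketch with hypothetical service-local classes):

# Inheritance: extend the system-level model with service-local behavior.
class Customer < Palantir::Person
  def full_name
    "#{first_name} #{last_name}"
  end
end

# Composition: keep the library model inside a service-local object.
class LeadCard
  def initialize(person)
    @person = person # a Palantir::Person
  end

  def contact_line
    "#{@person.first_name} <#{@person.email}>"
  end
end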

Other difficulties relate to the communication layer: every action takes some time to complete, and a request may fail (a network issue, or the other server has crashed). For sure, slow requests should be handled in some way on the service side, and one may think about concurrent processing of such actions (e.g. background processing). A better solution is to change the logic and do a couple of quick requests instead. For example, instead of a DoSlowRequest command, it is better to start the process and then get the result when it is ready: a StartProcess command, a ResultReady statement, and a GetResult question, as in the sketch below.
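A sketch of that decomposition (the action names come from the example above; the request fields and response accessors are assumptions):

# 1. Kick off the long-running work with a quick synchronous command.
start_request = Palantir::StartProcess::Request.new do |request|
  request.process_params = params
end
process_id = Palantir::StartProcess.(start_request).process_id

# 2. When the work is done, the processing service emits a ResultReady
#    statement; the consumer handles it asynchronously and then...

# 3. ...fetches the result with a question.
result_request = Palantir::GetResult::Request.new do |request|
  request.process_id = process_id
end
result = Palantir::GetResult.(result_request).result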

Final remarks

System domain layer and common ubiquitous language

In terms of layered architecture, Palantir partially occupies the infrastructure and domain layers. On the infrastructure level, the library is responsible for delivering messages between the services. On the domain level, it represents a “system domain layer” — the common part of the domain that represents the basic concepts of the business. The domain layer is the most important part since it defines the “ubiquitous language” that everyone will follow, and it’s hard to change in the future. Therefore, having the system domain layer as a separate piece of software that can be reused in every part of the system simplifies lots of things. Engineers may just focus on the application layer — the set of tasks or jobs that should be performed on top of the system domain layer.

Each service extends the system domain layer with its own abstractions for specific business needs. If the new concepts are just local ones (make sense only in the service’s bounded context), they simply complement the system domain layer and exist in code alongside the library’s abstractions. But if a newly introduced concept makes sense for the whole system and will be used in other parts of the system, then it must be included in the library itself. In any case, both the “local language” and extensions to the “system language” are built on top of a well-defined “common ubiquitous language” that everyone speaks and understands.

Monolithization

One may argue that with the Palantir library approach we have actually returned to a monolith. That’s partially true. Now each service depends heavily on the common abstractions that permeate all the code. Any change in the library will lead to numerous changes in the corresponding codebases, slowing down system development and diminishing the advantages of the SOA approach.

But let’s review again at what level this coupling occurs. It is at the level of common data structures and actions — a well-discussed and well-designed core of the system. It’s a product of collaborative architectural decisions that are (as any such decisions) hard to change in any case. Such “system-level” coupling is impossible to avoid in SOA: splitting a system into parts decouples service-specific functionality, but there are always common things that affect two or more of the “decoupled” parts.

To put it philosophically, the understanding (perception) of the system as a whole (not just as a set of its parts) is not possible without the common abstractions that link its parts.

Consider a simple example: we need to introduce a new attribute in a common data structure, and the logic in many services should be changed to support the attribute. Or imagine a complex case when significant changes are made to an existing command interface. In both cases, the new functionality should first be implemented in the library, and then all the dependent services will be updated.

But without the library one would do the same. The new data structure would appear somewhere in the system, the interface of the responsible service would be changed, and then the code of all the other parts would be changed. Actually, one would go through the same actions even in a monolithic application.

Therefore, yes, the system is coupled by the library, but it is coupled “naturally”, based on common things, where it was and would have been coupled in any case. At the same time, without Palantir, the interdependencies inside our system were expressed somewhere in the documentation, API specifications, and kitchen conversations. Now they are explicitly defined in the code.

Heterogeneous systems

The described approach is possible in systems written in one programming language, but what if several languages are used in a system? There are a couple of possible scenarios.

The first, quite popular one is a system with a “core” language used in the most critical parts, plus several utility services written in other languages. In this case, it’s absolutely okay to have a Palantir-like library for the system core. One might even have rudimentary variants of it implemented in the other languages if necessary.

The second case is a “zoo-of-technology” or “tower-of-babel” system that can be described in only one language, English (mostly), and only in the documentation. It is probably possible to describe the core behavior with a single meta-language and automatically generate client libraries for each service (similar to the gRPC framework approach), but I have doubts that this would simplify things. The most appropriate solution for such cases is to create a single point of communication (like an API gateway) and “monolithize” the system through its interface. The interface of the gateway serves the same purposes as the library — all the common abstractions (nouns and verbs) are defined there, in one single place.

Another much-advertised approach is the Event Sourcing pattern, which basically does the same — it builds a common language on top of the communication layer. But it does this in a tricky way — by representing system logic in terms of “domain events” that occur in the system, with each service “reacting” to such events. The approach is more complicated since it brings a reactive paradigm to a system consisting of services with a (mostly) “active” approach inside. The language of events differs a lot from the RPC-like language used in the services. It’s more about changes inside the system that implicitly affect the state of the system, instead of declarative actions that change the state directly. This way of thinking differs a lot from what we have in conventional programming, so one should think carefully when choosing it.
