We need to talk, says one microservice to another
Ever wondered how we communicate? You would not believe how complex and multi-step a process it is. It involves some rather technical terms like perception, encoding, medium and decoding. Let us take a look at a diagram explaining this:
So what is the relation with microservices? Communication between two services is not much different. It follows a very similar process; in fact, it can be explained with the exact same steps! Consider two services: a ShippingService written in Java requests the preferred address of a user, identified by ID, from a UserAddressService over a REST call.
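To make the steps concrete, here is a minimal sketch of such a call using only the JDK (Java 11+). Everything specific in it is an assumption for illustration: the URL path `/users/{id}/preferred-address`, the JSON attribute `preferredAddress`, the sample address, and the tiny in-process stub that stands in for the UserAddressService.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

class AddressLookup {
    // Hypothetical stand-in for UserAddressService: encodes an answer as JSON and sends it back.
    static HttpServer startStubAddressService() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/users/", exchange -> {
            byte[] body = "{\"preferredAddress\":\"221B Baker Street\"}"
                    .getBytes(StandardCharsets.UTF_8);                // encoding
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);            // feedback
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    // ShippingService side: send the request over the medium (HTTP) and read the reply.
    static String fetchPreferredAddress(String baseUrl, String userId)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(baseUrl + "/users/" + userId + "/preferred-address"))
                .timeout(Duration.ofSeconds(2))   // never wait forever for feedback
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("unexpected status " + response.statusCode());
        }
        return response.body();                   // still raw JSON; decoding comes next
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startStubAddressService();
        try {
            String baseUrl = "http://127.0.0.1:" + server.getAddress().getPort();
            System.out.println(fetchPreferredAddress(baseUrl, "42"));
        } finally {
            server.stop(0);
        }
    }
}
```

Even in this toy version, each phase of the analogy shows up as a separate place where things can go wrong: the path, the attribute name, the timeout and the status code.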
Okay, so human communication can be used as an analogy to understand inter-service communication in a microservices stack. So what? Basically, it tells us that the failure scenarios can also be the same, and that we can understand microservice communication by relating it to human communication.
Why
Before we dive deep, we should quickly consider why this matters at all in microservices. It is basically the difference between inter-process and intra-process communication. Intra-process communication, meaning communication between components of a single process, is like talking to oneself. You understand yourself well (usually!), or at least there is a whole lot less ‘miscommunication’ when talking to oneself. This is the kind that happens in monoliths, which is why it was not much of an issue until we started working on microservices, which rely on inter-process communication. To relate to this, consider talking to other people instead: known or unknown (API, authentication etc.), of a different race, origin or nationality (perception), at a distance or near you (response time), on the phone vs in person (medium/network, protocol), direct vs indirect or at a different time (whoa! yeah, i.e. sync vs async, like leaving a note), speaking an unknown language (encoding-decoding: JSON, XML, or a binary format such as the one gRPC uses), of a different background or upbringing (perception again), in different mental states (well, stateless vs stateful services), and being told a secret or casual gossip (secure vs insecure). Now you know why working in teams is so difficult, and microservices too!
Types
Let us break down all the phases and see the types of potential problems. This is not some standard classification, but I have found this method effective when understanding the issues with analogies.
- Encoding: Failure to encode the object correctly, by missing attributes or encoding them under a different name. For example, frameworks in Java allow choosing different attribute names for the JSON/XML format; these are specified as strings, and there is no real validation on them other than tests. Also, how do we ensure that both services use the same message structure? This can in part be ensured by sharing a common object model library. But that goes against the principle of knowing models by views; not every service needs to know all the attributes of a Customer object! You might as well use queues for communication then.
- Sending the message: This covers the whole set of address-related issues that occur when the message is not sent on the correct protocol, address, port or path. Integration testing cannot always help here, as some of these depend on the production infrastructure as well.
- Decoding: Being the reverse of encoding, this step comes with all the same issues.
- Perception: The message is encoded and decoded correctly, yet the receiver understands something other than what the sender meant. These are some of the serious issues which you tend to miss even in unit and integration tests, and which can be caught only in later phases, when working against live services. If you encounter these, you can almost certainly assume that apart from inter-service, you also have some team communication issues.
- Feedback: One of the most important steps in communication is the feedback, the acknowledgement of receipt of the message. There can be plenty of causes for a service to not respond, including all of the issues discussed above, plus potential networking issues and issues related to the health of the service.
- Cascading Failures: A much more serious situation, which is merely the outcome of failing to identify the issues in service communication and to safeguard against them.
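The encoding and decoding failures in the list above can be sketched in plain Java. This is an illustration under assumptions, not any particular framework: the hand-rolled `encode` and `decode` helpers and the attribute names `preferredAddress` and `preferred_address` are hypothetical, standing in for the string-based names a JSON library would be configured with.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: sender and receiver disagree on one attribute name.
class AddressCodec {
    // Sender side: the attribute name is just a string.
    static final String SENDER_FIELD = "preferredAddress";
    // Receiver side: a different string, chosen independently.
    static final String RECEIVER_FIELD = "preferred_address";

    // Encode into a flat key=value;key=value message (a stand-in for JSON).
    static String encode(String userId, String address) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("userId", userId);
        fields.put(SENDER_FIELD, address);
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Decode the message back into a map of attributes.
    static Map<String, String> decode(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : message.split(";")) {
            String[] kv = pair.split("=", 2);
            fields.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return fields;
    }

    // The receiver looks the attribute up under the name *it* expects.
    static String readAddress(String message) {
        return decode(message).get(RECEIVER_FIELD);
    }

    public static void main(String[] args) {
        String wire = encode("42", "221B Baker Street");
        System.out.println("on the wire:   " + wire);
        // Nothing throws here; the receiver silently ends up with null.
        System.out.println("receiver sees: " + readAddress(wire));
    }
}
```

The uncomfortable part is that nothing fails loudly: both sides compile, both pass their own unit tests, and the address simply goes missing.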
Now that we are acquainted with the terms, I want to tell you a couple of stories; really scary, real stories.
Case study 1
I know of a Value Added Services (VAS) company; you know, the services you never want yet your carriers charge you for, yeah, those. This company was one of the few trying to build a useful service you may want to pay for, trying to play fair and by the rules, even at a revenue loss. The VAS industry suffers from a considerable amount of fraud, and so it was said that the biggest crime is charging your users twice. Such incidents would be immediately flagged as fraud by the carriers and the service would be banned.
We had a call flow where a “Billing Service” makes a call to a “Carrier Integration Service” for charging, which in turn, after firing the charging request to the carrier, makes a call to an external-internal Notification Service (a shared service hosted by a different unit) to send a notification to the user. It worked fine for years, until one day this external-internal service suddenly slowed down, timing out all the calls from the Carrier Integration Service, which in turn caused a timeout on the Billing Service, which treated it as a failure and ‘retried’ the billing call. The issue got flagged in minutes, but hundreds of users were charged multiple times that day before the team could respond.
A whole lot of things went wrong here. There were feedback issues, which likely occurred due to a network issue, giving rise to perception issues (a timeout was read as a failed charge) and finally causing a cascade. The worst thing, though, was the cascade.
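The chain of events can be reproduced in a few lines. The sketch below is a simulation under assumptions, not the original code: all class and method names are hypothetical, and the slow Notification Service is modelled as an exception thrown after the charge has already been fired.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the carrier: counts charges per user.
class Carrier {
    final Map<String, Integer> chargesByUser = new HashMap<>();

    void charge(String userId) {
        chargesByUser.merge(userId, 1, Integer::sum);
    }

    int chargesFor(String userId) {
        return chargesByUser.getOrDefault(userId, 0);
    }
}

// Signals that a downstream call took too long.
class DownstreamTimeout extends RuntimeException {
    DownstreamTimeout(String message) {
        super(message);
    }
}

class CarrierIntegration {
    final Carrier carrier;
    boolean notificationSlow;   // flips when the shared notification service degrades

    CarrierIntegration(Carrier carrier) {
        this.carrier = carrier;
    }

    void chargeAndNotify(String userId) {
        carrier.charge(userId);                 // the side effect happens first
        if (notificationSlow) {                 // then the notification call times out
            throw new DownstreamTimeout("notification service timed out");
        }
    }
}

class Billing {
    final CarrierIntegration integration;
    final int maxAttempts;

    Billing(CarrierIntegration integration, int maxAttempts) {
        this.integration = integration;
        this.maxAttempts = maxAttempts;
    }

    // Returns true if billing believes the charge succeeded.
    boolean bill(String userId) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                integration.chargeAndNotify(userId);
                return true;
            } catch (DownstreamTimeout e) {
                // Misperception: "timed out" is read as "not charged", so retry.
            }
        }
        return false;
    }
}

class DoubleChargeDemo {
    public static void main(String[] args) {
        Carrier carrier = new Carrier();
        CarrierIntegration integration = new CarrierIntegration(carrier);
        integration.notificationSlow = true;
        boolean reported = new Billing(integration, 3).bill("user-1");
        System.out.println("billing reports success: " + reported);
        System.out.println("actual charges: " + carrier.chargesFor("user-1"));
    }
}
```

With three attempts, billing reports a failure while the user has been charged three times, which is exactly the gap between feedback and perception.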
Case study 2
Another such story, not so grave, was when the Billing Service sent a request for partial charging, but the Carrier Integration Service refused to honour it. Both services used enums to identify the keyword used, both had integration tests, and all worked fine. It was only when the services were tested against working instances of one another, in a pre-prod environment on a crunch day, that the problem was identified. The issue was stupidly simple: the two services used different values for the same constant! Again, a whole lot of things went wrong here; the most important was the perception issue caused by miscommunication between the teams working on these two services.
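A minimal sketch of how two enums can look identical in code and still disagree on the wire. The enum names and the keyword values `PARTIAL` and `PART` are made up for illustration; the real values are not known.

```java
// Billing Service's view of the charge types (hypothetical values).
enum BillingChargeType {
    FULL("FULL"),
    PARTIAL("PARTIAL");

    final String wireValue;

    BillingChargeType(String wireValue) {
        this.wireValue = wireValue;
    }
}

// Carrier Integration Service's view: same constant names, one different value.
enum CarrierChargeType {
    FULL("FULL"),
    PARTIAL("PART");

    final String wireValue;

    CarrierChargeType(String wireValue) {
        this.wireValue = wireValue;
    }

    // Decode the keyword received on the wire; reject anything unknown.
    static CarrierChargeType fromWire(String value) {
        for (CarrierChargeType type : values()) {
            if (type.wireValue.equals(value)) {
                return type;
            }
        }
        throw new IllegalArgumentException("unknown charge type: " + value);
    }
}

class EnumMismatchDemo {
    public static void main(String[] args) {
        // Full charging works, so most tests stay green.
        System.out.println(CarrierChargeType.fromWire(BillingChargeType.FULL.wireValue));
        try {
            CarrierChargeType.fromWire(BillingChargeType.PARTIAL.wireValue);
        } catch (IllegalArgumentException e) {
            // Partial charging is refused only when the two services actually meet.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Each service's own tests use its own enum on both sides of the assertion, so the mismatch stays invisible until real instances talk to each other.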
The Solutions
Discussing every step taken to fix the case studies we saw would be a story in itself, so we shall discuss only the first solutions applied to the most glaring causes in both cases. For the cascade, we made it a point that every service would use Hystrix for every single inter-service communication call. Hystrix is a circuit breaker library, meaning it wraps a chunk of logic (say, a method) and on exception it can flag, throttle, block and bypass the method in question. The idea is to wrap the calls to external services, aka dependencies, and when something goes wrong, give them time to recover by bypassing the calls, which safeguards the calling service from cascading the issue. We had Hystrix in some services, but some teams had argued that it was an unnecessary complication for internal services and for services that had been working fine. Well, that was before the incident. Everyone jumped on it as the first change once we recovered from the impact. In my view, a circuit breaker is a tool microservices should never be built without. As an additional safety net, we also ensured that the Carrier Integration Service builds a temporary cache of all the users it has processed and validates against it before firing any call (not every solution to a problem is purely technical!).
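To show the idea behind a circuit breaker without pulling in a library, here is a deliberately small hand-rolled sketch. It is not Hystrix and does not use its API; the class name, the threshold and the open-window parameters are assumptions, and the code is not thread-safe.

```java
import java.util.function.Supplier;

// Minimal circuit breaker sketch (not Hystrix, not thread-safe).
class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long calls are bypassed once open
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    boolean isOpen(long now) {
        return openedAt >= 0 && now - openedAt < openMillis;
    }

    <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        return call(dependency, fallback, System.currentTimeMillis());
    }

    // Same as above with an explicit clock value, which keeps the sketch testable.
    <T> T call(Supplier<T> dependency, Supplier<T> fallback, long now) {
        if (isOpen(now)) {
            return fallback.get();        // bypass: give the dependency time to recover
        }
        try {
            T result = dependency.get();  // closed, or a trial call after the open window
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = now;           // open (or re-open) the circuit
            }
            return fallback.get();
        }
    }
}

class CircuitBreakerDemo {
    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000);
        int[] attempts = {0};
        Supplier<String> flaky = () -> {
            attempts[0]++;
            throw new IllegalStateException("notification service is down");
        };
        for (long now = 0; now < 5; now++) {
            System.out.println(breaker.call(flaky, () -> "skipped notification", now));
        }
        // Only the first two calls reached the dependency; the rest were bypassed.
        System.out.println("calls that hit the dependency: " + attempts[0]);
    }
}
```

The important design choice is the fallback: for a side effect like a notification, skipping it is usually acceptable, whereas retrying the whole charging flow is not.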
For the perception issue, we needed to ensure that after encoding and decoding, the receiver understands the same thing the sender meant. We had hosted stubs against which we tested the services in the automation testing phase. These stubs were dumb services implementing the same API as the service they stubbed. They were developed by whichever team needed them, so essentially the Billing Service team would never develop the stub for the Billing Service, which caused discrepancies between the stub and the actual service behaviour. This had to change: the team developing the Billing Service was to be responsible for developing the stubs for the Billing Service, and the Carrier Integration Service team for its service's stubs. This way, the stubs always perceived a message the same way the actual service did.
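Why stub ownership matters can be sketched as follows. All names and keyword values here are hypothetical; the point is only that a stub written by the owning team can share the service's own definitions, while a stub written by a consumer encodes the consumer's assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// The API that the real service and every stub of it must honour.
interface CarrierIntegrationApi {
    boolean charge(String userId, String chargeKeyword);
}

// Owned by the Carrier Integration team: single source of truth for keywords.
final class CarrierKeywords {
    static final Set<String> ACCEPTED = new HashSet<>(Arrays.asList("FULL", "PART"));

    private CarrierKeywords() {
    }
}

// The real service (the actual carrier call is omitted in this sketch).
class CarrierIntegrationService implements CarrierIntegrationApi {
    public boolean charge(String userId, String chargeKeyword) {
        return CarrierKeywords.ACCEPTED.contains(chargeKeyword);
    }
}

// Stub shipped by the same team, so it validates keywords exactly like the service.
class ProviderOwnedStub implements CarrierIntegrationApi {
    public boolean charge(String userId, String chargeKeyword) {
        return CarrierKeywords.ACCEPTED.contains(chargeKeyword);
    }
}

// Stub written by the consuming team from its own assumptions: accepts anything.
class ConsumerWrittenStub implements CarrierIntegrationApi {
    public boolean charge(String userId, String chargeKeyword) {
        return true;
    }
}

class StubDemo {
    public static void main(String[] args) {
        String keyword = "PARTIAL";   // what the Billing Service sends in this sketch
        CarrierIntegrationApi[] targets = {
            new CarrierIntegrationService(),
            new ProviderOwnedStub(),
            new ConsumerWrittenStub()
        };
        for (CarrierIntegrationApi api : targets) {
            System.out.println(api.getClass().getSimpleName()
                    + " accepts " + keyword + ": " + api.charge("user-1", keyword));
        }
    }
}
```

Only the consumer-written stub accepts the keyword, which is why tests against it stayed green while the real service refused the request.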
Now how do we address inter-team communication problems, anyone?