

The microservice architecture pattern solves many of the problems inherent in monolithic applications. But microservices bring challenges of their own, one of which is figuring out what went wrong when something breaks. There are at least three related challenges here. Log and metric collection is fairly straightforward (we’ll cover these in a separate post), but only gets you so far.

Let’s say your 20-microservice application starts behaving badly – you start getting timeouts on a particular API and want to find out why. The first place you look may be your centralized metrics service. This will likely confirm that you have a problem, as hopefully you have one or more metrics that are now showing out-of-band numbers.

But what if the issue only affects part of your user population, or worse, a single (but important) customer? In these cases your metrics – assuming you have the right ones in the first place – probably won’t tell you much. In cases like these, where you have minimal or no guidance from your configured metrics, you start trying to figure out where the problem may be. You know your system architecture, and you’re pretty sure you’ve narrowed the issue down to three or four of your services.

So what’s next? Well, you’ve got your centrally aggregated service logs, right? So you open up three or four windows and try to find an example of a request that fails, and trace it through to the other two or three services in the mix. Of course, if your problem only manifests in production then you’ll be sifting through a large number of logs. How good are your logs anyway? You’re in prod, so you’ve probably disabled debug logs, but even if you hadn’t, logs usually only get you so far. After some digging, you might be able to narrow things down to a function or two, but you’re likely not logging all the information you need to proceed from there. Time to start sifting through code… But maybe there’s a better way.

Enter Distributed Tracing

Distributed Tracing is a method of tracking a request as it traverses multiple services. Let’s say you have a simple e-commerce app, which looks a little like this (simplified for clarity):

A simplified e-commerce architecture

Now, your user has made an order and wants to track its status. For this to happen, the user makes a request that hits your API Gateway, which authenticates the request and then sends it on to your Orders service. This fetches the Order details, then consults your Shipping service to discover shipping status, which in turn calls an external API belonging to your shipping partner.

There are quite a few things that can go wrong here. Your Auth service could be down, your Orders service could be unable to reach its database, your Shipping service could be unable to access the external API, and so on. All you know, though, is that your customer is complaining that they can’t access their Order details and they’re getting aggravated.

We can solve this by tracing a request as it traverses your architecture, with each step surfacing details about what is going on and what (if anything) went wrong. We can then use the Jaeger UI to visualize the trace as it happened, allowing us to debug problems as well as identify bottlenecks.
