Telemetry & Monitoring and Retry/Timeout Features with Linkerd

Linkerd is a service mesh for Kubernetes and other frameworks. It provides security, observability, traffic management, and reliability for your system, and it does so without requiring any code changes.

Why Linkerd?

  • High observability with “golden metrics” (success rate, requests per second, latency) and per-route metrics (with Service Profiles)
  • Retries and timeouts for high availability (HA)
  • Web dashboard and preconfigured Grafana dashboard
  • Traffic splitting for canary and blue/green deployments (with SMI’s Traffic Split API)
  • Load balancing
  • Easy installation and fast proxy injection
  • No Ingress Controller of its own, but it integrates with your existing Ingress Controller

In this lab we’ll deploy a sample application and explore some of these features for observability and HA.

Prerequisites

For this lab you’ll need Linkerd installed on your cluster. If you haven’t installed it yet, follow the instructions in this guide.

Install Application

The sample application consists of two services, pizza-client and pizza-server. The pizza-client service generates traffic to pizza-server, which randomly responds with status code 200 or 500 to simulate failures. Clone the repository and apply the YAML files under the manifests/app folder to deploy the application.

Note: pizza-server also simulates latency by waiting a random amount of time (up to 10s).

git clone https://github.com/trlogic-developer/pizza-app.git
cd pizza-app
kubectl apply -f manifests/app

After that, the application is deployed in the app-space namespace. Check the pods and look at the READY column: Linkerd has automatically injected a proxy container into each of them.

kubectl get pod -n app-space

If Linkerd injected the sidecar proxies correctly, you should see “2/2”.
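
To check this without reading the column by eye, a small filter over the kubectl output can flag pods that are not fully ready with at least two containers. This is only a sketch: the function name unmeshed_pods is made up for illustration, and it assumes the default `kubectl get pod` column layout, with READY as the second column.

```shell
# unmeshed_pods (hypothetical helper): reads `kubectl get pod` output on stdin
# and prints the name of every pod whose READY column is not N/N with N >= 2,
# i.e. pods that lack a sidecar or have a container that is not ready.
unmeshed_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if (ready[1] != ready[2] || ready[2] + 0 < 2) print $1
  }'
}

# Usage against the lab namespace (requires cluster access):
#   kubectl get pod -n app-space | unmeshed_pods
```

Empty output means every listed pod reports all of its containers ready and has at least two of them.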

Linkerd did that because the app-space namespace has the “linkerd.io/inject: enabled” annotation in its metadata. Other annotations let you override the default configuration of the data plane proxy, for example:

  • config.linkerd.io/proxy-cpu-limit: Maximum CPU the proxy sidecar can use
  • config.linkerd.io/proxy-memory-limit: Maximum memory the proxy sidecar can use
  • config.linkerd.io/proxy-version: Tag of the Linkerd proxy image

For more about these config annotations visit this page.
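
As an illustration of how such overrides might be set, the sketch below writes a small patch that adds two of these annotations to a workload's pod template. The file name proxy-overrides.yaml and the limit values are assumptions chosen for the example, not settings from the sample repository.

```shell
# Write a patch that sets proxy resource limits on the pod template.
# The file name and the limit values are illustrative assumptions.
cat > proxy-overrides.yaml <<'EOF'
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-cpu-limit: "1"
        config.linkerd.io/proxy-memory-limit: "250Mi"
EOF

# Apply it to a deployment (requires cluster access), for example:
#   kubectl patch deploy/pizza-server -n app-space --patch "$(cat proxy-overrides.yaml)"
```

Changing the pod template rolls the deployment, so the new pods are injected with the overridden proxy settings.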

Web Dashboard and Grafana

Besides its command-line interface, Linkerd provides a web dashboard and a preconfigured Grafana dashboard. Use them to understand what’s going on in your system.

Expose the web dashboard with this command:

linkerd dashboard &

The Overview page lists your namespaces and the resources in them (pods, deployments, etc.). Here you can view the so-called “golden metrics” (success rate, requests per second, and latency) for your resources.

Click on pizza-server under Deployments. Here you can see the source of the requests, the best and worst response times, and the service-to-service communication model. Examine the model to find the source of system failures. In this case the pizza-client deployment fails because of the pizza-server deployment.

Return to the Linkerd web dashboard and open the Tap page, where you can view every request to and from a resource individually. Select the app-space namespace and the deployment/pizza-server resource, then click Start to examine incoming requests to pizza-server.

The Service Mesh page shows the status of the Linkerd control plane components, service mesh details such as version, namespace, and proxy count, and the meshed pod status of each namespace.

Return to the Overview page and click the small Grafana icon to the right of the pizza-server deployment. There you can view system metrics, add or edit charts based on your needs, and share or export information with others.

Retries and Timeouts

Linkerd can automatically retry failed requests. This is a powerful feature, but if it is configured incorrectly it can turn small errors into system-wide outages (see Retry Storm). To limit that risk, Linkerd uses a retry budget. Rather than an arbitrary, fixed limit on the number of retries per request, Linkerd uses a configurable ratio of retries to original requests. Say you set that ratio to 0.3: Linkerd then makes sure that retries add at most 30% more requests.
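
The arithmetic behind the ratio can be sketched in a few lines. This is a simplified model that looks at the ratio only; a real budget is evaluated over a time window and may allow a small extra number of retries per second, which this sketch ignores. The helper name max_retries is hypothetical.

```shell
# max_retries (hypothetical helper): upper bound on retries allowed by a
# ratio-only budget. $1 = original requests in the window, $2 = retry ratio.
# The tiny epsilon guards against floating-point results such as 28.9999...
max_retries() {
  awk -v requests="$1" -v ratio="$2" \
    'BEGIN { printf "%d\n", int(requests * ratio + 1e-9) }'
}

max_retries 100 0.3   # prints 30: at most 30 retries for 100 original requests
```

So with a ratio of 0.3, 100 original requests can grow to at most 130 requests in total, no matter how many of them fail.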

Let’s use this feature to increase our success rate. Before we tell Linkerd about retries and timeouts, we need to define the routes of our pizza-server service (it has only one route, POST /). To do that, we’ll define a Service Profile.

kubectl apply -f manifests/pizza-server-service-profile.yaml

I left comments in the file explaining the role of every specification. Check the file if you want. In short, it defines the route for the pizza-server service, enables retries, and limits the number of retries with a retry budget.
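
For orientation, a Service Profile of this kind could look roughly like the sketch below. It is not the file from the repository: the output file name, the timeout, and the retry budget values are illustrative assumptions, and the profile name assumes a Service called pizza-server in app-space with the default cluster.local cluster domain.

```shell
# Write an illustrative ServiceProfile; all values here are assumptions,
# not a copy of manifests/pizza-server-service-profile.yaml.
cat > service-profile-sketch.yaml <<'EOF'
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # A profile is named after the fully qualified DNS name of its service.
  name: pizza-server.app-space.svc.cluster.local
  namespace: app-space
spec:
  routes:
  - name: POST /
    condition:
      method: POST
      pathRegex: /
    isRetryable: true   # allow Linkerd to retry failed requests on this route
    timeout: 5s         # illustrative per-request timeout
  retryBudget:
    retryRatio: 0.3          # retries may add at most 30% more requests
    minRetriesPerSecond: 10  # illustrative floor for low-traffic periods
    ttl: 10s                 # window over which the ratio is calculated
EOF
```

The route condition is what ties metrics and retry behavior to a specific method and path, which is why routes have to be declared before retries can be enabled.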

Defining a Service Profile gives Linkerd per-route metrics, in addition to enabling features like retries and timeouts. Wait a little while for traffic to be generated, then view the metrics with the “linkerd routes” command.

linkerd routes deploy/pizza-client --to deploy/pizza-server -n app-space -o wide

The -o wide flag adds extra columns to the output so we can judge how well our retry configuration works: the difference between EFFECTIVE_SUCCESS and ACTUAL_SUCCESS shows how much the retries help. EFFECTIVE_RPS is the rate of requests as the client sends them, and ACTUAL_RPS is the rate the destination actually receives, retries included.
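
To read that difference off directly, the table can be piped through a small filter. This is a sketch that assumes the wide output has a header row containing the EFFECTIVE_SUCCESS and ACTUAL_SUCCESS column names; the helper name success_gap is hypothetical. Columns are located by counting from the right, because route names such as "POST /" contain spaces.

```shell
# success_gap (hypothetical helper): reads `linkerd routes ... -o wide` output
# on stdin and prints, for each route, EFFECTIVE_SUCCESS minus ACTUAL_SUCCESS
# in percentage points.
success_gap() {
  awk '
    NR == 1 {
      header_fields = NF
      for (i = 1; i <= NF; i++) {
        # Store positions relative to the last column.
        if ($i == "EFFECTIVE_SUCCESS") eff = i - NF
        if ($i == "ACTUAL_SUCCESS")    act = i - NF
      }
      if (eff == "" || act == "") exit 1   # expected columns not found
      next
    }
    NF > 0 {
      # Extra fields compared to the header belong to the route name.
      route = $1
      for (i = 2; i <= 1 + NF - header_fields; i++) route = route " " $i
      gap = ($(NF + eff) + 0) - ($(NF + act) + 0)
      printf "%s\t%.2f\n", route, gap
    }
  '
}

# Usage (requires cluster access):
#   linkerd routes deploy/pizza-client --to deploy/pizza-server -n app-space -o wide | success_gap
```

A larger gap means more of the failures were hidden from the client by retries.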

In our case the success rate increased from 76.5% to 98%, but our effective RPS dropped to 1.7 and the p95 and p99 latencies increased because of the retried requests.

You can now view that route in the Linkerd web dashboard as well. Check it on the Top Routes page.

Now wait a little and check the Grafana dashboard. For the pizza-server deployment you can see that when the success rate decreases, the request rate increases, because Linkerd retries the failed requests.

Conclusion

As we saw in this lab, Linkerd provides great observability through its CLI, web dashboard, and Grafana dashboard. It can also collect important metrics at the per-route level.

You can also use its features for HA in a production environment. In addition to retries and timeouts, Linkerd has an HA mode for its own control plane. Check out the links in the Additional Reading section to explore Linkerd and service meshes further.

Additional Reading

https://buoyant.io/2017/04/25/whats-a-service-mesh-and-why-do-i-need-one, What is a Service Mesh and Why Do I Need One

https://linkerd.io/2/features/ha/, Linkerd HA mode

https://github.com/trlogic-developer/pizza-app, Sample application github repo