Using Plumbr with the OpenTelemetry NodeJS Agent


August 5, 2020 by
Vladimir Rive
Filed under:
NodeJS
Plumbr
Product Updates
Trace

Every now and then we are asked how well Plumbr supports or integrates with the emerging open-source distributed tracing agents. Our response has always been: “Yes, this integration will be possible one day, when the agents and their data communication protocols are sufficiently mature.”

Then, that “someday” happened …

About a month ago, one of our customers pointed out that while they really appreciated Plumbr detecting availability and performance issues in their frontend and backend, a NodeJS application sitting between the two prevented Plumbr from capturing full traces. Here’s what the high-level architecture of their application looked like, with Plumbr monitoring user experience in the browser and API performance on the JVM backend:

The customer then added that they would be willing to try the OpenTelemetry NodeJS agent, if its spans could be linked to the spans reported by Plumbr agents. And that’s when we thought: with the OpenTelemetry project gaining momentum and becoming more mature, why not give it a try?

The purpose of the OpenTelemetry project is to create vendor-neutral tracing agents and APIs. Its OpenTelemetry JS sub-project includes the NodeJS agent. Although at the time of writing (July 2020) the OpenTelemetry agents had not yet been officially released, we thought this could be a good opportunity to test the integration between the Plumbr and OpenTelemetry agents. After all, we only needed a very minimal subset of the agent’s functionality. Specifically, we needed it to start a span at the boundary of an HTTP request and propagate the trace context through HTTP headers. That part already seemed to work pretty well.

We decided to give it a go and create a POC project that integrates the data collected by the OpenTelemetry NodeJS agent with the data collected by Plumbr’s own agents.

What did we need to do?

To use a third-party agent, we needed to integrate three aspects. First, we needed to receive the data collected by the agent. Second, we needed to process that data in a way that fits our existing data processing pipelines. Third, the agent had to be able to join an ongoing trace and propagate the trace context downstream in a way that our other agents understand.

OpenTelemetry itself suggests installing a so-called collector on each monitored machine. The agents send the monitoring data locally to the collector, and the collector manages the efficient and reliable transmission of the collected data to the actual processing backend. You can choose between different communication protocols between the agent and the collector, and likewise between the collector and the data processing backend, via a piece of integration code called an exporter.

We could have implemented an exporter to convert OpenTelemetry spans to our own binary representation. But that would force our customers to use the collector, which might be overkill for some simple deployments. Instead, we chose to support the Zipkin protocol in our data ingestion backend. This choice has several advantages for us. First, the OpenTelemetry JS agent can use Zipkin to send data both locally and remotely, which means we wouldn’t need the collector at all (but could still use it if necessary). Second, once we start receiving span data in Zipkin format, we can easily add support for any other agent that can export spans in Zipkin format.
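To illustrate what sending spans directly in Zipkin format looks like, here is a configuration sketch using the pre-1.0 (mid-2020) OpenTelemetry JS SDK packages. The service name and endpoint URL are placeholders, not Plumbr's actual ingestion address:

```javascript
'use strict';
// Configuration sketch based on the pre-1.0 OpenTelemetry JS API (mid-2020).
const { NodeTracerProvider } = require('@opentelemetry/node');
const { SimpleSpanProcessor } = require('@opentelemetry/tracing');
const { ZipkinExporter } = require('@opentelemetry/exporter-zipkin');

const provider = new NodeTracerProvider();

// Ship spans straight to a Zipkin-format endpoint -- no local collector needed.
provider.addSpanProcessor(new SimpleSpanProcessor(new ZipkinExporter({
  serviceName: 'my-node-service',                // placeholder service name
  url: 'https://example.invalid/api/v2/spans',   // placeholder endpoint
})));

provider.register();
```

Any backend that accepts the Zipkin v2 span format can stand in for the endpoint above, which is exactly what makes this transport choice reusable beyond a single agent.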

The data processing pipeline benefited from the good design decisions made during the development of our universal agent format. It turned out to be truly universal – once the raw Zipkin data was converted into our binary format, we only had to add new labels in the user interface!

The most interesting part was making the agent participate in our trace context propagation. The OpenTelemetry agent has some great integration points that can be used to customize the headers used to convey the context over the wire. Plumbr and OpenTelemetry use very similar representations for the trace id and span id, so we just needed to format them differently when sending / receiving them over the wire.
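To give an idea of the kind of translation involved, here is a minimal sketch that converts a W3C traceparent header into a pair of Plumbr-style headers. The traceparent layout is the real W3C Trace Context format; the Plumbr header names and the uppercase formatting are illustrative assumptions, not the actual wire protocol:

```javascript
'use strict';
// Sketch: translate a W3C traceparent header into hypothetical Plumbr headers.
// The header names and casing below are assumptions for illustration only.
function traceparentToPlumbrHeaders(traceparent) {
  // traceparent layout: "00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>"
  const parts = traceparent.split('-');
  if (parts.length !== 4) {
    throw new Error('malformed traceparent: ' + traceparent);
  }
  const [, traceId, spanId] = parts;
  return {
    'x-plumbr-trace-id': traceId.toUpperCase(), // assumed target formatting
    'x-plumbr-span-id': spanId.toUpperCase(),   // assumed target formatting
  };
}

const headers = traceparentToPlumbrHeaders(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
);
console.log(headers);
```

In the real integration this logic lives inside a custom propagator hooked into the agent, so every outgoing HTTP request carries headers the downstream Plumbr agents can parse.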

Using Plumbr with the OpenTelemetry nodeJS agent

We ended up with a very lightweight NPM package that hides the multiple OpenTelemetry dependencies required, adds the aforementioned integration hooks, and makes installation and configuration as easy as other Plumbr agents.

How to use it? Simple! Add OpenTelemetry-js-plumbr as a dependency to your project:

npm add bitbucket:plumbr/OpenTelemetry-js-plumbr

Add a file named plumbr.js to your project with the following content:

'use strict';
const { initPlumbrTracing } = require("OpenTelemetry-js-plumbr");
initPlumbrTracing(
   {
      apiKey: "<API key>",
      serverUrl: "https://app.plumbr.io/",
      clusterId: "<Cluster ID>",
      serverId: "<Server ID>"
   }
);

Where:

  • The API key comes from your Plumbr Account Settings page.
  • The cluster ID will be used as the API name for traces collected in Plumbr.
  • The server ID will be used to identify a server in the cluster (optional, defaults to the cluster ID).

Update your startup script to add two parameters to the node command – node -r ./plumbr.js ...<the rest of launch command>

Please note: if you are using the esm package, it must be required before plumbr.js, otherwise startup will fail: node -r esm -r ./plumbr.js ...

Final result

After completing our POC and doing extensive internal testing, we shipped it to the client mentioned above. The first external tests were successful: the bottlenecks detected in the JVM application could now be linked all the way back to the affected user interactions in the browser. The NodeJS layer in the middle was no longer a blind spot and appeared as part of the traces, enabling observability across the entire tech stack.

See how the spans of a user interaction in a browser are displayed by Plumbr, and notice how they pass through a NodeJS layer to the JVM (the last two lines of the span list):

With wider deployment, the new integration now already handles over 100,000 user interactions per day. If you have a similar deployment, why not go ahead and try our new agent here.


