JVM forensics vs APM: why both, when each, and where each stops

The compatibility story (short version)

Every major APM tool uses the Java Agent API (-javaagent:) plus bytecode instrumentation. That mechanism is standard OpenJDK; Eliya doesn't modify it. APM agents coexist with Eliya the same way they coexist with Corretto, Temurin, or Zulu:

java -javaagent:/opt/your-apm-agent.jar \
      -XX:EliyaProfile=Production \
      -jar app.jar

The agent does its instrumentation; Eliya activates its operational-readiness defaults. Both run side by side. No conflict at the JVM level. What matters is which tool answers which question, and where each stops.

What APM tools do well

APM tools are real-time observability platforms. They excel at:

Distributed tracing: following a request across service boundaries, with timing per hop.
Business transaction analysis: aggregating performance data by user-facing operation rather than by JVM internal.
Real-time dashboards and alerting: threshold-based pages and Slack notifications when latency or error rate moves.
Service-map visualisation: topology, dependency graphs, request flow across a microservices estate.
Cross-service correlation: one trace ID stitches frontend, backend, database, and message-broker spans together.

Most APM vendors do not publish a single official overhead figure; it depends heavily on instrumentation depth and configuration. Where vendors publish, and where they do not:

APM	Published overhead	Notes
AppDynamics	0–2% CPU, ~10–100 MB (plus ~100 MB heap) [docs]	Official figure published
Dynatrace OneAgent	1–3% CPU design target, ~200 MB budget [docs]	Most aggressive instrumentation; separate native agent for system metrics
Elastic APM	Single-digit-microsecond latency, a few MB [docs]	Open-source; publishes a latency figure, not a CPU/memory band
New Relic	No official figure published	SaaS only; docs describe a memory circuit-breaker, not a baseline overhead
Datadog APM	No official figure published	SaaS; the separate Datadog Continuous Profiler is built on JFR
OpenTelemetry (Java auto-instrumentation)	No single official estimate; its own page says to measure (community reports ~2–5% CPU, ~50–100 MB)	Open standard; same structural limitations as commercial APM

The point is not a precise number. Any in-process agent adds measurable CPU and memory, and the amount varies enough by configuration that you have to measure it in your own deployment.

What APM tools structurally cannot do

The structural limitation of APM is the architecture itself: agents sample, aggregate, and ship summaries to a backend. The raw artefacts production incidents require are not in that data model. Six specific things APM cannot do:

Capture raw artefacts for offline analysis. HPROF heap dumps for Eclipse MAT, jstack-format thread dumps for fastThread / TDA, raw GC logs for GCEasy. APM tools don't natively write these formats to local disk for your own offline tools. Some heavy enterprise APMs (Dynatrace, AppDynamics) provide a UI button to trigger a heap dump, but they often struggle with container filesystems, require proprietary proxies (e.g. Dynatrace ActiveGate) to download the file, and keep the analysis inside their SaaS platform rather than producing a standard .hprof on a predictable path. Elastic APM's issue tracker has explicit user requests for native heap-dump capture; the answer is "use jmap, jcmd, jstat directly."
Replay JVM history at event-level granularity. APM samples (every 10s, every 60s), aggregates (p50, p95, p99), and ships summaries. JFR records individual timestamped events (GC phases, JIT compilations, lock contention above a threshold, sampled allocations and execution stacks) for the configured rolling window (24h in the continuous-recording default planned for Eliya Phase 2) at sub-1% overhead. When a rare incident occurs, JFR has the actual events; APM has only the metrics that survived sampling and aggregation.
Diagnose native memory leaks. APM watches the Java heap. Leaks in JNI code, DirectByteBuffer allocations, Metaspace growth, code cache pressure, thread stack accumulation, and third-party native libraries (Deflater, image processors, crypto) live outside the heap. Native Memory Tracking (NMT) is the standard JVM tool for this; Eliya enables it by default. NMT accounts for memory the JVM itself allocates, so growth in process memory that NMT does not account for points at native code outside the JVM.
Survive JVM crashes. When the JVM segfaults, the APM agent dies with it. Anything in the agent's memory at the moment of crash is gone. The JVM crash log (hs_err_pid<PID>.log) is written by JVM crash machinery before the process terminates: register state, all thread stacks, loaded libraries, signal info, JVM internal state. APM has nothing equivalent.
Reach JIT / compilation-event detail. APM shows latency moved. JFR shows the C1→C2 promotion that caused it, the deoptimisation event, the code cache pressure that pushed a hot method out, the monomorphic inline cache that lost its assumption. APM aggregates around symptoms; JFR records causes.
Work in air-gapped or data-egress-restricted environments. SaaS APM requires continuous outbound connectivity to a vendor backend. That requirement cannot be met by BFSI settlement systems with regulatory restrictions on production data leaving the perimeter, healthcare integrations under HIPAA, government and defence deployments under classification controls, air-gapped industrial control systems, or sovereign-cloud deployments under data-localisation laws. Eliya runs entirely inside the perimeter.
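The rolling-window recording described in the second item above can also be started from inside the process with the standard jdk.jfr API. What follows is a minimal sketch, assuming the stock OpenJDK "default" settings file; the class name, recording name, and dump location are hypothetical, and this is not a description of how Eliya itself configures JFR.

```java
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

public class RollingJfrSketch {

    /** Starts a disk-backed recording that discards data older than maxAge. */
    static Recording startRolling(Duration maxAge) throws Exception {
        // "default" is the low-overhead settings file shipped with OpenJDK.
        Configuration config = Configuration.getConfiguration("default");
        Recording recording = new Recording(config);
        recording.setName("rolling-sketch"); // hypothetical name
        recording.setToDisk(true);           // keep chunks on disk, not only in memory
        recording.setMaxAge(maxAge);         // the rolling window
        recording.start();
        return recording;
    }

    public static void main(String[] args) throws Exception {
        Recording recording = startRolling(Duration.ofHours(24));
        Thread.sleep(1_000); // stand-in for the application doing work

        // Snapshot whatever is currently inside the window, e.g. after an incident.
        Path out = Files.createTempFile("incident-", ".jfr");
        recording.dump(out);
        recording.close();
        System.out.println("wrote " + out + " (" + Files.size(out) + " bytes)");
    }
}
```

The resulting file opens in JDK Mission Control or with the jfr command-line tool that ships in the JDK.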

Ten production incident-response use cases

These are scenarios where Eliya's captured artefacts solve problems APM tools structurally cannot. Each is a real incident pattern, not a contrived example.

1. Memory leak investigation requiring heap analysis

Production app OOMs every 18 hours. APM shows heap pressure but cannot identify what's holding references. With Eliya, HeapDumpOnOutOfMemoryError is configured by default; when OOM occurs the heap dump lands at /var/log/eliya/${service}/${replica}/heap/java_pid<PID>.hprof. Eclipse MAT's Leak Suspects report identifies the dominator tree and retainer paths. APM gave a graph saying memory grew; Eliya gives the dump telling you which objects, which classloader, which paths.
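The OOM-triggered dump comes from the standard -XX:+HeapDumpOnOutOfMemoryError flag. When a leak is suspected before the OOM arrives, the same .hprof format can be produced on demand through the HotSpotDiagnosticMXBean. This is a minimal sketch, assuming a HotSpot-based JVM; the class name and output file name are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;

import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapDumpSketch {

    /** Writes a heap dump of reachable objects into dir and returns its path. */
    static Path dumpLiveHeap(Path dir) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // The file must not already exist and must end in .hprof.
        Path out = dir.resolve("manual-" + ProcessHandle.current().pid() + ".hprof");
        bean.dumpHeap(out.toString(), true); // true = only live (reachable) objects
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heap-sketch");
        Path dump = dumpLiveHeap(dir);
        System.out.println("heap dump at " + dump + " (" + Files.size(dump) + " bytes)");
    }
}
```

The file can be opened in Eclipse MAT exactly like a dump produced at OOM time. Taking it pauses the JVM while the heap is written, so it is an incident tool rather than something to schedule.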

2. Thread starvation / deadlock investigation

Service stops responding; CPU at 100%; APM shows latency exploding but can't say why. Thread dumps via jcmd <pid> Thread.print show every thread's state and stack. Multiple dumps 5–10 seconds apart show which threads moved and which are stuck. Real example: Apache PDFBox deadlock where two threads acquired locks in opposite orders, hanging an application that APM only saw as "high latency." Thread dumps showed exactly which two threads, which locks, which call stacks.
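The "several dumps a few seconds apart" technique can be sketched in-process with ThreadMXBean: take two snapshots, report threads whose state and top frame did not change, and ask the JVM directly whether it sees a lock cycle. This is an illustrative sketch with hypothetical class and method names, not a replacement for reading the full jcmd output.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StuckThreadSketch {

    /** Snapshot: thread id -> "name STATE @ top frame". */
    static Map<Long, String> topFrames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, String> snapshot = new HashMap<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "<no frames>";
            snapshot.put(info.getThreadId(),
                    info.getThreadName() + " " + info.getThreadState() + " @ " + top);
        }
        return snapshot;
    }

    /** Threads whose description is identical in both snapshots. */
    static List<String> unchanged(Map<Long, String> first, Map<Long, String> second) {
        List<String> same = new ArrayList<>();
        for (Map.Entry<Long, String> entry : first.entrySet()) {
            if (entry.getValue().equals(second.get(entry.getKey()))) {
                same.add(entry.getValue());
            }
        }
        Collections.sort(same);
        return same;
    }

    /** Names of threads in a lock cycle detected by the JVM; empty if none. */
    static List<String> deadlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        List<String> names = new ArrayList<>();
        if (ids != null) {
            for (ThreadInfo info : mx.getThreadInfo(ids)) {
                if (info != null) {
                    names.add(info.getThreadName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Long, String> first = topFrames();
        Thread.sleep(5_000); // the gap between dumps
        Map<Long, String> second = topFrames();
        unchanged(first, second).forEach(t -> System.out.println("unchanged: " + t));
        System.out.println("deadlocked: " + deadlocked());
    }
}
```

Idle pool threads parked on a queue also show up as unchanged, so the list is a starting point for reading stacks rather than a verdict.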

3. Native memory leak requiring NMT analysis

Container memory grows steadily despite a stable heap, and the pod is eventually OOMKilled by Kubernetes. Heap analysis shows nothing because the leak isn't in the heap. With NMT summary mode (enabled by EliyaProfile=Production), run jcmd <pid> VM.native_memory baseline, then later jcmd <pid> VM.native_memory summary.diff to see which JVM memory categories grew; if none did while process memory kept climbing, the leak is in native code outside the JVM's own accounting. Real example: a Twitter exception logger that compressed messages with the native zlib library caused service-killing memory growth; 94% of leaked blocks traced to Java_java_util_zip_Deflater_init. No APM would have caught this.
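The same baseline-then-diff sequence can be driven from inside the JVM through the DiagnosticCommand MBean, which exposes VM.native_memory as the operation vmNativeMemory. This is a sketch assuming a HotSpot JVM started with -XX:NativeMemoryTracking=summary; without that flag the call still succeeds but only returns a message saying tracking is not enabled. The class name and the one-minute gap are hypothetical.

```java
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class NmtSketch {

    /** Runs "jcmd <pid> VM.native_memory <args>" against the current JVM. */
    static String nativeMemory(String... args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("com.sun.management:type=DiagnosticCommand");
        // Diagnostic commands take a single String[] argument.
        return (String) server.invoke(
                name,
                "vmNativeMemory",
                new Object[] {args},
                new String[] {String[].class.getName()});
    }

    public static void main(String[] args) throws Exception {
        System.out.println(nativeMemory("baseline"));
        Thread.sleep(60_000); // let the suspected leak grow
        System.out.println(nativeMemory("summary.diff"));
    }
}
```

Categories that grew carry a +NKB delta in the diff output. Memory allocated by native libraries outside the JVM does not appear in any category, which is itself the signal to move to a native allocation profiler.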

4. JIT performance regression investigation

v1.4.2 was fast; v1.4.3 is 30% slower with the same code. APM shows the slowdown but cannot tell you which methods compile differently. JFR's compilation events (available whenever a recording is running; continuous recording by default is planned for Phase 2) show which methods entered C1 vs C2, which were deoptimised, which hit code cache pressure. APM showed latency increased; JFR explains why: a specific virtual call lost its monomorphic inline cache, or tiered compilation policy shifted.
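A first pass at comparing two releases is to count the JIT-related events in a recording from each. The sketch below reads .jfr files with the standard jdk.jfr.consumer API and tallies three event types; the class name and the file names passed on the command line are hypothetical, and which events are present depends on the settings the recording was made with.

```java
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class JitEventSketch {

    // JIT-related event types defined by the JDK.
    private static final Set<String> JIT_EVENTS =
            Set.of("jdk.Compilation", "jdk.Deoptimization", "jdk.CodeCacheFull");

    /** Counts JIT-related events in one recording file, keyed by event type. */
    static Map<String, Long> countJitEvents(Path jfrFile) throws IOException {
        Map<String, Long> counts = new TreeMap<>();
        try (RecordingFile file = new RecordingFile(jfrFile)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                String type = event.getEventType().getName();
                if (JIT_EVENTS.contains(type)) {
                    counts.merge(type, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JitEventSketch.java v1.4.2.jfr v1.4.3.jfr
        for (String arg : args) {
            System.out.println(arg + " -> " + countJitEvents(Paths.get(arg)));
        }
    }
}
```

A jump in jdk.Deoptimization between the two files is a reason to look at the individual events in JDK Mission Control, where the affected method and the deoptimisation reason are visible.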

5. GC pause investigation

P99.9 latency spike at 04:17 UTC. APM shows the spike but cannot show what GC was doing. Unified GC logging with rotation (a Production-profile default planned for Phase 2) captures every collection event with pause duration, reason, live set before and after, and algorithm phase. JFR's GC events add allocation pressure, promotion failures, humongous allocations. APM cannot reach this level of detail; GCViewer, GCEasy.io, or JClarity Censum can analyse the raw log Eliya produced.
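Before reaching for a GC analysis tool, the raw log can be scanned for long pauses directly. The sketch below assumes OpenJDK unified logging output at gc info level with sizes reported in megabytes; the sample lines in main are illustrative, not taken from a real incident, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseSketch {

    // Matches e.g. "GC(42) Pause Full (G1 Compaction Pause) 900M->300M(1024M) 412.500ms"
    private static final Pattern PAUSE = Pattern.compile(
            "GC\\((\\d+)\\) (Pause .+?) (\\d+)M->(\\d+)M\\((\\d+)M\\) ([\\d.]+)ms");

    /** Returns GC id -> pause in milliseconds for pauses longer than thresholdMs. */
    static Map<Integer, Double> pausesOver(List<String> lines, double thresholdMs) {
        Map<Integer, Double> slow = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                double millis = Double.parseDouble(m.group(6));
                if (millis > thresholdMs) {
                    slow.put(Integer.parseInt(m.group(1)), millis);
                }
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "[12.345s][info][gc] GC(41) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 7.891ms",
                "[13.001s][info][gc] GC(42) Pause Full (G1 Compaction Pause) 900M->300M(1024M) 412.500ms",
                "[13.500s][info][gc] GC(43) Concurrent Mark Cycle 25.000ms");
        System.out.println(pausesOver(sample, 100.0)); // prints {42=412.5}
    }
}
```

Concurrent phases are skipped on purpose because they do not stop application threads; only lines containing a "Pause" entry count toward latency spikes.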

6. Crash investigation

JVM segfaults; container restarts; APM agent died with the process and no signal reached the backend. Eliya writes hs_err_pid<PID>.log to /var/log/eliya/${service}/${replica}/crash/: register state, crashing thread's stack trace, all loaded libraries, environment variables, signal info, JVM internal state at the moment of crash. Without this file, the state of the process at the moment of the crash cannot be recovered.

7. Compliance audit trail

Auditor asks: "Show me evidence that billing-service in production didn't crash or experience memory pressure during March 2026." With Eliya, the per-replica directory tree under /var/log/eliya/billing-service/ contains jfr/, gc/, crash/, heap/ for each replica. Filesystem timestamps and contents are the audit record. APM provides dashboards an auditor cannot directly verify; Eliya provides files on disk an auditor can independently inspect.
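Producing the evidence for such a request amounts to listing what is on disk. The sketch below walks one service's directory tree and emits a line per file with relative path, size, and last-modified time. The default root path follows the layout described above; the class name and output format are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AuditInventorySketch {

    /** One tab-separated line per regular file: relative path, bytes, mtime. */
    static List<String> inventory(Path serviceRoot) throws IOException {
        try (Stream<Path> walk = Files.walk(serviceRoot)) {
            return walk.filter(Files::isRegularFile)
                    .sorted()
                    .map(file -> describe(serviceRoot, file))
                    .collect(Collectors.toList());
        }
    }

    private static String describe(Path root, Path file) {
        try {
            return root.relativize(file)
                    + "\t" + Files.size(file)
                    + "\t" + Files.getLastModifiedTime(file).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "/var/log/eliya/billing-service");
        inventory(root).forEach(System.out::println);
    }
}
```

An empty crash/ and heap/ directory for every replica across the period, alongside continuous gc/ output, is the kind of listing the auditor can reproduce independently.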

8. Customer post-mortem support

Customer reports their integration with your service failed at 14:23 UTC. Was your service degraded? JFR recordings from 14:00–14:30 UTC are on disk for that exact pod. You can ship the JFR file to the customer for joint analysis, or analyse it yourself. APM dashboards covering that period are aggregated and you cannot share the raw events with the customer.

9. Multi-region incident correlation

P99 latency degraded in eu-west-1 between 09:00 and 11:00. Did the same happen in us-east-1? What was different? JFR recordings from both regions are on disk in identical format because the data is JFR-native. Compare GC behaviour, compilation events, allocation patterns directly. APM aggregates metrics differently across regions depending on deployment; JFR has consistent semantics regardless of where captured.

10. Cross-artefact correlation (Phase 3 Asymm Forensics preview)

JFR alone tells you what events occurred. Heap dumps alone tell you what was retained. Thread dumps alone tell you what threads were doing. GC logs alone tell you what the collector did. Crash logs alone tell you the moment of failure. Asymm Forensics correlates them. Was the OOM preceded by a specific allocation pattern (JFR + heap dump)? Did the GC pause correlate with a specific lock contention pattern (GC log + thread dump)? Was the crash preceded by JIT pressure (crash log + JFR compilation events)? This is what no APM does: APM correlates metrics; forensics correlates artefacts. The artefacts must exist on disk for forensics to work. Eliya makes them exist on disk by default.

Coexistence patterns

Three concrete configurations covering the realistic deployment shapes.

Eliya + SaaS APM (Datadog example)

Both agents attach; both work. Datadog captures distributed traces, business metrics, and APM views. Eliya's JFR captures detailed JVM-level profiling locally. The interesting note: Datadog Continuous Profiler uses JFR under the hood, so Eliya's JFR activation feeds Datadog's profiling pipeline directly. Because Eliya makes JFR safely usable with a single flag today, and turns on continuous recording at sub-1% overhead by default in the planned Phase 2, it eliminates the setup friction Datadog customers otherwise hit when enabling Continuous Profiler. Eliya is the reliable engine; Datadog is the visualisation pane. Heap dumps on OOM (Eliya) additionally preserve evidence Datadog couldn't have captured retroactively.

java -javaagent:/opt/dd-java-agent.jar \
      -XX:EliyaProfile=Production \
      -jar app.jar

Eliya + self-hosted APM (Elastic APM with on-prem stack)

Hospital IT runs their own Elasticsearch + Kibana + Elastic APM Server in their datacenter; data never leaves. Adding Eliya is purely additive: Elastic APM gives the distributed view (cross-service traces, dashboards); Eliya gives the deep per-JVM forensic data (JFR recordings, heap dumps, GC analysis) that Elastic APM doesn't capture. Different analysis surfaces, same compliance posture.

java -javaagent:/opt/elastic-apm-agent.jar \
      -XX:EliyaProfile=Production \
      -jar integration-service.jar

Eliya alone (compliance-restricted environment)

A BFSI settlement system under regulatory data-egress restrictions: production data cannot leave the perimeter; SaaS APM is therefore prohibited. The team historically had JConsole for occasional inspection, ad-hoc heap dumps under suspicion of memory pressure, no continuous profiling at all. With Eliya, they now have automatic heap dumps on OOM with structured paths, NMT summary, crash log generation, container support reinforced, and diagnostic VM options unlocked: the capability they previously couldn't get without violating data-egress policy. Phase 2 will add continuous JFR (24h rolling, sub-1% overhead) and unified GC logging by default, so the incident window is already on disk without an operator starting a recording.

The four-phase compounding

Eliya is a forensic-grade JVM platform that compounds across four phases, each one extending the position.

Phase 1 (current, shipped 25.0.3): six operational-readiness ergonomics. Heap dump on OOM with structured path, exit-on-OOM, Native Memory Tracking summary, predictable crash log path, container support reinforced, diagnostic VM options unlocked (which makes JFR and async-profiler usable with a single flag today). Defaults that survive incidents and crashes. Activated by -XX:EliyaProfile=Production; off by default, opt-in. Phase 2 (planned) adds continuous JFR recording (24-hour rolling buffer) and unified GC logging as defaults under the same flag.
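Whether these defaults are actually in effect on a running JVM can be checked from inside the process. The source does not list which VM options the Production profile sets, so the mapping below from the six ergonomics to standard OpenJDK options is an assumption made for illustration, as is the class name.

```java
import com.sun.management.HotSpotDiagnosticMXBean;

import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public class FlagCheckSketch {

    // Assumed mapping: standard OpenJDK options that correspond to the
    // ergonomics described above. Not confirmed as what the profile sets.
    static final String[] FLAGS = {
        "HeapDumpOnOutOfMemoryError",
        "HeapDumpPath",
        "ExitOnOutOfMemoryError",
        "NativeMemoryTracking",
        "ErrorFile",
        "UseContainerSupport",
        "UnlockDiagnosticVMOptions"
    };

    /** Reads the current value of each option from the running JVM. */
    static Map<String, String> readFlags() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Map<String, String> values = new LinkedHashMap<>();
        for (String flag : FLAGS) {
            try {
                values.put(flag, bean.getVMOption(flag).getValue());
            } catch (IllegalArgumentException e) {
                // Some options exist only on certain platforms or JVM builds.
                values.put(flag, "<not available on this JVM>");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        readFlags().forEach((flag, value) -> System.out.println(flag + " = " + value));
    }
}
```

Running it once with and once without -XX:EliyaProfile=Production shows which values the profile changes in a given build.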

Phase 2 (planned): bundled local-only diagnostic tooling. Eclipse MAT headless and async-profiler ship inside the Eliya image alongside the artefacts they analyse; analysis tools live in the perimeter with the data. FIPS-validated build variant for procurement-restricted environments. Capability flag carve-outs for fine-grained control.

Phase 3: Asymm Forensics cross-correlation engine. JFR + heap dumps + thread dumps + GC logs + crash artefacts analysed together to generate compliance-aligned root-cause reports. No APM does this because no APM has the raw artefacts. This is the architectural argument that Phase 1 makes possible.

Phase 4 (demand-gated): compliance-aligned profile flags. EliyaProfile=PCIDSS, =HIPAA, =SOX, =FedRAMP, etc., activating framework-specific JVM settings while preserving the local-only forensic discipline. The flag namespace is reserved today; implementations ship as customer demand triggers each framework.

APM tools occupy a different category that Eliya doesn't try to compete with. The four-phase trajectory builds a forensic-platform position no APM is structured to reach.

Decision framework

Use APM only	Use Eliya only	Use both
You need cross-service distributed tracing as the primary diagnostic surface, and your environment permits SaaS data egress.	You operate under data-egress restrictions that prohibit SaaS APM (BFSI, healthcare, public sector, air-gapped industrial).	You want APM for team-wide dashboards and alerting, plus Eliya for the forensic data APM cannot capture; this is the dominant pattern at scale.
You're at a development scale where forensic-grade artefacts are over-engineering for the questions you ask.	You need filesystem-encoded compliance audit trail and APM dashboards aren't an auditable artefact for your framework.	You're a Datadog Continuous Profiler customer; Eliya's JFR defaults directly feed Datadog's profiler.
Your stack is so cross-language (Go, Rust, Python beside Java) that the APM vendor's polyglot view is the primary win.	You need to analyse JVM internals (JIT compilation, native memory, crash forensics) and APM cannot reach that level.	You need both real-time operational visibility (APM) and post-incident forensic analysis (Eliya): most production teams.

Eliya is not an APM replacement

APM tools provide real-time operational visibility that Eliya doesn't try to compete with. Eliya is a JDK distribution that provides production-ready forensic-grade defaults; APM tools are observability platforms. Use both where it makes sense; use Eliya where SaaS APM is prohibited; use APM where dashboards and distributed traces are the primary need.

For a summary of Eliya's flag taxonomy and what each phase ships, see What is Eliya? For the vendor-by-vendor JDK comparison, see Choosing a JDK in 2026.
