Lessons from production: why Eliya's defaults look the way they do

Pattern 1: When 4 GB ≠ 4 GB

A recurring production pattern in containerised JVM deployments: teams set -Xmx4G to match a 4 GB container memory limit, then encounter allocation failures or OOM kills well below the configured heap maximum.

The cause is well-understood but consistently surprises teams new to container memory math:

- The 4 GB container limit is the total memory budget for the cgroup, not just the JVM heap.
- JVM overhead beyond the heap includes metaspace (~150–300 MB), thread stacks (~1 MB per thread × hundreds of threads), the JIT code cache (up to ~240 MB), GC bookkeeping, native memory for NIO direct buffers, JNI/FFM allocations, and the JVM process itself.
- Page cache and other kernel-side overhead also draw from the cgroup budget.
- An -Xmx4G setting on a 4 GB container realistically leaves ~2.5–3 GB of usable heap before the kernel OOM killer terminates the container.
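The budget arithmetic behind the list above can be sketched in a few lines of Java. The metaspace, thread-stack, and code-cache figures are taken from the ranges quoted above; the thread count and the lump sum for GC bookkeeping, direct buffers, JNI, and kernel-side overhead are illustrative assumptions, not measurements.

```java
// Rough container memory budget for a JVM. All figures are in MB.
// Illustrative sketch only: real values should come from NMT, not from constants.
class ContainerBudget {

    /** Heap that still fits once non-heap consumers are subtracted from the cgroup limit. */
    static long heapHeadroomMb(long containerLimitMb, long metaspaceMb, int threads,
                               long stackPerThreadMb, long codeCacheMb, long otherNativeMb) {
        long threadStacksMb = threads * stackPerThreadMb;
        long nonHeapMb = metaspaceMb + threadStacksMb + codeCacheMb + otherNativeMb;
        return containerLimitMb - nonHeapMb;
    }

    public static void main(String[] args) {
        long limitMb = 4096;      // 4 GB container limit
        long metaspaceMb = 250;   // within the ~150-300 MB range quoted above
        int threads = 300;        // assumption: "hundreds of threads"
        long stackMb = 1;         // ~1 MB per thread
        long codeCacheMb = 240;   // JIT code cache
        long otherNativeMb = 500; // assumption: GC bookkeeping, direct buffers, JNI, kernel overhead

        long headroom = heapHeadroomMb(limitMb, metaspaceMb, threads, stackMb, codeCacheMb, otherNativeMb);
        System.out.println("Heap headroom: " + headroom + " MB of " + limitMb + " MB");
    }
}
```

With these example inputs the headroom lands inside the ~2.5–3 GB band described above; different thread counts or direct-buffer usage shift it noticeably.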

In 15+ years of production JVM operations across telecom identity platforms, BFSI settlement engines, and enterprise SaaS deployments, this pattern recurs across organisations regardless of team experience level.

Eliya's response

-XX:+UseContainerSupport is enabled by default (this is already the upstream default in JDK 25; Eliya sets it explicitly for clarity).
Native Memory Tracking is enabled by default, which makes off-heap memory visible: the NMT summary answers "where is the rest of my container memory going?"
For container deployments, teams often get more predictable results with -XX:MaxRAMPercentage (e.g., -XX:MaxRAMPercentage=70) than with an explicit -Xmx, since percentage-based sizing follows container memory-limit changes automatically. Eliya does not set this for you; it is a tuning decision that depends on JVM workload patterns and non-JVM container memory consumption.
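As a rough illustration of why percentage-based sizing follows limit changes, the sketch below computes the heap ceiling that a 70% setting would imply at two container limits, then prints the ceiling the current JVM actually resolved. The 70% value and both limits are example inputs; the heap HotSpot selects in practice may differ slightly because of alignment.

```java
// Sketch: heap ceiling implied by a percentage of the container limit.
class HeapPercentage {

    /** Heap ceiling in MB for a MaxRAMPercentage-style setting. */
    static long heapCeilingMb(long containerLimitMb, double percentage) {
        return Math.round(containerLimitMb * percentage / 100.0);
    }

    public static void main(String[] args) {
        double pct = 70.0; // example value, as in -XX:MaxRAMPercentage=70
        for (long limitMb : new long[] {4096, 6144}) {
            System.out.printf("limit=%d MB -> heap ceiling ~%d MB%n",
                    limitMb, heapCeilingMb(limitMb, pct));
        }
        // The maximum heap this JVM resolved for itself, whatever flags it was started with
        long actualMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("This JVM's max heap: " + actualMb + " MB");
    }
}
```

Raising the container limit raises the ceiling without touching the JVM flags, which is the property an explicit -Xmx lacks.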

How operators use NMT for Pattern 1

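When a container OOM-kills the JVM despite heap usage below -Xmx, request the NMT summary from a running replica (for example, with jcmd <pid> VM.native_memory summary).

The output shows native memory consumed by category: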

Native Memory Tracking:
 
 Total: reserved=4194304KB, committed=2891024KB
 - Java Heap (reserved=2097152KB, committed=2097152KB)
 - Class (reserved=1080888KB, committed=34744KB)
 - Thread (reserved=124416KB, committed=124416KB)
 - Code (reserved=240000KB, committed=240000KB)
 - GC (reserved=87040KB, committed=87040KB)
 - Internal (reserved=52480KB, committed=52480KB)
 - Other ...

The Class, Thread, Code, GC, and Internal categories add up to about 526 MB of committed memory in this sample, on top of the 2 GB heap; in small-heap services they can approach or exceed the heap itself. Reading this output immediately surfaces what -Xmx alone hides: the JVM's full memory footprint, not just the heap. Most APM tools do not expose this breakdown.
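To total the non-heap categories without doing it by hand, a summary in the shape shown above can be parsed with a small helper. This is a minimal sketch that assumes each category line looks like "- Name (reserved=…KB, committed=…KB)"; real NMT output carries extra detail lines and its layout varies by JDK version, so treat the pattern as a starting point.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: sum committed native memory outside the Java heap from an NMT-style summary.
class NmtSummary {

    // Matches lines shaped like "- Class (reserved=1080888KB, committed=34744KB)"
    private static final Pattern CATEGORY =
            Pattern.compile("-\\s*([A-Za-z ]+?)\\s*\\(reserved=(\\d+)KB, committed=(\\d+)KB\\)");

    /** Committed KB per category, in the order the categories appear. */
    static Map<String, Long> committedByCategory(String summary) {
        Map<String, Long> result = new LinkedHashMap<>();
        Matcher m = CATEGORY.matcher(summary);
        while (m.find()) {
            result.put(m.group(1), Long.parseLong(m.group(3)));
        }
        return result;
    }

    /** Sum of committed KB across every category except the Java heap. */
    static long nonHeapCommittedKb(Map<String, Long> committed) {
        return committed.entrySet().stream()
                .filter(e -> !e.getKey().equals("Java Heap"))
                .mapToLong(e -> e.getValue())
                .sum();
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "- Java Heap (reserved=2097152KB, committed=2097152KB)",
                "- Class (reserved=1080888KB, committed=34744KB)",
                "- Thread (reserved=124416KB, committed=124416KB)",
                "- Code (reserved=240000KB, committed=240000KB)",
                "- GC (reserved=87040KB, committed=87040KB)",
                "- Internal (reserved=52480KB, committed=52480KB)");
        Map<String, Long> committed = committedByCategory(sample);
        System.out.println("Non-heap committed: " + nonHeapCommittedKb(committed) + " KB");
    }
}
```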

Pattern 2: Pause variance kills SLAs, not pause medians

A recurring observation from production GC analysis: teams tune for and report median GC pause times, then get surprised when p99 latency SLAs fail.

A typical pattern from production G1 logs in high-traffic Java workloads: young-generation evacuation pauses ranging from single-digit milliseconds to several tens of milliseconds across consecutive GC cycles at similar heap occupancy. The median pause is acceptable. The variance is what breaches the latency SLA.

This pattern appears across Eliya's target industries:

- BFSI: real-time payment processing, settlement systems, algorithmic trading. p99.99 latency tied to transaction timing windows.
- Telecom: identity platforms, BSS components, signalling gateways. Latency tied to subscriber experience and protocol timing.
- Healthcare: real-time clinical decision support, image processing pipelines. Latency tied to clinical workflow timing.
- Government: high-availability citizen services, authentication infrastructure. Latency tied to SLA contracts with regulatory penalties.

Each industry has different acceptable latency bands, but all share the structural property that variance, not the median, determines whether the SLA is met.
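The gap between median and tail is easy to demonstrate with a handful of numbers. The sketch below uses a nearest-rank percentile over hypothetical pause samples (mostly single-digit milliseconds with a few outliers in the tens); the sample values are invented for illustration and are not taken from a real GC log.

```java
import java.util.Arrays;

// Sketch: median versus p99 over a set of GC pause samples.
class PauseStats {

    /** Nearest-rank percentile of the samples; p is in (0, 100]. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical young-generation pause times in milliseconds
        double[] pausesMs = {6, 7, 5, 8, 6, 7, 42, 6, 5, 7, 6, 8, 7, 6, 38, 5, 6, 7, 6, 55};
        System.out.printf("median=%.0f ms, p99=%.0f ms%n",
                percentile(pausesMs, 50), percentile(pausesMs, 99));
    }
}
```

A report that quotes only the median of these samples would look healthy, while a p99 target in the low tens of milliseconds would already be breached.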

Eliya's response

Phase 1 today (25.0.3): GC events become visible in one step by running jcmd <pid> JFR.start settings=default against the running JVM (no restart, no second flag), because the diagnostic options JFR needs to sample accurately are already unlocked. The NMT summary surfaces native memory pressure that correlates with GC variance.
Phase 2 (planned): continuous JFR and unified -Xlog:gc* logging with rotation become defaults under EliyaProfile=Production. Every individual pause is captured as a structured event, so variance is visible without operator action.
The -XX:+UseZGC opt-in pattern is documented in the flags reference for workloads that need sub-millisecond pause targets across a wide range of heap sizes (Generational ZGC in JDK 21+ works well from a few GB up to multi-TB). For latency-critical workloads, benchmark ZGC against G1 with your actual workload; the relative throughput trade-off depends on allocation rate and CPU availability.

The lesson: continuous observability beats periodic reporting. You can't tune what you can't measure, and aggregate metrics hide the variance that matters.

Pattern 3: When the JVM gets blamed for upstream problems

A recurring incident pattern: production outage looks like JVM failure (threads hung, requests timing out, application unresponsive). Initial investigation focuses on the JVM. Root cause turns out to be upstream (load balancer queue saturation, downstream database connection limits, network middleware misconfiguration), but the JVM appeared to be the problem because:

- Threads were blocked on socket reads (waiting for upstream responses that never came).
- Connection pools were exhausted (by connections held while waiting for upstream).
- Application-level timeouts were firing (cascading from upstream timeouts).

Without continuous JVM-level observability, teams spend hours diagnosing the JVM before realising the JVM is healthy and the problem lives elsewhere.

Eliya's response

Phase 1 today (25.0.3): Eliya activates heap-dump-on-OOM with a structured path, the NMT summary, and unlocked diagnostic VM options, so the post-incident forensic artefacts and the flags that JFR and async-profiler rely on are already in place. When an incident happens, an operator runs jcmd <pid> JFR.start settings=default duration=10m filename=<path> against the already-running JVM; no restart is required.
Phase 2 (planned): Continuous JFR with a 24-hour rolling buffer becomes the default under EliyaProfile=Production, eliminating the "start JFR and reproduce" step entirely; the recording covers the incident window without manual intervention.
JFR's event taxonomy (already in the JRE module set today) distinguishes the source of any thread block:
- jdk.SocketRead / jdk.SocketWrite: blocked on network I/O (likely upstream / downstream issue).
- jdk.FileRead / jdk.FileWrite: blocked on disk I/O.
- jdk.JavaMonitorEnter / jdk.JavaMonitorWait: blocked on synchronisation (likely application code).
- jdk.GarbageCollection: blocked by GC pauses.
- jdk.SafepointBegin: blocked by JVM safepoints.
- jdk.ThreadPark: parked by application-level concurrency constructs (e.g., java.util.concurrent locks, CompletableFuture chains).
Reading these events from a JFR recording immediately answers "is the JVM healthy or is something external?", the question that Pattern 3 outages otherwise burn hours on.
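A first pass over a recording can be automated with the standard jdk.jfr.consumer API. The sketch below groups the event types listed above into blocking sources and totals their durations; the category labels and the class name are made up for illustration, and the totals are cumulative across threads rather than wall-clock time.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Sketch: total blocked time per source in a JFR recording.
class BlockSourceSummary {

    /** Maps a JFR event type name to a blocking source (labels are illustrative). */
    static String classify(String eventType) {
        switch (eventType) {
            case "jdk.SocketRead": case "jdk.SocketWrite": return "network I/O";
            case "jdk.FileRead": case "jdk.FileWrite": return "disk I/O";
            case "jdk.JavaMonitorEnter": case "jdk.JavaMonitorWait": return "synchronisation";
            case "jdk.GarbageCollection": return "GC";
            case "jdk.SafepointBegin": return "safepoint";
            case "jdk.ThreadPark": return "park";
            default: return "other";
        }
    }

    /** Sums event durations per source; instant events contribute zero. */
    static Map<String, Duration> summarize(Path recording) throws IOException {
        Map<String, Duration> totals = new TreeMap<>();
        try (RecordingFile file = new RecordingFile(recording)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                String source = classify(event.getEventType().getName());
                if (!source.equals("other")) {
                    totals.merge(source, event.getDuration(), Duration::plus);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: BlockSourceSummary <recording.jfr>");
            return;
        }
        summarize(Path.of(args[0])).forEach((source, total) ->
                System.out.println(source + ": " + total.toMillis() + " ms"));
    }
}
```

If network I/O dominates while GC and safepoint totals stay small, the recording points away from the JVM and towards an upstream or downstream dependency.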

Pattern 3 and APM coexistence. Standard APM tools tell you that latency increased; they show the symptom. JFR tells you what the JVM was doing during that latency; it shows the cause. The two are complementary: APM identifies "something happened at 14:23"; JFR identifies "threads were blocked on socket reads to upstream service X." For the architectural distinction, see JVM forensics vs APM.

The lesson: continuous observability that's already running when incidents occur is qualitatively different from observability you have to enable after the fact. The latter requires reproducing the problem; the former lets you analyse the original incident.
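Where a rolling recording is wanted before it becomes a default, an application can approximate one in-process with the standard jdk.jfr API. This is a sketch of the general technique, not a description of how Eliya implements it (the source does not say); the recording name is hypothetical, and "default" refers to the JDK's bundled default settings profile.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import java.time.Duration;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

// Sketch: an always-on JFR recording that retains a bounded time window.
class RollingRecording {

    /** Starts a disk-backed recording that keeps roughly the last maxAge of events. */
    static Recording start(Duration maxAge) {
        Configuration settings;
        try {
            settings = Configuration.getConfiguration("default"); // JDK's bundled profile
        } catch (IOException | ParseException e) {
            throw new IllegalStateException("Could not load the default JFR settings", e);
        }
        Recording recording = new Recording(settings);
        recording.setName("rolling-incident-buffer"); // hypothetical name
        recording.setToDisk(true);   // data is written to the JFR repository on disk
        recording.setMaxAge(maxAge); // data older than this is discarded
        recording.start();
        return recording;
    }

    public static void main(String[] args) throws IOException {
        try (Recording recording = start(Duration.ofHours(24))) {
            // ... the application does its work here ...
            // On an incident, copy out the retained window; the recording keeps running.
            Path snapshot = Files.createTempFile("incident-", ".jfr");
            recording.dump(snapshot);
            System.out.println("Wrote " + snapshot);
        }
    }
}
```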

Pattern 4: Crashes preserve no evidence without configured paths

A recurring incident pattern: a JVM segfaults or hits an OutOfMemoryError. The JVM crashes. The container restarts. An on-call engineer investigates. What's available?

Without observability defaults:

- Heap dump on OOM is off by default (upstream OpenJDK default).
- The crash log hs_err_pid<PID>.log is written to the JVM's working directory, which in containers is often ephemeral and lost on restart.
- JFR is off by default, so no continuous record of what was happening exists.
- The container's stdout / stderr is the only evidence, and that captures application output, not JVM state.

Eliya's response

HeapDumpOnOutOfMemoryError is enabled by default with a structured path under ${ELIYA_DIAGNOSTIC_PATH}/${service}/${replica}/heap/.
The crash log (hs_err_pid<PID>.log) path points to ${ELIYA_DIAGNOSTIC_PATH}/${service}/${replica}/crash/.
Continuous JFR recording with dumponexit=true (Phase 2) will flush to disk on orderly shutdown and on fatal crashes where the JVM has the chance to write before exiting. A SIGKILL from the kernel OOM killer gives it no such chance.
When operators mount ${ELIYA_DIAGNOSTIC_PATH} as a persistent volume, all three artefacts survive container restarts.

The lesson: forensic data must be configured to land on persistent storage before the crash, not after. The post-incident investigation either has the data or it doesn't.
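One way to catch a missing or read-only volume before it matters is a startup check on the diagnostic directories. The sketch below is an assumption-laden illustration: it treats ELIYA_DIAGNOSTIC_PATH as an environment variable, uses placeholder service and replica names, and creates the heap/ and crash/ directories itself, none of which the source confirms as Eliya's own behaviour.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: verify that diagnostic output directories exist and are writable at startup.
class DiagnosticPaths {

    /** Creates heap/ and crash/ under root/service/replica and checks each is writable. */
    static List<Path> prepare(Path root, String service, String replica) {
        Path base = root.resolve(service).resolve(replica);
        List<Path> dirs = List.of(base.resolve("heap"), base.resolve("crash"));
        for (Path dir : dirs) {
            try {
                Files.createDirectories(dir);
            } catch (IOException e) {
                throw new UncheckedIOException("Cannot create " + dir, e);
            }
            if (!Files.isWritable(dir)) {
                throw new IllegalStateException("Diagnostic directory is not writable: " + dir);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        String root = System.getenv("ELIYA_DIAGNOSTIC_PATH"); // assumed to be an environment variable
        if (root == null) {
            System.out.println("ELIYA_DIAGNOSTIC_PATH is not set");
            return;
        }
        // "payments" and "replica-0" are placeholder values
        prepare(Path.of(root), "payments", "replica-0").forEach(System.out::println);
    }
}
```

Failing fast here surfaces a misconfigured mount at deploy time rather than during the first post-crash investigation.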

What these patterns have in common

All four share a structural property: they are visible in the data the JVM already produces, but only if observability is enabled before the incident occurs.

Phase 1 today (25.0.3) puts the post-incident forensic artefacts in place by default: heap-dump-on-OOM with structured paths, NMT summary, crash log path, container support reinforced, and diagnostic VM options unlocked so that JFR and async-profiler are one command away when you need them. Phase 2 (planned) makes the streaming data (continuous JFR with a 24-hour rolling buffer, unified GC logging) defaults too, so the recording covers the incident window without manual intervention.

The pattern that emerges from 15+ years of JVM operations is not "we need better debugging tools." It's "we need data that's already there when the incident happens." Eliya's design follows from this observation.

Kubernetes context. Each of these patterns is especially common in Kubernetes deployments: Pattern 1 hits hard because container memory limits are strict, Pattern 2 because pod resource constraints create variance, Pattern 3 because service-mesh layers add upstream complexity, and Pattern 4 because container restarts destroy ephemeral working directories. The patterns are general, but Kubernetes amplifies all of them.

How these patterns motivate future phases

The four patterns above motivate Eliya's Phase 1 operational-readiness defaults (shipped 25.0.3). They also motivate the Phase 2 / 3 / 4 trajectory:

- Phase 2 bundles Eclipse MAT (headless) for offline heap analysis (deeper Pattern 1 investigation) and async-profiler for hardware-counter profiling (deeper Pattern 2 investigation). Both run locally inside the security perimeter.
- Phase 3 Asymm Forensics will correlate JFR, heap, thread, GC, and crash artefacts, automating the "is this JVM or upstream?" diagnosis that Pattern 3 requires.
- Phase 4 compliance profiles add framework-aligned audit trails to all four patterns, so the same diagnostic data that resolves an incident also satisfies PCI DSS Requirement 10 / HIPAA Technical Safeguards / SOX IT general controls requirements.

The Phase 1 defaults are the foundation. Phase 2 / 3 / 4 capabilities are additive; they extend the same observability discipline this page describes.

About these patterns

These are recurring patterns from 15+ years of production JVM operations across telecom and enterprise systems. The technical specifics (heap sizes, pause-time ranges, deployment topologies) are representative ranges rather than single-event measurements.
