Breaking Java Badly — Programmatic JVM Deconstruction (Part Three)

The Instability is the Pattern

Previously, we demonstrated how a simple misconfiguration in a producer-consumer test frame can yield intermittent yet crippling instabilities. That demonstration drew on well over two thousand separate runs of two producer-consumer tests (Test One — LinkedList and Test Two — ArrayDeque), each paired with the two most popular OpenJDK 1.8 memory collectors (Parallel Old and G1). These instabilities form a pattern, and a perilous one if left undiagnosed and unremedied.
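For orientation, here is a minimal sketch of the producer-consumer shape these tests exercise. It is not the repository's actual harness; the class name, element count, and locking scheme are illustrative assumptions, so consult the source link below for the real tests.

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Minimal sketch: a producer adds payloads to a shared, unbounded
    // queue while a consumer drains it. ArrayDeque matches the Test Two
    // shape; swapping in java.util.LinkedList gives the Test One shape.
    public class PCSketch {
        static final long ELEMENT_COUNT = 1_000_000_000L; // illustrative
        static final Queue<long[]> queue = new ArrayDeque<>();

        public static void main(String[] args) throws InterruptedException {
            Thread producer = new Thread(() -> {
                for (long i = 0; i < ELEMENT_COUNT; i++) {
                    long[] payload = { i }; // allocation pressure per element
                    synchronized (queue) { queue.add(payload); }
                }
            });
            Thread consumer = new Thread(() -> {
                long seen = 0;
                while (seen < ELEMENT_COUNT) {
                    long[] payload;
                    synchronized (queue) { payload = queue.poll(); }
                    if (payload != null) seen++;
                }
            });
            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }
    }

When the producer wins the race for CPU time, the queue, and with it the tenured space, grows without bound. That is the failure mode examined below.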

Let’s drill down into the worst of these, run #11 from the Parallel Old/36M MCL set (see previous parts in this series), which took nearly ten times longer to execute than the usual median value of around 250 seconds. Runs such as these are often explained away as “one-off” events, but, as our test data shows, they are far from unique; they merely recur on long cycles. There are no “one-off” events in the cloud.

Source code for this article is located here — https://github.com/markraley/breaking-java-badly

Drilling Down — Single Worst Run, #11 (2,250 seconds)

Run #11 is the longest, and therefore most alarming, execution observed. An annotated plot is presented in figure one; the y-axis is the percentage of the tenured space (generation) in use, where the size of that space is determined by the Xmx and NewRatio JVM settings (see upper-left notation). The blue line is the percentage full at the start of each full garbage collection (FGC), and the gold line is the percentage full after the FGC completes.
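Plots like this can be reproduced with standard JDK 8 tooling; the heap values below are illustrative, not necessarily the settings used for this series.

    # Size the heap and tenured/young split, select Parallel Old, and
    # log every collection (JDK 8 flag spellings; values illustrative).
    java -Xmx512m -XX:NewRatio=2 -XX:+UseParallelOldGC \
        -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log PCSketch

    # Alternatively, sample occupancy live with jstat: column O is the
    # percentage of old (tenured) space in use; FGC and FGCT are the
    # full-collection count and cumulative time.
    jstat -gcutil <pid> 1000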

The tenured space fills rapidly and remains full. The many FGCs here occur because the producer thread has outrun the consumer, and the resulting long queue places a heavy bookkeeping burden on the Parallel Old collector. This plot also carries an important annotation — the red hatched area denotes a region in time where the JVM memory collector is running 88.6% of the time and is the proximate cause of the delayed test completion. The tenured space is at or near 100% full throughout this interval. These FGCs are frequent and, very importantly, temporally adjacent — one right after the other without significant delay. In the test setup used here (see series part one for details) the JVM by default uses around six of the available eight physical cores for any serious memory collection activity, so nearly all of the cores are running hot as well.
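That core count is not arbitrary: on JDK 8 the parallel collector sizes its GC worker pool from the machine's processor count. The ergonomically chosen default can be inspected, and pinned lower, with standard flags (the value shown is illustrative):

    # Inspect the computed default GC worker count.
    java -XX:+UseParallelOldGC -XX:+PrintFlagsFinal -version \
        | grep ParallelGCThreads

    # Pin it explicitly, e.g. to leave more cores free for the
    # producer and consumer threads.
    java -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 PCSketch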

Contrast this with the expected case, where only two cores are busy (one producer thread and one consumer thread) and collection activity is minimal. An example is plotted in figure two (below). Note that the MCL setting is the same for both plots.

We can count the FGCs visually and see that the tenured space never exceeds roughly 50% in use, with a run time of about 250 seconds. This is the intended performance envelope. While the producer and consumer threads are generally well balanced by the JVM, the producer sometimes runs more often, for reasons that usually lie deeper in the software stack the JVM depends on: thread scheduling, prioritization, and hardware resource availability among them. Figure one, then, demonstrates a kind of unintended self-denial of resources.

An Unintended Denial of Resources

Figure three gives us the bigger picture of the complete 36M test battery. It plots the median queue depth of each test run against its execution time, 72 data points in all. During these tests, the consumer emits a timestamp as well as the current queue depth every 32 million (32M) queue elements processed. The y-axis in figure three is the median of these samples on a per-run basis. Outliers in the dataset have their run numbers displayed.
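The sampling itself is simple to picture; the sketch below is illustrative rather than the repository's actual instrumentation.

    import java.util.Queue;

    // Sketch of the consumer-side sampling: every SAMPLE_INTERVAL
    // elements, emit a timestamp, the running element count, and the
    // current queue depth as one CSV row. Call under the queue lock
    // if the queue is not thread-safe.
    class DepthSampler {
        static final long SAMPLE_INTERVAL = 32_000_000L; // 32M elements

        static void onElementConsumed(long processed, Queue<?> queue) {
            if (processed % SAMPLE_INTERVAL == 0) {
                System.out.printf("%d,%d,%d%n",
                        System.currentTimeMillis(), processed, queue.size());
            }
        }
    }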

Viewing the data this way reveals a clear correlation between observed queue depth and high execution times, adding further confirmation that excess queue capacity is the proximate cause of the performance and instability issues under test. The rightmost seven points match the seven outliers seen in part one, figure one, column 36M, and all show long FGC walls. Once the instability is recognized as a pattern rather than a series of one-offs, the correct preventative steps may be undertaken.
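The most direct preventative step, sketched here rather than prescribed by this series, is to bound the queue so that the producer blocks instead of letting depth, and with it tenured occupancy, grow without limit. The capacity shown is an illustrative assumption.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A bounded queue applies backpressure: put() blocks the producer
    // once CAPACITY is reached, so queue depth, and the collector's
    // bookkeeping burden, cannot grow without limit.
    class BoundedHandoff {
        static final int CAPACITY = 1_000_000; // illustrative

        static final BlockingQueue<long[]> queue =
                new ArrayBlockingQueue<>(CAPACITY);

        static void produce(long[] payload) throws InterruptedException {
            queue.put(payload); // blocks while the queue is full
        }

        static long[] consume() throws InterruptedException {
            return queue.take(); // blocks while the queue is empty
        }
    }

With backpressure in place, a scheduling imbalance slows the producer down rather than filling the tenured space, keeping runs inside the envelope of figure two.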
