Visualizing the Oplog with Splunk

I recently found myself investigating a mysterious occurrence in a production Meteor application. Seemingly randomly, without any obvious connection to user activity or periodic job activity, our Meteor server would spike to one hundred precent CPU consumption and stay pegged there until it was restarted.

After investigating nearly every hunch I could come up with, I was left with very few courses of action. My final theory was that a massive influx of MongoDB operations were flooding into our database. Any concerned observers listening within our Meteor application would be overwhelmed trying to catch up with the changes and consume all available CPU cycles on the server.

In order to test this theory, I wanted to plot the MongoDB Oplog as a time series chart and compare it against the timeline of known CPU spikes, looking for any correlations.

I had many options for how to approach this problem, but I decided to use Splunk to visualize and explore the Oplog data. I’m very happy with how Splunk performed, and I can see myself using it again.

I was interested in all Oplog events that happened in the twenty-four hour period surrounding a known CPU spike at 22:55 UTC on April 23rd, 2018. I fired up Studio 3T and ran the following query against the oplog.rs collection of my MongoDB database:


db['oplog.rs'].find({
  $and: [
    {ts: {$gte: new Timestamp(1524480600,1)}},
    {ts: {$lte: new Timestamp(1524567000,1)}}
  ]
});

The above query returned over seven hundred fifty thousand results, which I was able to export into a JSON file using Studio 3T (Studio 3T is the only MongoDB client I’ve found that supports saving an entire set of query results to file).

Once those seven hundred fifty thousand Oplog events were exported to disk, I was able to upload them directly into a Splunk index. Splunk gracefully parsed the JSON data and flattened each object into a neatly searchable collection.

With the data available in Splunk, I was free to start exploring.

My first step was to plot a time chart of all of the Oplog events. Given the large amount of data I was working with, I decided to bin my events into five minute bands:

index="oplog"
| timechart span=5m count

An overview of our data.

Interestingly, an obvious dip in Oplog events occurred around the time of the observed CPU spike. This is the exact opposite of what I expected to see given my working hypothesis.

Zooming in on the dip.

Investigating further, I decided to plot a time series for every type of Oplog event, based on the op field:

index="oplog"
| timechart span=1m count by op

To improve clarity, I also focused on a narrower time range, reduced my bin size, and switched to a log scale Y axis.

Everything working as intended.

This new chart shows that insert (i) and update (u) operations completely stop during the dip, but no-op (n) operations continue as usual. This seemed to indicate that the database was healthy, but the Meteor application stopped making insert and update requests.

This makes sense. If our server was eating up available CPU cycles, it probably wouldn’t find the time to query the database.

After visualizing the Oplog events around several of these CPU spikes, it became painfully obvious that my working hypothesis was not correct. There wasn’t any influx of database operations prior to a spike, and any dips in database activity were easily attributable to server restarts.

So now we’re back to square one. Was all of this for nothing?

Absolutely not!

When you’re investigating a problem, proving that something is not the cause of the problem can be incredibly valuable. By repeatedly narrowing down the possible set of culprits, we simplify the problem in our minds and make the real cause that much easier to find.

After spending more time digging into this issue, I’m convinced that it’s related to “fork bombs” crippling the server, and discussed in this issue filed against the Meteor project.

That said, this exploration proved to be incredibly valuable. By proving to myself that obverserver overload was not the cause of the spikes, I was able to rule out a huge swatch of potential fixes.

I was also able to spend some time trying out a fantastic new tool. I’m sure I’ll find myself using Splunk again in the future.