Early last month (4/2), our Splunk implementation started experiencing high load: the indexers’ queues filled and they stopped accepting data, and it eventually got bad enough that searches were affected. Splunk is a crazy experience here due to the diverse, dynamic nature of attempting to ingest data from about 100 fairly independent I.T. shops affiliated with my employer, a very large educational institution. I love that I get to focus on one system and try to bring order to its chaos, and having it go down was one of the worst feelings in the world.
Quick Shoutout to My Co-Workers: Thanks for Your Understanding
I don’t remember exactly why, but my supervisor, who is also dedicated to Splunk, was unavailable, so it came down to me. Luckily, I work on an amazingly talented team (I have no idea how I was able to trick them into letting me join), so their attitude was a very practical: it happens, work on getting it back up ASAP.
One co-worker, Lincoln Rutledge, popped his head in from next door to make sure I was aware of the situation, and when I said yep, but I was coming up empty-handed on a root cause, he dropped what he was doing for over an hour to poke around and help come up with ideas. Especially considering he’s focused more on firewall/network work and, from a Splunk perspective, is a user rather than an admin, it was REALLY appreciated.
Our primary purpose is security, so our “real” users are my own team; the users in the various departments view us as a bit of a bonus since, from their perspective, we’re a free service. A couple have come to depend on us, so they inquired when we’d be available again; a couple commented that the old methods continued to work, but they really appreciated Splunk as it was easier/better. So as far as I.T. goes, not a bad reaction, but having your service down is still just a bad situation and a hit to professional pride.
To make matters worse, over a month later, we still don’t have an answer. Support’s best guess is it’s IOPS. I disagree, so I’m in a bit of a hurry to get hard data to either change my position to blaming IOPS (hey, I like to be right, so I’ll happily turn coat if the data supports it) or to move on to sniffing around for the real root cause. Support pointed us towards an IOPS app [https://github.com/dataPhysicist/iops] that they wanted us to install. While it sounds like there should be two apps, Splunk’s support has said it is safe to deploy this one app on both the search heads and the indexers.
Using the app, I got 42,587 IOPS on a search head, which is VERY high. The instructions for the app suggest running it against a block device to avoid disk caches, but that requires root access, and we don’t run Splunk as root. As a temporary measure, I manually ran it as root and got back 600 IOPS on the search head and 1,800 IOPS on an indexer. Since support was questioning our ability to hit 1,000 IOPS on the indexers, these measurements gave me a warm fuzzy feeling.
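To illustrate why the cached number is meaningless, here’s a rough sketch of the cache effect using plain `dd` rather than the iops app itself (the filename and sizes are made up for the example; the app’s instructions do the equivalent by reading the raw block device, e.g. `/dev/sda`, which is why root is needed):

```shell
# Create a 64 MB scratch file to read back.
dd if=/dev/zero of=iops_sample.bin bs=1M count=64 status=none

# Cached read: the data we just wrote is almost certainly still in the page
# cache, so this is blazingly fast and says nothing about the disk.
dd if=iops_sample.bin of=/dev/null bs=4k status=none

# Direct read: O_DIRECT bypasses the page cache and measures the storage
# itself. (Not every filesystem supports O_DIRECT, hence the || true.)
dd if=iops_sample.bin of=/dev/null bs=4k iflag=direct status=none || true

rm -f iops_sample.bin
```

The gap between the two reads is the same gap as between my 42k figure and the 600–1,800 figures.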
For whatever reason, support is concentrating on the 42k IOPS number and (correctly) saying it’s unrealistic and must be the result of caching. One of their objections has been that I ran the command manually. I’m not convinced that really matters, but sometimes it’s easier to just go with the flow.
IOPS App “Improved”
I’ve forked dataPhysicist’s app and am working on modifying it to work with sudo. The basic idea is to configure sudo so that the splunk user can run the iops command as root without requiring a password. Initially, I was just creating an sh wrapper script; the interesting part is trying to find the correct python to invoke, so I’m now exploring a python wrapper so I can just re-use the same interpreter and know it’s the correct one.
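For the sh-wrapper approach, a minimal sketch looks like the following. Note that the filenames, the `$SPLUNK_HOME` app path, and the sudoers rule are my assumptions for illustration, not necessarily the actual layout of the fork:

```shell
# Generate the hypothetical wrapper the splunk user would invoke.
cat > iops_wrapper.sh <<'EOF'
#!/bin/sh
# Re-invoke the bundled iops script as root via passwordless sudo.
# Requires a sudoers rule limited to exactly this one command, e.g.:
#   splunk ALL=(root) NOPASSWD: /opt/splunk/etc/apps/iops/bin/iops.py
APP_BIN="${SPLUNK_HOME:-/opt/splunk}/etc/apps/iops/bin"
exec sudo "$APP_BIN/iops.py" "$@"
EOF
chmod 755 iops_wrapper.sh
```

Scoping the sudoers entry to the single script (rather than `ALL`) is what keeps this from being a blanket root grant — with the caveats below.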
Not So Improved
So there are a couple of big caveats with this version:
- It will only work on systems with sudo; this means it will not work on Windows.
  - Honestly, if you are concerned about performance, you really shouldn’t be running Splunk on Windows. I believe Splunk says there’s about 20% higher overhead from the operating system, and the number of connections it can handle is far lower.
    I understand sometimes there are other considerations; before October I was in a Windows shop for about five years, and suggesting a system use Linux would have been a show stopper (oddly enough, Tomcat and JBoss weren’t necessarily show stoppers, so we had several instances running on Windows, which is probably worse than running Splunk on Windows).
  - At the end of the day, I needed a solution, so I worked on something that works for me and is hopefully useful for others. If you have a suggestion on how to make it Windows-compatible, please let me know.
- If the script is writable/updatable by the Splunk user and is listed in the sudoers file, then you are effectively giving the Splunk user unrestricted access as root. This means you either give up a bit of security (Splunk does officially recommend you not run as root; this is just adding some obscurity to prevent root access from Splunk) OR you ensure the file isn’t writable/updatable by Splunk, which means manually deploying it.
  - I’m assuming you’re using the deployment server when I say you must manually deploy it; that claim is based on the fact that the deployment client runs as the same user as the rest of Splunk. If you are using a different deployment tool (e.g. Puppet, CFEngine, Chef), then you could use it to deploy this application without effectively granting Splunk unrestricted root access.
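If you do hand-deploy, the lockdown amounts to making root own the sudo-target script so the splunk user can execute but never modify it. A sketch, demonstrated on a local stand-in file (in practice you’d run the `chown` as root against the real path under `$SPLUNK_HOME`, which is an assumed layout here):

```shell
# Stand-in for the script named in the sudoers rule.
touch iops.py
chmod 755 iops.py   # owner rwx; everyone else read/execute only

# chmod alone is NOT enough: if the splunk user owns the file, it can simply
# chmod it back and edit it. The essential step is root ownership, e.g.:
#   chown root:root /opt/splunk/etc/apps/iops/bin/iops.py   # run as root
ls -l iops.py
```

With root ownership plus 755, the splunk user can run the script via sudo but any change to it requires root in the first place, which closes the escalation loop.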