Troubleshooting Splunk Error “Search Process Did Not Exit Cleanly”
When Splunk displays an orange warning triangle instead of a chart or table, it is time to investigate. Start by clicking the triangle to bring up a dialog with the error message. In my case that looked like this:
Finding the Root Cause
In many cases, the best resource for troubleshooting Splunk searches is the Search job inspector. You can open it by clicking the i icon below a chart:
This opens the Search job inspector in a new browser tab. The top of the page summarizes search properties and lists the errors that have occurred:
Sometimes that is all you need. Not in this case, though. We have a distributed deployment with search head and indexer clusters. In such a scenario it is not always trivial to get to the right log file from the right machine.
Scroll to the bottom of the page and expand Search job properties. Scroll down once more. The last row has what we need: links to the relevant search logs on the indexers:
As you saw above, we got the same error on both our indexers, splunk-l3 and splunk-l4. We’ll just look at splunk-l3 for now by clicking its link in the additional info row of the Search job inspector. This is what we get:
07-20-2017 22:39:24.810 INFO dispatchRunner - Search process mode: preforked (reused process)
07-20-2017 22:39:24.811 INFO dispatchRunner - registering build time modules, count=1
07-20-2017 22:39:24.811 INFO dispatchRunner - registering search time components of build time module name=vix
07-20-2017 22:39:24.812 INFO BundlesSetup - Setup stats for /opt/splunk/var/run/searchpeers/13A7F4FB-8087-49CF-9097-2497E1AB27B3-1500582939: wallclock_elapsed_msec=67, cpu_time_used=0.0360000, shared_services_generation=2, shared_services_population=1
07-20-2017 22:39:24.812 INFO UserManager - Setting user context: splunk-system-user
07-20-2017 22:39:24.812 INFO UserManager - Done setting user context: NULL -> splunk-system-user
07-20-2017 22:39:24.812 INFO UserManager - Unwound user context: splunk-system-user -> NULL
07-20-2017 22:39:24.812 INFO UserManager - Setting user context: helge
07-20-2017 22:39:24.812 INFO UserManager - Done setting user context: NULL -> helge
07-20-2017 22:39:24.814 INFO dispatchRunner - search context: user="helge", app="uberAgent", bs-pathname="/opt/splunk/var/run/searchpeers/13A7F4FB-8087-49CF-9097-2497E1AB27B3-1500582939"
07-20-2017 22:39:24.814 INFO SearchParser - PARSING: tstats sum(Process_NetworkTargetPerformance.NetTargetSendMB) AS "Send volume (MB)" sum(Process_NetworkTargetPerformance.NetTargetReceiveMB) AS "Receive volume (MB)" sum(Process_NetworkTargetPerformance.NetTargetSendReceiveMB) AS "Send+Receive volume (MB)" from datamodel=uberAgent.Process_NetworkTargetPerformance where (nodename = Process_NetworkTargetPerformance) (Process_NetworkTargetPerformance.NetTargetRemotePort="*") (Process_NetworkTargetPerformance.AppName=*) (host="*") groupby Process_NetworkTargetPerformance.AppName prestats=true | addinfo type=count label=prereport_events | fields keepcolorder=t "Process_NetworkTargetPerformance.AppName" "Process_NetworkTargetPerformance.NetTargetReceiveMB" "Process_NetworkTargetPerformance.NetTargetSendMB" "Process_NetworkTargetPerformance.NetTargetSendReceiveMB" "prestats_reserved_*" "psrsvd_*" | prestats dedup_splitvals=t sum("Process_NetworkTargetPerformance.NetTargetReceiveMB") sum("Process_NetworkTargetPerformance.NetTargetSendMB") sum("Process_NetworkTargetPerformance.NetTargetSendReceiveMB") by "Process_NetworkTargetPerformance.AppName"
07-20-2017 22:39:24.873 INFO UserManager - Unwound user context: helge -> NULL
07-20-2017 22:39:24.874 ERROR dispatchRunner - RunDispatch::runDispatchThread threw error: Comparator '=' has an invalid term on the right hand side: NetTargetSendLatencyMs*NetTargetSendLatencyCount
Bingo! The last line of the log has an error message pointing to a problem with the expression NetTargetSendLatencyMs*NetTargetSendLatencyCount. At last we know what’s wrong.
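Here the error was easy to spot because the log was short. For longer search logs, a small script can pull out the interesting lines. The following is only a rough sketch in Python, not anything Splunk ships: the function name find_errors and the regular expression are my own, and the pattern simply assumes the "timestamp LEVEL component - message" layout seen in the log above. The sample text reuses three lines from that log.

```python
import re
from typing import List, Tuple

# Assumed layout, taken from the log excerpt above:
#   MM-DD-YYYY HH:MM:SS.mmm LEVEL component - message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<component>\S+) - (?P<message>.*)$"
)


def find_errors(log_text: str, levels=("ERROR", "FATAL")) -> List[Tuple[str, str, str]]:
    """Return (timestamp, component, message) for each line at one of the given levels."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        # Lines that do not fit the assumed layout are silently skipped.
        if match and match.group("level") in levels:
            hits.append(
                (match.group("timestamp"), match.group("component"), match.group("message"))
            )
    return hits


# Three lines copied from the search log shown above.
SAMPLE = """\
07-20-2017 22:39:24.810 INFO dispatchRunner - Search process mode: preforked (reused process)
07-20-2017 22:39:24.873 INFO UserManager - Unwound user context: helge -> NULL
07-20-2017 22:39:24.874 ERROR dispatchRunner - RunDispatch::runDispatchThread threw error: Comparator '=' has an invalid term on the right hand side: NetTargetSendLatencyMs*NetTargetSendLatencyCount
"""

if __name__ == "__main__":
    for timestamp, component, message in find_errors(SAMPLE):
        print(timestamp, component, message)
```

To use it on a real log, save the text behind the search log link to a file and pass the file's contents to find_errors.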
Fixing the Error
The search is against a data model, so let’s look for the problematic expression NetTargetSendLatencyMs*NetTargetSendLatencyCount in the app’s data model JSON file, located in the app subdirectory default/data/models. The expression is used in a field calculated by the following eval expression:
Exactly what Splunk was complaining about. Apparently Splunk has recently become a bit finicky when fields are NULL in calculations. Let’s replace the expression with something a bit safer:
if (isnotnull (NetTargetSendLatencyMs), if (isnotnull (NetTargetSendLatencyCount), NetTargetSendLatencyMs*NetTargetSendLatencyCount, null()), null())
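If the nested if reads a bit densely, the same logic can be written out in Python. This is purely an illustration of what the guarded expression is meant to do, not Splunk code: the function name safe_product is made up, and Python's None stands in for Splunk's NULL.

```python
from typing import Optional


def safe_product(latency_ms: Optional[float], latency_count: Optional[float]) -> Optional[float]:
    """Mirror the nested if/isnotnull guard: multiply only when both inputs are present."""
    if latency_ms is not None:
        if latency_count is not None:
            return latency_ms * latency_count
        return None  # count missing: yield NULL instead of attempting the multiplication
    return None  # latency missing: yield NULL


if __name__ == "__main__":
    print(safe_product(2.0, 3.0))   # both present
    print(safe_product(None, 3.0))  # first operand missing
    print(safe_product(2.0, None))  # second operand missing
```

The product is only computed when both fields have a value. In every other case the result is NULL.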
And that’s it. Happy splunking!