Troubleshooting OpenSearch Cluster Health
OpenSearch is an open source search and analytics solution used within . Although OpenSearch is a third-party product, on-premises Workspace ONE Access administrators can benefit from a generic OpenSearch troubleshooting guide.
You can search the Internet for more tips and tricks on how to troubleshoot your Workspace ONE Access and OpenSearch environment. OpenSearch is a fork of the previously used Elasticsearch used in Access and vIDM versions prior. Access migrated from Elasticsearch to OpenSearch 1.x with version 22.x and to version 2.x with the 23.x release.
This guide is intended for experienced IT administrators of existing environments. Knowledge of Workspace ONE Access and OpenSearch/Elasticsearch is assumed.
Troubleshooting OpenSearch Cluster Health
Most issues with OpenSearch and Workspace ONE Access arise when you create a cluster. It is common to see the following errors in the System Diagnostic/Resiliency page:
First test some common issues affecting OpenSearch health around disk usage and RabbitMQ.
Due to the default policy of never deleting old data, the /db filesystem usage will continue to grow until it runs out of space, so its important to monitor how much free space there is on the /db filesystem before it gets too full. Elasticsearch will stop writing documents when it get 85% full (set via the cluster.routing.allocation.disk.watermark.low setting) and RabbitMQ will stop working when it gets down to just 250MB free (set via the disk_free_limit setting). Details on how to determine the growth rate per day and how to change that policy is in the troubleshooting section.
Check the free disk space for the /db filesystem with the command:
The rough calculation of the per-day growth rate described in the "Avoid disk space issues" section will let you determine how many days before you run out of space and need to either add space or adjust the data retention policy.
A very early indicator that something is wrong is if the size of pending documents in the analytics queue is growing. We use RabbitMQ for other types of messages, which go in their own queues, but the analytics queue is the only one that does not have a TTL, so is the only one that can grow. This is so its messages never get deleted and the audit records never get lost. Typically the analytics queue should have just 1 or 2 messages waiting to be processed, unless something is wrong or there is a temporary spike in records being generated (eg, during a large directory sync the queue might temporarily grow if the system cannot keep up). The queue size should be checked on each node with the command:
rabbitmqctl list_queues | grep analytics
If the number reported is larger than 100, its time to investigate why - see the "RabbitMQ analytics queue is large" section. This is also reported as the "AuditQueueSize" value in the health API: /SAAS/API/1.0/REST/system/health/
Connect to the console (either via vSphere remote console or SSH) of your Workspace ONE Access virtual appliances. After you log in as ROOT, run the following command to verify the cluster health:
The status is usually either
red. (If you have a single Workspace ONE Access node, the cluster health is always
yellow because there is no cluster).
The different status indicators.
everything is good, there are enough nodes in the cluster to ensure at least 2 full copies of the data spread across the cluster.
functioning, but there are not enough nodes in the cluster to ensure HA (eg, a single node cluster will always be in the yellow state because it can never have 2 copies of the data).
broken, unable to query existing data or store new data, typically due to not enough nodes in the cluster to function or out of disk space.
You can also see how many unassigned shards your cluster has.
Next, verify that all three nodes report the same primary cluster. Run the following command on each of the nodes:
Compare the output between all three nodes. A common reason for a
yellow or red status on the OpenSearch cluster is that all nodes do not share a common view of the cluster.
You might see that one node does not list the same node as the primary and it might not even list the other nodes. This node is most likely the culprit of a yellow state.
Most of the time, a simple restart of the OpenSearch service is enough to bring the node back into the cluster. But also do the previous mentioned checks on disk space and free up or add disk space if required.
On the node at fault, run the following commands, first checking the status of the service:
service opensearch status
Then stop the opensearch service:
service opensearch stop
Wait a few minutes and then run:
service opensearch start
Give OpenSearch time to start and verify that all nodes now report the same primary cluster and that the primary cluster lists all nodes:
Next, check the cluster health again:
For red status it can be the same issue as above or a master node misbehaving and not accepting connections from the other nodes. Determine the master node and restart the service on that node. One of the other two nodes should immediately become the master (re-run the curl on the other nodes again to verify) and the old master will re-join the cluster as a regular node when it comes up.
If you need to check the logs further be aware of the changes to the configuration and log locations after the switch from Elasticsearch to OpenSearch:
After you have completed the previous steps, you should have a system that reports each status as
green and has no unassigned shards.
Finally, each component on your systems diagnostic page reports a successful green status.
In any case, if you still have unassigned shards, this must be fixed. DO NOT DELETE UNASSIGNED SHARDS. You need to address the underlying cause, then the unassigned shards will be resolved by the cluster on its own. It is extremely rare for unassigned shards to be a real issue needing direct action. Here is an excellent article on the causes of unassigned shards and how to diagnose the cause:
Summary and Additional Resources
This guide provided basic steps to troubleshoot OpenSearch cluster health in your Workspace ONE Access environment.
The following updates were made to this guide.
Description of Changes
• Updated with OpenSearch details.
• Guide was published.
About the Author
This guide was written by:
The original version was written by Peter Bjork, Principal Architect, EUC Division Broadcom.