Workspace ONE Intelligence is an AWS-based cloud service that provides reporting, analytics, data visualization, and workflow orchestration to VMware customers. It receives data from a variety of VMware internal sources including our UEM (Unified Endpoint Management) platform, Workspace ONE Access (identity manager), mobile SDKs, Windows agents, and external third-party Trust Network integrations like Lookout, Netskope, and Wandera, to name a few.
The Intelligence automations platform was first released in 2017 and has since evolved into one of the key offerings of the Workspace ONE Intelligence product. It allows customers to work towards the goal of a fully autonomous, self-configuring, self-healing workspace by automating day-to-day workspace management tasks.
Today, we use a custom, in-house orchestration engine that executes simple, sequential tasks. These workflows can be triggered by changes in any system (VMware platform or supported third-party integration) that is integrated with Intelligence, as well as on demand or on a schedule. Workflows contain tasks that act on any of the connections we support today, or on a custom connection defined by the user. A catalog of custom connectors can be found on GitHub.
This post will discuss some of the challenges posed by the current automations engine, how we are solving those challenges, and an overview of our migration plan to move to a more robust and stable Conductor-based automations engine.
The Challenge – Complex Use Cases
Since its inception, the automations platform has gained a lot of momentum and now logs 1 billion workflow executions monthly, with customers demanding the capability to solve more complex use cases at a much larger scale.
For this effort, dubbed Freestyle Automation, we identified a few key areas to work on to meet these demands:
- Orchestration - to support complex workflows at a large scale, we need a robust and reliable orchestration engine.
- Distributed executions - the ability to hand off "sub workflows" to external systems when necessary.
- Workflow DSL (Domain Specific Language) - an intuitive language to represent complex workflows.
- Enhanced feature set - addition of parallel executions, conditional evaluations, customizable action inputs and outputs, etc.
Orchestration - In-House vs Open-Sourced Solutions
Our current orchestration capabilities are custom and implemented across a handful of microservices in our ecosystem. To have our in-house orchestration capabilities scale and match customer demand, we would require the following:
- Define a new workflow definition data structure that could represent complex workflows with action grouping, conditionals, etc.
- Revamp our workflow execution service to use a more enhanced state machine that would allow for the execution of these complex workflows.
- Add the ability to define custom execution parameters per action in a workflow - allow users to define retry configurations, customize action inputs per execution, customize workflow outputs per execution, etc.
- Support the capability for asynchronous response handling when actions against external services are resource-intensive.
To summarize, we would need to build an entirely new workflow execution platform that met the above requirements.
To avoid reinventing the wheel, we investigated open-source options that had the capabilities we were looking for and could scale efficiently to meet our large execution volumes.
The Solution – Netflix Conductor
Thorough research of existing open-source orchestration engines led us to Netflix Conductor, an open-source GitHub project that allows users to define complex workflows using an intuitive DSL and execute them on "Worker" services.
Conductor offers a robust set of features, a few of which were particularly useful for our use case.
- It functions as a state machine to determine what task/step in a workflow can be scheduled next.
- It exposes the capability to set custom timeout and retry configuration per task in a workflow.
- It supports a number of workflow features - forked tasks, conditionals, eventing, etc.
Overall, Conductor provides a base framework that is easily customizable.
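To make these features concrete, here is a minimal sketch of what a Conductor-style workflow definition looks like, expressed as a Python dict rather than raw JSON. The workflow and task names are hypothetical, not from the actual product; the task types (SWITCH, FORK_JOIN, JOIN, SIMPLE) are the ones Conductor's DSL provides for conditionals and parallel execution:

```python
# A hypothetical Conductor-style workflow: check a device's compliance,
# and if it is non-compliant, run two remediation tasks in parallel.
tag_noncompliant_device = {
    "name": "tag_noncompliant_device",
    "version": 1,
    "tasks": [
        {
            "name": "check_compliance",
            "taskReferenceName": "check_compliance_ref",
            "type": "SIMPLE",
        },
        {
            # Conditional: branch on the output of the previous task.
            "name": "is_compliant",
            "taskReferenceName": "is_compliant_ref",
            "type": "SWITCH",
            "inputParameters": {
                "status": "${check_compliance_ref.output.status}"
            },
            "decisionCases": {
                "NON_COMPLIANT": [
                    {
                        # Fork: run both remediation branches in parallel.
                        "name": "remediate_fork",
                        "taskReferenceName": "remediate_fork_ref",
                        "type": "FORK_JOIN",
                        "forkTasks": [
                            [{"name": "notify_slack",
                              "taskReferenceName": "notify_slack_ref",
                              "type": "SIMPLE"}],
                            [{"name": "open_servicenow_ticket",
                              "taskReferenceName": "open_ticket_ref",
                              "type": "SIMPLE"}],
                        ],
                    },
                    {
                        # Join: wait for both branches before continuing.
                        "name": "join_remediation",
                        "taskReferenceName": "join_ref",
                        "type": "JOIN",
                        "joinOn": ["notify_slack_ref", "open_ticket_ref"],
                    },
                ]
            },
            "defaultCase": [],
        },
    ],
}
```

Conductor's state machine walks this structure to decide which task can be scheduled next, and per-task timeout and retry settings can be layered onto each entry.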
Figure 1: Netflix Conductor Runtime Model (from https://conductor.netflix.com/devguide/architecture/index.html)
To deploy Conductor as part of the Intelligence Automations Platform, we had two options:
- Use the cloud-hosted deployment of Conductor powered by Orkes - Orkes provides a cloud-hosted deployment of Conductor that is configurable and uses the open-source Conductor GitHub code base.
- Mirror the Conductor GitHub library to build and deploy the service using our internal build & release framework.
We chose option (2), which gave us more control over the deployments of the underlying technology used by Conductor. This worked well for us: we could configure Conductor to use underlying technologies consistent with the rest of our data platform. Deploying our custom, mirrored version of Conductor also allowed us to make it an internal-only service that is not exposed to the internet.
Supporting Services for Conductor-based Workflow Executions
Having decided on Conductor as our orchestration engine, we also need a few services that support workflow executions.
Connector Manager Service
Intelligence provides the capability for users to set up connectors into external services where actions can be taken. We support several "managed" connectors and actions into internal VMware services and third-party integrations like Slack and ServiceNow. We also allow users to upload custom Postman collections to define new actions into any other service. All these connections and their credentials are managed by the new Connector Manager Service, which holds all the data required to execute an action. Because some actions are resource-intensive and cannot complete synchronously when invoked via an API, supported actions also provide the configuration to wait on an asynchronous response from the external service.
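The data a connector record carries can be sketched roughly as follows. This is a simplified, hypothetical model (field names and the vault-style credential reference are illustrative, not the service's actual schema), but it shows the two ideas described above: credentials are held by reference rather than inline, and each action declares whether it completes synchronously or waits on an asynchronous response:

```python
from dataclasses import dataclass, field

@dataclass
class ActionDefinition:
    """One invokable action on a connector (hypothetical model)."""
    name: str
    method: str            # HTTP method, e.g. "POST"
    url_template: str      # endpoint path the action calls
    # When True, the worker does not mark the task complete on invocation;
    # it waits for an asynchronous status callback instead.
    async_response: bool = False

@dataclass
class Connector:
    """A managed or custom connection into an external service."""
    connector_id: str
    base_url: str
    credential_ref: str    # reference to stored credentials, never raw secrets
    actions: dict = field(default_factory=dict)

# Example: a hypothetical Slack connector with one synchronous action.
slack = Connector(
    connector_id="slack-prod",
    base_url="https://slack.example.com/api",
    credential_ref="vault://connectors/slack-prod",
    actions={
        "post_message": ActionDefinition(
            name="post_message",
            method="POST",
            url_template="/chat.postMessage",
        )
    },
)
```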
Ledenika (Coming Soon)
Ledenika is a new service that is the source of truth for all workflow definitions (user-created and system-owned). It stores workflows in Postgres using a data structure based on the Conductor DSL. It is important to note here that we specifically chose not to use the Conductor open-source data structures for this persistence layer, to reduce dependencies on a third-party orchestrator. Should we choose to move to a different orchestration service in the future, translating persisted workflow definitions to a newer format would be one less thing to worry about.
This service is responsible for constructing the appropriate request to start a workflow execution on Conductor.
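The translation Ledenika performs can be sketched like this: the persisted, orchestrator-agnostic definition is mapped to a Conductor start-workflow request only at execution time, so the persistence layer never depends on Conductor's own types. All function and field names below are hypothetical, and the request shape is simplified relative to Conductor's actual API:

```python
def to_conductor_task(step: dict) -> dict:
    """Map one internal, orchestrator-agnostic step to a Conductor task."""
    return {
        "name": step["action"],
        "taskReferenceName": f'{step["action"]}_{step["id"]}',
        "type": "SIMPLE",
        "inputParameters": step.get("inputs", {}),
    }

def build_start_request(workflow: dict, trigger_payload: dict) -> dict:
    """Construct the request body used to start an execution on Conductor."""
    return {
        "name": workflow["name"],
        "version": workflow.get("version", 1),
        "input": trigger_payload,
        "tasks": [to_conductor_task(s) for s in workflow["steps"]],
    }

# Hypothetical workflow: lock a device when a trigger fires.
request = build_start_request(
    {"name": "lock_device", "steps": [
        {"id": "1", "action": "uem_lock",
         "inputs": {"device_id": "${trigger.device_id}"}},
    ]},
    {"device_id": "abc-123"},
)
```

The key design point is that the Conductor-specific shape exists only in this translation layer; swapping orchestrators would mean rewriting these two functions, not re-migrating the persisted definitions.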
Whiteclaw (Coming Soon)
Whiteclaw is the worker service: all tasks scheduled by Conductor are picked up by Whiteclaw for execution. Whiteclaw is intended to be a stateless service that aggregates the data required to run a single task and then executes it. Task execution results are posted back to Conductor to advance the workflow to the next step when configured to do so; if an action is configured to use asynchronous responses, Whiteclaw refrains from marking the task successful on Conductor. The service also maintains a cache of service health and open connections. When a task execution completes with an error, the error is examined to determine whether it is terminal or retryable, and whether requests to the external service need to be paused for some amount of time. This is particularly useful when we are rate limited or acting on a service that is undergoing maintenance.
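The error handling described above can be sketched as a small classifier: given the HTTP status from the external service, decide whether the failure is terminal, retryable, or should pause all further requests to that service. The status-code mapping and pause durations here are illustrative assumptions, not Whiteclaw's actual policy:

```python
from enum import Enum
from typing import Optional, Tuple

class Verdict(Enum):
    TERMINAL = "terminal"    # do not retry; fail the task on Conductor
    RETRYABLE = "retryable"  # let the task's Conductor retry config kick in
    PAUSE = "pause"          # back off all requests to this external service

def classify_error(status_code: int,
                   retry_after: Optional[float] = None) -> Tuple[Verdict, float]:
    """Return a verdict and a pause duration in seconds (0 when not pausing)."""
    if status_code == 429:
        # Rate limited: pause requests to the service, honoring Retry-After.
        return Verdict.PAUSE, retry_after or 60.0
    if status_code == 503:
        # Service likely under maintenance: pause for a longer window.
        return Verdict.PAUSE, 300.0
    if 500 <= status_code < 600:
        # Other server errors are worth retrying.
        return Verdict.RETRYABLE, 0.0
    # Client errors (bad credentials, bad request) will not succeed on retry.
    return Verdict.TERMINAL, 0.0
```

A pause verdict would feed the service-health cache mentioned above, so subsequent tasks against the same connection are held back rather than failing in a burst.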
TacoTruck (Coming Soon)
This service is responsible for collecting and processing the results of asynchronous action executions in external services. In cases where a workflow involves an action that needs to be asynchronously executed in an external system, this system provides Intelligence with the status of the execution via a REST endpoint exposed specifically for this use case. TacoTruck collects the execution statuses posted to this endpoint. When the execution of the action is complete, TacoTruck posts the execution result to Conductor so it can advance to the next step. When the execution is still ongoing, TacoTruck persists the intermediate execution logs for user visibility. This allows us to supplement Conductor execution logs with logs obtained during action execution in external services.
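TacoTruck's handling of one incoming status post can be sketched as follows. The status values, handler signature, and in-memory stand-ins for the log store and Conductor client are all hypothetical; the sketch shows the two behaviors described above: intermediate logs are always persisted for user visibility, and Conductor is only updated once the external execution reaches a terminal status:

```python
from typing import Optional

# In-memory stand-ins for the persistence layer and the Conductor client.
execution_logs = {}   # task_id -> list of log lines
completed = {}        # task_id -> final result posted to Conductor

def post_to_conductor(task_id: str, result: dict) -> None:
    """Stand-in for updating the task's status on Conductor."""
    completed[task_id] = result

def handle_status_update(task_id: str, status: str, log_line: str,
                         output: Optional[dict] = None) -> None:
    """Process one asynchronous status update from an external system."""
    # Always persist intermediate logs so users can follow the execution.
    execution_logs.setdefault(task_id, []).append(log_line)
    if status in ("COMPLETED", "FAILED"):
        # Terminal status: report back so Conductor can advance the workflow.
        post_to_conductor(task_id, {"status": status, "output": output or {}})

# A hypothetical two-update sequence for one long-running task.
handle_status_update("t1", "IN_PROGRESS", "enrolled device located")
handle_status_update("t1", "COMPLETED", "device locked", {"locked": True})
```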
Figure 2: Workspace ONE Intelligence Freestyle Automations Architecture
Migrating to Conductor-based Workflow Executions
With all the above services in place, we now have a brand-new workflow execution pipeline.
The next challenge is to migrate our existing workflows and executions over to the new pipeline with no customer impact. There are two main requirements for this migration:
- Workflows should continue to provide the same execution result as they previously did.
- No new user setup should be required.
With this in mind, we planned the migration in three phases.
Phase 1: Migrate all existing connector integration to Connector Manager Service
All the external connections we supported needed to be moved to the new Connector Manager Service. This was accomplished using a data translation utility and an ongoing scheduled job that collected all existing connections, translated them to the new model, and moved them to Connector Manager Service.
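The translation job can be sketched as a batched, idempotent loop, so repeated scheduled runs never re-migrate a connection. The legacy field names and the shape of both records are hypothetical:

```python
def translate_connection(legacy: dict) -> dict:
    """Map a legacy connection record to the Connector Manager model."""
    return {
        "connector_id": legacy["id"],
        "base_url": legacy["host"],
        "credential_ref": legacy["credential_id"],
    }

def run_migration_batch(legacy_rows: list, already_migrated: set) -> list:
    """Translate connections not yet migrated; safe to re-run on a schedule."""
    migrated = []
    for row in legacy_rows:
        if row["id"] in already_migrated:
            continue  # idempotency: skip records a previous run handled
        migrated.append(translate_connection(row))
    return migrated

batch = run_migration_batch(
    [{"id": "c1", "host": "https://sn.example.com", "credential_id": "k1"},
     {"id": "c2", "host": "https://slack.example.com", "credential_id": "k2"}],
    already_migrated={"c1"},
)
```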
Phase 2: Migrate existing workflows to use the new workflow definition data structure
Our existing workflows are based on a custom data structure and live in a legacy service called Sweetwater. This phase involves the following steps:
- Translate all existing workflows to the new domain model and persist them in the new Workflow Definition Service (Ledenika).
- This is accomplished using a data translation utility and an ongoing scheduled job that selects a batch of workflows to translate and move to Ledenika each minute.
- Update our UI to use the new format of workflows.
- The UI is provided with all new APIs that can replace the existing legacy APIs.
- Update our existing execution pipeline to rely on the new Ledenika service to obtain workflows for execution.
- Our existing, custom orchestration pipeline will now rely on Ledenika to provide the workflow definitions to be executed. To minimize the footprint of affected services, we convert the workflows back into the older format that our custom orchestration engine understands.
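The backward conversion in the last step can be sketched as follows. Since the legacy engine only ran simple, sequential tasks, the converter flattens the new, Conductor-DSL-based structure and rejects anything the old engine could not run. All names and record shapes here are hypothetical:

```python
def to_legacy_format(workflow: dict) -> dict:
    """Convert a Ledenika (Conductor-DSL-based) definition back to the
    flat, sequential structure the legacy engine understands."""
    legacy_actions = []
    for task in workflow["tasks"]:
        if task["type"] != "SIMPLE":
            # The legacy engine has no forks or conditionals.
            raise ValueError(f"legacy engine cannot run task type {task['type']}")
        legacy_actions.append({
            "action_name": task["name"],
            "inputs": task.get("inputParameters", {}),
        })
    return {"workflow_name": workflow["name"], "actions": legacy_actions}

legacy = to_legacy_format({
    "name": "notify_on_enrollment",
    "tasks": [
        {"name": "send_slack_message", "type": "SIMPLE",
         "inputParameters": {"channel": "#ops"}},
    ],
})
```

Keeping this conversion at the boundary lets the rest of the legacy pipeline run unchanged during Phase 2, until Phase 3 replaces it entirely.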
Phase 3: Replace the custom orchestration engine with Conductor + Whiteclaw
This final phase involves moving the orchestration and execution over to the new pipeline.
- Update our workflow trigger services to trigger the new pipeline (we support on-demand, scheduled, and data-based triggers). Each trigger must now start workflows through Ledenika.
- Update our UI to use a newer format of workflow execution logs as supported by Conductor.
Distributed executions of complex workflows allow the Intelligence Automations Pipeline to delegate work to the systems best suited to perform it. This achieves two main objectives:
- Tasks are executed in the systems that own the resources being acted upon, ensuring the integrity of the executions and a quicker turnaround time.
- A larger set of actionable resources can easily be supported.
Our first venture in this area is an integration with Workspace ONE UEM. This integration will provide users with the ability to act on UEM-managed devices based on several parameters internal to a device.
This integration and its design are actively being worked on and will use our new TacoTruck service to asynchronously communicate task execution results back into Intelligence and Conductor.
We have been hard at work behind the scenes for the last year building support for this new orchestration engine, with the goal of providing new and improved features for end users while supporting scalable, robust workflow executions with improved error handling.
With many parts of this new automations pipeline undergoing integration testing, and performance testing scheduled to start shortly, we're excited to report that the new pipeline is a great first step towards solving the problems we set out to tackle.
From a technical standpoint, we now have services that are each responsible for a specific set of operations, with data models that scale well. The new service footprint also allows us to independently scale different parts of our system as required. From a feature standpoint, we can now provide enhanced workflow authoring capabilities and are well-poised to support distributed executions.
Our next task will be to monitor the behavior of this pipeline under high-volume performance testing and iron out any kinks as necessary. We are also evaluating integrating Conductor with a robust queuing service so executable tasks may be queued outside of Conductor while they wait on a Worker node. This would allow us to scale Conductor and our worker nodes independently of each other.
Stay tuned for the next update on this exciting enhancement for Workspace ONE Intelligence.