AWS Step Functions is an AWS service that enables you to orchestrate tasks across AWS services. You can use it to break processes into steps that can accept inputs from another service and provide outputs to the next.
AWS Step Functions allows you to design and build the flow of execution of serverless modules in an application. This lets developers focus on ensuring that each module performs its intended task, without having to worry about wiring each module to the others.
Step Functions workflows can be created and controlled using a visual workflow editor. With this editor, you can simplify application development by creating state machine diagrams. These diagrams enable you to build, share, and modify application behavior with minimal effort. Once your steps are configured, Step Functions automatically triggers and tracks them. If a step fails, Step Functions can retry it automatically according to the retry rules you define.
Step Functions provides quite a bit of convenient functionality: automatic retry handling, triggering and tracking for each workflow step, and ensuring steps are executed in the correct order.
How does AWS Step Functions work?
Step Functions is built on four main concepts: state machines, states, tasks, and activities.
The state machine is the main component of Step Functions. It defines the flow of your process and directs the workflow you create. State machines are defined in JSON and can be built and run through the API or through the AWS Console.
A state machine is defined using the JSON-based Amazon States Language. When an AWS Step Functions state machine is created, it stitches the components together and shows the developers their system and how it is being configured.
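As an illustration of what such a definition looks like, the following sketch builds a two-state machine as a Python dictionary and serializes it to Amazon States Language JSON. The Lambda ARN and the state names are hypothetical placeholders, not values from this article.

```python
import json

# Hypothetical Lambda ARN, used only for illustration.
HELLO_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:HelloFunction"


def build_definition(lambda_arn):
    """Return a minimal Amazon States Language definition as a dict."""
    return {
        "Comment": "Minimal example: run one task, then finish",
        "StartAt": "SayHello",  # name of the first state to run
        "States": {
            "SayHello": {
                "Type": "Task",
                "Resource": lambda_arn,  # the work this state performs
                "Next": "Done",
            },
            "Done": {"Type": "Succeed"},  # terminal state
        },
    }


# The JSON text is what you would supply as the state machine definition.
definition_json = json.dumps(build_definition(HELLO_LAMBDA_ARN), indent=2)
print(definition_json)
```

Building the definition in code rather than by hand makes it easier to reuse ARNs and to check the structure before creating the state machine.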
Within your state machine you define your states. Each state is a named step in the workflow: it receives input, performs work or makes a decision, and passes output to the next state. States can be used to begin, pause, branch, complete, or terminate work.
When configuring your states, there are clearly defined options you are restricted to. These include:
Choice state – used to branch execution (i.e. if/then)
Fail or Succeed state – used to stop an execution with the defined outcome
Pass state – used to pass its input through to its output, optionally injecting fixed data
Wait state – used to delay execution for a certain period or until a specific time
Parallel state – used to run parallel branches of execution
Task state – used to perform a unit of work, such as invoking a Lambda function or an activity
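The state types listed above can be combined in one definition. The sketch below is an assumed example (the order-approval flow and all state names are made up for illustration) that uses Choice, Wait, Pass, Succeed, and Fail states, plus a small helper that checks every transition points at a state that exists.

```python
from typing import List, Set

# Hypothetical workflow: approve or reject an order.
ORDER_FLOW = {
    "StartAt": "CheckApproval",
    "States": {
        "CheckApproval": {
            "Type": "Choice",  # if/then branching on the input
            "Choices": [
                {"Variable": "$.approved", "BooleanEquals": True, "Next": "CoolDown"}
            ],
            "Default": "Rejected",
        },
        "CoolDown": {"Type": "Wait", "Seconds": 30, "Next": "AddDefaults"},
        "AddDefaults": {
            "Type": "Pass",  # injects fixed data without doing any work
            "Result": {"priority": "normal"},
            "ResultPath": "$.defaults",
            "Next": "Approved",
        },
        "Approved": {"Type": "Succeed"},
        "Rejected": {
            "Type": "Fail",
            "Error": "NotApproved",
            "Cause": "Order was not approved",
        },
    },
}


def transition_targets(state: dict) -> List[str]:
    """Collect every state name this state can transition to."""
    targets = []
    if "Next" in state:
        targets.append(state["Next"])
    if "Default" in state:
        targets.append(state["Default"])
    for rule in state.get("Choices", []):
        targets.append(rule["Next"])
    return targets


def dangling_targets(definition: dict) -> Set[str]:
    """Return referenced state names that are not defined."""
    defined = set(definition["States"])
    referenced = {definition["StartAt"]}
    for state in definition["States"].values():
        referenced.update(transition_targets(state))
    return referenced - defined


print(dangling_targets(ORDER_FLOW))  # an empty set means every target exists
```

The helper only covers the fields used in this example; it is not a full validator for the language.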
All work in the state machine is done by tasks. A task performs work by using an activity or an AWS Lambda function, or passing parameters to the API actions of other services.
Like states, you also define tasks within your state machine. Tasks are the individual steps that you want to accomplish in your step function. When creating tasks, you can create one as a Lambda function or as an activity.
Lambda is a service that enables you to run small blocks of code on a serverless infrastructure.
It can be used to perform individual operations or as the backend for serverless applications. Lambda functions can be written in most languages and are charged on a pay-for-use basis.
Activities are bits of code or processes that are hosted or performed on non-Lambda services or resources. This can include processes that run longer than Lambda allows, or manual tasks. To use activities, a worker calls the GetActivityTask API to poll for work and then reports the result using SendTaskSuccess or SendTaskFailure.
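The polling cycle described above might be sketched as follows. This is a minimal illustration, not a production worker: `process_one_task`, the `handler` callback, and the worker name are hypothetical, and the client is passed in so the function works with any object exposing the `get_activity_task`, `send_task_success`, and `send_task_failure` methods of the boto3 Step Functions client.

```python
import json


def process_one_task(sfn_client, activity_arn, handler, worker_name="example-worker"):
    """Poll once for an activity task, run `handler` on it, report the result.

    Returns True if a task was processed, False if the poll returned no work.
    """
    task = sfn_client.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False  # the long poll ended without a task being scheduled

    try:
        result = handler(json.loads(task["input"]))
    except Exception as exc:
        # Report the failure so the state machine can retry or catch it.
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    return True


# Possible usage with boto3 (assumes credentials and an existing activity):
#   import boto3
#   sfn = boto3.client("stepfunctions")
#   while True:
#       process_one_task(sfn, ACTIVITY_ARN, my_handler)
```

Passing the client as an argument keeps the worker logic testable without network access.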
AWS Step Functions Features:
⦁ Built-in error handling – AWS Step Functions tracks the state of each step, so you can automatically retry failed or timed-out tasks, catch specific errors, and recover gracefully, whether the task takes seconds or months to complete.
⦁ Automatic Scaling – AWS Step Functions automatically scales the operations and underlying compute to run the steps of your application for you in response to changing workloads. Step Functions scales automatically to help ensure the performance of your application workflow remains consistently high as the frequency of requests increases.
⦁ Pay per use – With AWS Step Functions, you pay only for the transition from one step of your application workflow to the next, called a state transition. Billing is metered by state transition, regardless of how long each state persists (up to one year).
⦁ Execution event history – AWS Step Functions creates a detailed event log for every execution, so when things do go wrong, you can quickly identify not only where, but why. All of the execution history is available visually and programmatically to quickly troubleshoot and remediate failures.
⦁ High availability – AWS Step Functions has built-in fault tolerance. Step Functions maintains service capacity across multiple Availability Zones in each region to help protect application workflows against individual machine or data center facility failures. There are no maintenance windows or scheduled downtimes.
⦁ Administrative security – AWS Step Functions is integrated with AWS Identity and Access Management (IAM). IAM policies can be used to control access to the Step Functions APIs.
⦁ Parallelization: You can parallelize the work declaratively. A state machine can have a state that runs multiple branches in parallel, which can make the workflow complete faster.
⦁ High Execution Time: Step Functions has a maximum execution time of one year, so if some tasks in the workflow are long-running (more than 15 minutes, the Lambda limit), they can be run on ECS or EC2, or as an Activity hosted outside of AWS.
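To make the built-in error handling concrete, the sketch below shows a Task state with `Retry` and `Catch` fields from the Amazon States Language, along with a helper that computes the nominal wait before each retry. The ARN, state names, and chosen values are hypothetical. The helper assumes the documented defaults (interval 1 second, 3 attempts, backoff rate 2.0) and ignores optional settings such as a maximum delay or jitter.

```python
# Hypothetical Task state with retry and catch rules.
RESILIENT_TASK = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCard",
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,  # wait before the first retry
            "MaxAttempts": 3,      # number of retries, not counting the first try
            "BackoffRate": 2.0,    # multiplier applied to the wait each retry
        }
    ],
    "Catch": [
        # Once retries are exhausted, route any error to a fallback state.
        {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyFailure"}
    ],
    "Next": "Done",
}


def retry_delays(retrier):
    """Return the nominal wait in seconds before each retry attempt."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    return [interval * rate ** n for n in range(attempts)]


print(retry_delays(RESILIENT_TASK["Retry"][0]))
```

With the values above, the waits grow geometrically, which spaces out retries against a struggling downstream service.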
AWS Step Functions pricing:
The AWS free tier includes 4,000 AWS Step Functions state transitions per month. This part of the free tier doesn’t expire, so you can take advantage of it even if your AWS account isn’t brand new.
Beyond the free tier, Step Functions is priced at $0.025 per 1,000 state transitions.
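It may be hard to visualize exactly what that would mean for your monthly AWS bill. As a rough illustration, a workflow with 10 state transitions that runs 100,000 times in a month generates 1,000,000 transitions; after the 4,000 free transitions, the remaining 996,000 cost about $24.90.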
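To try other pricing scenarios, a small estimator can apply the figures quoted in this section. The function name and the workload numbers are illustrative assumptions; the free tier and the per-transition price are the ones stated above and may not match current AWS pricing.

```python
FREE_TRANSITIONS_PER_MONTH = 4_000   # free tier quoted in this section
PRICE_PER_1000_TRANSITIONS = 0.025   # USD, as quoted in this section


def monthly_cost(transitions_per_execution, executions_per_month):
    """Estimate the monthly Step Functions charge in USD."""
    total = transitions_per_execution * executions_per_month
    billable = max(0, total - FREE_TRANSITIONS_PER_MONTH)
    return round(billable / 1000 * PRICE_PER_1000_TRANSITIONS, 2)


# Hypothetical workloads:
print(monthly_cost(4, 1_000))     # small workflow that fits in the free tier
print(monthly_cost(10, 100_000))  # 1,000,000 transitions per month
```

Because billing is per transition, reducing the number of states per execution lowers cost more directly than shortening how long each state runs.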
Drawbacks of AWS Step Functions:
⦁ Vendor lock-in: AWS Step Functions is proprietary and can only be used on AWS. If you later decide to migrate to a different cloud vendor, you will need to redesign the orchestration layer or replace it altogether with an alternative offered by the new vendor.
⦁ Complex syntax: The Amazon States Language, which is used to configure step functions, is highly complex. The syntax of this language is based on JSON. This means the language is ideal for machine readability, not for humans. Learning this language can be challenging, and you can only use it for AWS Step Functions, as it is proprietary to AWS.
⦁ Shorter Execution History: The maximum limit for keeping execution history logs is 90 days. It cannot be extended and that may preclude the use of Step Functions for businesses that have longer retention requirements.
⦁ Missing Triggers: Some Event Sources and Triggers are still missing, such as DynamoDB and Kinesis.
⦁ State machine execution name: Each execution name for a state machine has to be unique (not used in the last 90 days). This can be tricky when names are generated programmatically, because a reused name will not start a fresh execution.
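One way to sidestep the unique-name drawback above is to generate names rather than choose them. The sketch below is an assumed approach, not an AWS-provided helper: it combines a readable prefix, a UTC timestamp, and a random suffix. The 80-character cap and the restricted character set are assumptions based on the API's naming constraints, so check them against the current documentation.

```python
import re
import uuid
from datetime import datetime, timezone

MAX_NAME_LENGTH = 80  # assumed limit on execution names


def unique_execution_name(prefix):
    """Build an execution name that is very unlikely to repeat."""
    # Replace characters outside a conservative safe set.
    safe_prefix = re.sub(r"[^A-Za-z0-9_-]", "-", prefix)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]  # random part guards against same-second clashes
    tail = f"-{stamp}-{suffix}"
    # Trim the prefix so the whole name stays within the length limit.
    return safe_prefix[: MAX_NAME_LENGTH - len(tail)] + tail


print(unique_execution_name("nightly report/2024"))
```

The resulting string could then be passed as the `name` argument when starting an execution.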
AWS Step Functions Limits:
In addition to the above drawbacks, the Step Functions service has several built-in service limits you should be aware of:
A maximum of 25,000 events in the execution history per execution:
This limitation does not present an issue for the majority of use cases. You can perform long running executions with a higher number of state transitions, by splitting the workflow into multiple workflows that do not exceed the 25,000 limit.
1MB maximum request size:
A request made to AWS Step Functions cannot carry a payload that is larger than 1MB. You can use larger files – if you store the files on Amazon S3 and use S3 URIs as inputs.
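The S3 workaround can be sketched as a helper that passes small payloads inline and replaces large ones with a pointer. Everything here is illustrative: `build_execution_input`, the `payload_s3_uri` key, and the `store` callback (which would upload the body and return its S3 URI) are hypothetical names. The threshold is a parameter because the applicable quota may be lower than the request-size figure quoted above.

```python
import json

DEFAULT_MAX_INLINE_BYTES = 1_000_000  # based on the 1MB figure in this section


def build_execution_input(payload, store, max_inline_bytes=DEFAULT_MAX_INLINE_BYTES):
    """Return the JSON input for an execution.

    Small payloads are passed inline. Large ones are handed to `store`,
    a caller-supplied function that saves the body (for example to S3)
    and returns its URI; only the URI is then passed to the workflow.
    """
    body = json.dumps(payload)
    if len(body.encode("utf-8")) <= max_inline_bytes:
        return body
    uri = store(body)
    return json.dumps({"payload_s3_uri": uri})
```

Tasks inside the workflow would then read the object from the URI themselves instead of receiving the data through Step Functions.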
Spikes in AWS API requests caused by a workflow:
A peak in API requests might get throttled. If some workflow components use the AWS API inefficiently, a sudden spike in requests could hit API rate limits. To avoid this issue, you can group requests made to the same service into a single API call. Alternatively, you can introduce delays between operations.
50 tags per resource:
Each Step Functions resource can have a maximum of 50 tags. If you need more tags, you’ll need to change your resource structure.
AWS Step Functions Alternatives:
Here are several alternatives to Step Functions within the AWS ecosystem:
Schedule AWS Lambda functions: You can run simple workflows (consisting mainly of one Lambda function) by incorporating the workflow logic into a Lambda function. You can then trigger the function by using an AWS Lambda schedule event.
Combine Lambda functions with other AWS services: Some AWS services can manage entire functional tasks, such as user authentication. You can leverage these services to achieve faster implementation times and lower costs.
Use queues for communication between services: For services that need to handle extremely high load, use queues to improve cross-service communication.
AWS Step Functions Best Practices
Use the following best practices to avoid common pitfalls with AWS Step Functions:
Resume process from fail state – in a workflow, we sometimes need to resume the process from the fail state as opposed to re-running it from the beginning. This isn’t provided as a built-in feature, but there is a workaround to achieve this.
Avoid infinite runs – a single execution has a maximum execution time of one year, but the “Continue as new Execution” pattern lets you start a new execution before terminating the one currently running. This opens up the possibility of a workflow running indefinitely by mistake. Monitoring execution metrics is a good way to identify and fix those mistakes.
Overcome the 25,000 event entries limit – you can implement a “Continue as new Execution” pattern, spinning up a new execution from an existing running execution. For example, if a long-running execution has 10 steps and you’re expecting to have 40,000 event entries in your execution history, start a new execution at step 5 and distribute entries between two executions.
Handle timeouts – by default, the Amazon States Language doesn’t set timeouts in state machine definitions. If a Lambda function or Activity has a problem and keeps running without responding to Step Functions, the execution can keep waiting for up to a year (the maximum execution time). To prevent this, set a timeout on the task using TimeoutSeconds.
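As a sketch of the timeout practice, the helper below adds `TimeoutSeconds` to a Task state definition. The function name, ARN, and values are hypothetical. The optional `HeartbeatSeconds` field is not discussed in this article; it is included as an assumption based on the Amazon States Language, which requires it to be smaller than the timeout.

```python
def with_timeouts(task_state, timeout_seconds, heartbeat_seconds=None):
    """Return a copy of a Task state with explicit timeout settings."""
    if task_state.get("Type") != "Task":
        raise ValueError("timeouts in this sketch apply to Task states only")
    updated = dict(task_state)  # leave the caller's dict unchanged
    updated["TimeoutSeconds"] = timeout_seconds
    if heartbeat_seconds is not None:
        if heartbeat_seconds >= timeout_seconds:
            raise ValueError("HeartbeatSeconds must be smaller than TimeoutSeconds")
        updated["HeartbeatSeconds"] = heartbeat_seconds
    return updated


# Hypothetical task: fail it if no result arrives within five minutes.
resize_task = with_timeouts(
    {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ResizeImage",
        "End": True,
    },
    timeout_seconds=300,
)
print(resize_task["TimeoutSeconds"])
```

When the timeout elapses, the state fails with a timeout error, which retry and catch rules can then handle.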