Part 2: Simulate Failures

In this part, you'll simulate failures to see how Temporal handles them. This demonstrates why Temporal is particularly useful for building reliable systems.

The key concept here is durable execution: your workflow's progress is saved after every step. When failures and crashes happen (network issues, bugs in your code, server restarts), Temporal resumes your workflow exactly where it stopped. No lost work, no restarting from the beginning.

What you'll accomplish:

Crash a server mid-transaction and see zero data loss
Inject bugs into code and fix them live

Difficulty: Intermediate

Ready to break some stuff? Let's go.

Experiment 1 of 2: Crash Recovery Test

Temporal is designed with failure in mind. You're about to simulate a server crash mid-transaction and watch the Workflow recover without losing progress.

The Challenge: Kill your Worker process while money is being transferred. In a system without durable execution, this could leave the transfer half-finished or lose data entirely.

What We're Testing

Worker → CRASH → Recovery → Success

Before You Start

Worker is currently stopped

You have terminals ready (Terminal 2 for Worker, Terminal 3 for Workflow)

Web UI is open at http://localhost:8233

What's happening behind the scenes?

The Temporal Server acts like a persistent state machine for your Workflow. When you kill the Worker, you're only killing the process that executes the code - but the Workflow state lives safely in Temporal's durable storage. When a new Worker starts, it picks up exactly where the previous one left off.

This is fundamentally different from traditional applications where process crashes mean lost work.
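One way to picture this is a toy model in which a plain list stands in for the durable state held by the Temporal Server. This is only a sketch of the idea, not how Temporal is implemented, and every name in it (`run_transfer`, `history`, `crash_after`) is hypothetical.

```python
"""Toy model of durable execution. This is NOT Temporal's implementation."""


class WorkerCrashed(Exception):
    """Raised to imitate a Worker process being killed."""


def run_transfer(history, executed, crash_after=None):
    """Run the transfer steps, skipping any step already in history."""
    steps = ["withdraw", "deposit"]
    for step in steps:
        if step in history:
            # Already recorded as completed: reuse the record, do not redo it.
            continue
        if crash_after is not None and len(history) >= crash_after:
            raise WorkerCrashed(f"killed before {step}")
        executed.append(step)  # the side effect happens here
        history.append(step)   # the completion is recorded durably
    return "Transfer complete"


history = []   # survives the crash (stands in for server-side state)
executed = []  # every side effect that actually ran, across all Workers

try:
    # The first Worker is killed after the withdrawal has been recorded.
    run_transfer(history, executed, crash_after=1)
except WorkerCrashed:
    pass

# A new Worker starts with the same history and only runs what is missing.
result = run_transfer(history, executed)
print(result, executed)
```

The point to notice is that the second run skips the withdrawal because its completion is already recorded, so each step's side effect happens once even though two Workers were involved.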

Instructions

Make sure your Worker is stopped before proceeding. If the Worker is running, press Ctrl+C to stop it.
Start the Worker in Terminal 2:
```
python run_worker.py
```
In Terminal 3, start the Workflow:
```
python run_workflow.py
```
Inspect the Workflow Execution in the Web UI. You can see that the Worker is executing the Workflow and its Activities.
Return to Terminal 2 and stop the Worker by pressing Ctrl+C.
Switch back to the Web UI and refresh the page. Your Workflow is still listed as "Running".

The Workflow is still in progress because the Temporal Server maintains the state of the Workflow, even when the Worker crashes.
Restart your Worker by switching back to Terminal 2 and running the Worker command:
```
python run_worker.py
```
Switch back to Terminal 3, where you ran python run_workflow.py. The program completes and prints the result message.

Worker Status: RUNNING
Workflow Status: COMPLETED
Transaction: SUCCESS

Mission Accomplished! You just simulated killing the Worker process and restarting it. The Workflow resumed where it left off without losing any application state.

Try This Challenge

Try killing the Worker at different points during execution. Start the Workflow, kill the Worker during the withdrawal, then restart it. Kill it during the deposit. Each time, notice how Temporal maintains perfect state consistency.

Check the Web UI while the Worker is down - you'll see the Workflow is still "Running" even though no code is executing.

Experiment 2 of 2: Live Bug Fixing

The Challenge: Inject a bug into your production code, watch Temporal retry automatically, then fix the bug while the Workflow is still running.

Live Debugging Flow

Bug → Retry → Fix → Success

Before You Start

Worker is stopped

Code editor open with activities.py

Ready to uncomment the failure line

Web UI open to watch the retries

What makes live debugging possible?

Traditional applications lose all context when they crash or fail. Temporal maintains the complete execution history and state of your Workflow in durable storage. This means you can:

Fix bugs in running code without losing progress
Deploy new versions while Workflows continue executing
Retry failed operations with updated logic
Maintain perfect audit trails of what happened and when

This is like having version control for your running application state.

Instructions

Make sure your Worker is stopped before proceeding.

Edit the activities.py file and uncomment the following line in the deposit method:

```
# Comment/uncomment the next line to simulate failures.
raise Exception("This deposit has failed.")
```

Save the file.
Switch back to Terminal 2 and start the Worker:
```
python run_worker.py
```
Switch to Terminal 3 and start the Workflow:
```
python run_workflow.py
```
Let the Workflow run for a little bit, then switch back to Terminal 2 to see the Worker output.

You'll see log output similar to this:

```
2024/02/12 10:59:09 Withdrawing $250 from account 85-150.
2024/02/12 10:59:09 Depositing $250 into account 43-812.
2024/02/12 10:59:09 ERROR Activity error. This deposit has failed.
2024/02/12 10:59:10 Depositing $250 into account 43-812.
2024/02/12 10:59:10 ERROR Activity error. This deposit has failed.
2024/02/12 10:59:12 Depositing $250 into account 43-812.
```

The Workflow keeps retrying using the RetryPolicy specified when the Workflow first executes the Activity.
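The gaps between attempts in the log above (about 1 second, then about 2 seconds) are what a capped exponential backoff would produce. The sketch below computes such a schedule. The concrete values are assumptions chosen to match the log, not the settings from workflows.py, and `retry_delays` with its parameter names is a helper written for this illustration.

```python
from datetime import timedelta


def retry_delays(initial_interval, backoff_coefficient, maximum_interval, retries):
    """Return the wait before each retry under capped exponential backoff."""
    delays = []
    delay = initial_interval
    for _ in range(retries):
        # Never wait longer than the cap, however many retries have happened.
        delays.append(min(delay, maximum_interval))
        delay = delay * backoff_coefficient
    return delays


# Assumed values for illustration only.
delays = retry_delays(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=10),
    retries=5,
)
print([d.total_seconds() for d in delays])  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

Each wait doubles until it reaches the cap, which keeps early retries quick while avoiding a tight retry loop against a dependency that stays down.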

Activity Status: RETRYING
Deposit Operation: FAILING
Workflow: ACTIVE

While the Activity continues to fail, switch back to the Web UI to see more information about the process. You can see the state, the number of attempts run, and the next scheduled run time.
Pretend that you found a fix for the issue. In the activities.py file, comment out the raise line in the deposit() method again so that the method reaches its return statement, then save your changes.

```
# BROKEN VERSION:
# raise Exception("This deposit has failed.")

# FIXED VERSION:
return "Deposited money into account"
```

Stop the running Worker with Ctrl+C, then start it again:
```
python run_worker.py
```
The Worker starts again. On the next scheduled attempt, it picks up the failing deposit and successfully executes the updated deposit() Activity method.
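The sequence you just walked through can be sketched as a toy model. The `state` dictionary stands in for what Temporal keeps in durable storage, and `buggy_deposit`, `fixed_deposit`, and `run_attempt` are hypothetical names for this illustration, not code from the project.

```python
"""Toy model of fixing an Activity between retries. Names are hypothetical."""


def buggy_deposit(amount):
    raise RuntimeError("This deposit has failed.")


def fixed_deposit(amount):
    return "Deposited money into account"


def run_attempt(state, deposit):
    """Make one attempt at the pending deposit and record the outcome."""
    state["attempt"] += 1
    try:
        state["result"] = deposit(state["amount"])
        state["status"] = "COMPLETED"
    except RuntimeError as error:
        state["last_failure"] = str(error)
        state["status"] = "RETRYING"


# Durable state: it outlives the Worker. The withdrawal is already recorded.
state = {"completed": ["withdraw"], "amount": 250, "attempt": 0, "status": "SCHEDULED"}

run_attempt(state, buggy_deposit)  # first Worker, buggy code
run_attempt(state, buggy_deposit)  # still failing, attempt count grows
run_attempt(state, fixed_deposit)  # restarted Worker, fixed code
print(state["status"], state["attempt"], state["completed"])
```

Only the deposit is attempted again after the restart. The recorded withdrawal and the attempt count carry over because they live in the stored state rather than in the Worker process.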

Switch back to Terminal 3 where your run_workflow.py program is running, and you'll see it complete:

```
Transfer complete.
Withdraw: {'amount': 250, 'receiver': '43-812', 'reference_id': '1f35f7c6-4376-4fb8-881a-569dfd64d472', 'sender': '85-150'}
Deposit: {'amount': 250, 'receiver': '43-812', 'reference_id': '1f35f7c6-4376-4fb8-881a-569dfd64d472', 'sender': '85-150'}
```

Visit the Web UI again, and you'll see the Workflow has completed successfully.

Mission Accomplished! You have just fixed a bug in a running application without losing the state of the Workflow or restarting the transaction!

Try This Challenge

Real-World Scenario: Try this advanced experiment:

Change the retry policy in workflows.py to only retry 1 time
Introduce a bug that triggers the refund logic
Watch the Web UI as Temporal automatically executes the compensating transaction

Question to consider: How would you handle this scenario in a traditional microservices architecture?

Summary: What You Accomplished

Congratulations! You've experienced firsthand why Temporal is a game-changer for reliable applications. Here's what you demonstrated:

What You Learned

Crash-Proof Execution

You killed a Worker mid-transaction and watched Temporal recover seamlessly. Without durable execution, an application would lose this work unless you built your own checkpointing and recovery logic.

Live Production Debugging

You fixed a bug in running code without losing any state. Most systems require you to restart everything, losing all progress and context.

Automatic Retry Management

Temporal handled retries intelligently based on your policy, without cluttering your business logic with error-handling code.

Complete Observability

The Web UI gave you full visibility into every step, retry attempt, and state transition. No more debugging mysterious failures.

Summary

Successfully recovered from a Worker crash

Fixed a bug in a running Workflow

Observed automatic retry behavior

Used the Web UI for debugging

Experienced zero data loss through failures

Advanced Challenges

Try these advanced scenarios:

Mission: Compensating Transactions

Modify the retry policy in workflows.py to only retry 1 time
Force the deposit to fail permanently
Watch the automatic refund execute

Mission objective: Prove that Temporal can handle complex business logic flows even when things go wrong.
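Before trying this mission, it may help to see the shape of a compensating transaction in isolation. The following is a minimal sketch in plain Python, assuming a deposit that always fails and a single allowed attempt. The functions, account balances, and return values are hypothetical and do not reproduce the project's workflows.py.

```python
"""Toy compensation flow. Names and balances are hypothetical."""


class ActivityFailed(Exception):
    """Stands in for an Activity that exhausted its retry attempts."""


def attempt(operation, maximum_attempts):
    """Call operation up to maximum_attempts times, then give up."""
    last_error = None
    for _ in range(maximum_attempts):
        try:
            return operation()
        except Exception as error:  # broad on purpose: any failure counts
            last_error = error
    raise ActivityFailed(str(last_error))


def transfer(balances, sender, receiver, amount, deposit, maximum_attempts=1):
    """Withdraw, then deposit; refund the sender if the deposit gives up."""
    balances[sender] -= amount
    try:
        attempt(lambda: deposit(balances, receiver, amount), maximum_attempts)
    except ActivityFailed:
        balances[sender] += amount  # compensating transaction (refund)
        return "refunded"
    return "completed"


def broken_deposit(balances, account, amount):
    raise RuntimeError("This deposit has failed.")


balances = {"85-150": 1000, "43-812": 0}
outcome = transfer(balances, "85-150", "43-812", 250, broken_deposit)
print(outcome, balances)
```

The refund runs only after the deposit has used up its attempts, so the sender's balance ends where it started instead of being left short by the withdrawal.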

Mission: Network Partition Simulation

Start a long-running Workflow
Disconnect your network (or pause the Temporal Server container)
Reconnect after 30 seconds

Mission objective: Demonstrate Temporal's resilience to network failures.

Knowledge Check

Test your understanding of what you just experienced:

Q: Why do we use a shared constant for the Task Queue name?

Answer: Because the Task Queue name connects your Workflow starter to your Worker. If they don't match exactly, your Worker will never see the Workflow tasks, and execution will stall indefinitely.

Real-world impact: This is like having the wrong radio frequency - your messages never get delivered.
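The effect of a mismatched name can be sketched with a toy in-memory queue. This is an illustration of the routing idea only: the constant's value, the `start_workflow` and `poll` helpers, and the task name are all assumptions, not the project's code or Temporal's API.

```python
"""Toy task-queue routing. Names are hypothetical, not the project's files."""
from collections import defaultdict

# Assumed value. Define the name once and import it everywhere it is needed.
TASK_QUEUE_NAME = "money-transfer"


def start_workflow(queues, task_queue, task):
    """Place a task on the named queue, as a Workflow starter would."""
    queues[task_queue].append(task)


def poll(queues, task_queue):
    """Return the next task on this queue, or None if nothing is waiting."""
    pending = queues[task_queue]
    return pending.pop(0) if pending else None


queues = defaultdict(list)
start_workflow(queues, TASK_QUEUE_NAME, "transfer-1")

typo_result = poll(queues, "money-transfers")  # hand-typed name with a typo
shared_result = poll(queues, TASK_QUEUE_NAME)  # same constant as the starter
print(typo_result, shared_result)
```

The poller with the mistyped name finds nothing and raises no error, which is why a shared constant is safer than repeating the string in two files.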

Q: What happens when you modify Activity code for a running Workflow?

Answer: You must restart the Worker to load the new code. The Workflow will continue from where it left off, but with your updated Activity logic.

Real-world impact: This enables hot-fixes in production without losing transaction state.

Continue Your Learning

Build from Scratch: Learn to create Temporal apps step by step
Take a Course: Comprehensive learning paths
Python SDK Guide: Complete developer documentation
