# Incident Response

**Outcome:** Production incident diagnosed, resolved within SLA, and documented to prevent recurrence

**Trigger:** Error alert in Slack, client report, or issue detected during monitoring

**Duration:** Variable (15 min to several hours depending on severity)

***

### Quick Reference

| Step | Action                            | Time      |
| ---- | --------------------------------- | --------- |
| 1    | Acknowledge and classify severity | 2 min     |
| 2    | Diagnose root cause               | 5-30 min  |
| 3    | Resolve or apply workaround       | 15-60 min |
| 4    | Verify fix                        | 5-10 min  |
| 5    | Notify client (if needed)         | 5 min     |
| 6    | Document the incident             | 10 min    |

***

### Severity Classification

| Severity          | Definition                                                              | Response Time   | Resolution Target | Client Notification    |
| ----------------- | ----------------------------------------------------------------------- | --------------- | ----------------- | ---------------------- |
| **P1 - Critical** | System down, data loss risk, or client-facing service completely broken | Within 1 hour   | 4 hours           | Immediately            |
| **P2 - High**     | Major feature broken, workaround exists but is painful                  | Within 4 hours  | 24 hours          | Same day               |
| **P3 - Medium**   | Minor issue, workaround exists, limited impact                          | Within 24 hours | This week         | Only if client noticed |
| **P4 - Low**      | Cosmetic, no functional impact                                          | Within 48 hours | Scheduled         | No                     |

**Examples by severity:**

| P1                                 | P2                                       | P3                                          | P4                                    |
| ---------------------------------- | ---------------------------------------- | ------------------------------------------- | ------------------------------------- |
| All workflows stopped              | One workflow failing consistently        | Intermittent timeout on one endpoint        | Log formatting issue                  |
| Data being written to wrong system | OAuth token expired, no data flowing     | Duplicate notification sent once            | Slow response time (still within SLA) |
| Credential leak detected           | Missing data in output (partial failure) | Error trigger firing for non-critical issue | Outdated comment in workflow          |
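The response and resolution targets in the severity table can be turned into concrete deadlines. The sketch below is illustrative only: it assumes the clock starts at detection time (the table does not say), and the names `SeverityPolicy`, `POLICIES`, and `deadlines` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    response_within: timedelta
    resolution_within: Optional[timedelta]  # None where the table gives no hour count
    client_notification: str


# Values copied from the severity table. "This week" and "Scheduled" have no
# fixed duration in the table, so they are left as None rather than guessed.
POLICIES = {
    "P1": SeverityPolicy(timedelta(hours=1), timedelta(hours=4), "Immediately"),
    "P2": SeverityPolicy(timedelta(hours=4), timedelta(hours=24), "Same day"),
    "P3": SeverityPolicy(timedelta(hours=24), None, "Only if client noticed"),
    "P4": SeverityPolicy(timedelta(hours=48), None, "No"),
}


def deadlines(severity: str, detected_at: datetime) -> Tuple[datetime, Optional[datetime]]:
    """Return (respond_by, resolve_by), assuming the clock starts at detection."""
    policy = POLICIES[severity]
    respond_by = detected_at + policy.response_within
    resolve_by = (
        detected_at + policy.resolution_within
        if policy.resolution_within is not None
        else None
    )
    return respond_by, resolve_by
```

For example, `deadlines("P1", datetime(2024, 1, 1, 9, 0))` gives a response deadline of 10:00 and a resolution deadline of 13:00 on the same day.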

***

### Full Guide

#### Step 1: Acknowledge and Classify (2 min)

When an incident is detected:

1. **Check the source** -- Slack error channel, client email, or monitoring alert
2. **Classify severity** using the table above
3. **If P1/P2:** Stop all other work immediately

**For Slack alerts:**

* Error messages include: error text, failing node name, timestamp, and execution URL
* Click the execution URL to see the full error context in n8n

#### Step 2: Diagnose Root Cause (5-30 min)

**Diagnostic Checklist:**

* [ ] Check n8n execution history -- what's the error message?
* [ ] Check Modal endpoint health -- `curl https://the-entourage-ai--{app-name}-web-app.modal.run/health`
* [ ] Check credential status -- has an OAuth token expired?
* [ ] Check external service status -- is the third-party API down?
* [ ] Check for recent changes -- was a deployment made? Did the client change something?
* [ ] Check error patterns -- is this a one-off or recurring?
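The Modal health check from the checklist can also be scripted. This is a minimal sketch using only the standard library; it assumes the `/health` endpoint returns HTTP 200 when healthy, and the function names and the `opener` parameter (there so the call can be stubbed in tests) are hypothetical.

```python
import urllib.request

# URL pattern taken from the checklist above; {app_name} is filled in per app.
HEALTH_URL = "https://the-entourage-ai--{app_name}-web-app.modal.run/health"


def health_url(app_name: str) -> str:
    """Build the health-check URL for one Modal app."""
    return HEALTH_URL.format(app_name=app_name)


def is_healthy(app_name: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the endpoint answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with opener(health_url(app_name), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

A `False` result does not distinguish a cold start from a crash, so the Modal dashboard is still the place to confirm which one it is.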

**Common Root Causes:**

| Symptom               | Likely Cause                                  | Diagnostic Step                                                 |
| --------------------- | --------------------------------------------- | --------------------------------------------------------------- |
| 401/403 errors        | Expired or revoked credentials                | Check n8n credentials page, verify token validity               |
| Timeout errors        | External API slow or Modal cold start         | Check Modal dashboard for container restarts, test API directly |
| Data format errors    | Upstream system changed output format         | Compare current input against expected Pydantic model           |
| "Workflow not found"  | Workflow accidentally deactivated             | Check n8n workflow toggle status                                |
| Intermittent failures | Rate limiting or network issues               | Check error frequency pattern, add retry logic if needed        |
| All workflows failing | Infrastructure issue (n8n, Modal, or network) | Check n8n health, Modal dashboard, DNS status                   |
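As a first-pass triage aid, a few rows of the table above can be encoded as a lookup. This is only a sketch: the matching rules (status codes and substrings) are assumptions about what the error text looks like, and `likely_cause` is a hypothetical helper, not part of n8n or Modal.

```python
from typing import Optional


def likely_cause(status_code: Optional[int] = None, message: str = "") -> Optional[str]:
    """Map an error's HTTP status and message text to a likely cause from the table."""
    text = message.lower()
    if status_code in (401, 403):
        return "Expired or revoked credentials"
    if "workflow not found" in text:
        return "Workflow accidentally deactivated"
    if "timeout" in text or "timed out" in text:
        return "External API slow or Modal cold start"
    # No match: fall back to the full diagnostic checklist.
    return None
```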

#### Step 3: Resolve or Apply Workaround (15-60 min)

**Resolution paths by cause:**

| Cause                  | Resolution                                                                            |
| ---------------------- | ------------------------------------------------------------------------------------- |
| Expired OAuth token    | Re-authorize in n8n. If client re-auth needed, contact client with clear instructions |
| Expired API key        | Regenerate from 1Password, update in n8n                                              |
| External API changed   | Update endpoint code, redeploy: `modal app stop {name}` then `modal deploy`           |
| Modal endpoint crashed | Check logs, fix code, redeploy with `modal deploy`                                    |
| Data format changed    | Update Pydantic models, add validation for new format, redeploy                       |
| Rate limiting          | Add exponential backoff, adjust concurrency settings                                  |
| n8n workflow error     | Fix node configuration, test with manual execution                                    |
| Infrastructure outage  | Wait for external service recovery, set up monitoring for status page                 |
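For the rate-limiting row, exponential backoff can be sketched as below. This is a generic illustration, not the project's implementation: `RateLimitError` is a hypothetical exception standing in for whatever the API client raises on HTTP 429, and the default delays are placeholders to tune per API.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical: raised by the caller's API client when it is rate limited."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so tests can pass a stub instead of actually waiting.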

**If the fix requires a code change:**

1. Make the change in `endpoints/` or `modal_apps/`
2. Run tests: `pytest clients/{client}/tests/ -v`
3. Deploy: `modal deploy clients/{client}/modal_apps/{name}_modal.py`
4. Verify the fix (Step 4)
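The test and deploy steps above can be chained so that a failing test run blocks the deploy. A minimal sketch, assuming it is run from the repository root; the commands are the ones listed above, while the function names and the `run` parameter are hypothetical.

```python
import subprocess


def build_commands(client: str, name: str):
    """Return the test and deploy commands from the steps above, in order."""
    return [
        ["pytest", f"clients/{client}/tests/", "-v"],
        ["modal", "deploy", f"clients/{client}/modal_apps/{name}_modal.py"],
    ]


def run_tests_then_deploy(client: str, name: str, run=subprocess.run) -> None:
    """Run the tests, then deploy only if they pass."""
    for command in build_commands(client, name):
        # check=True raises CalledProcessError on a non-zero exit code, so a
        # failing pytest run stops the loop before `modal deploy` is reached.
        run(command, check=True)
```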

**If a workaround is needed while investigating:**

* Temporarily disable the failing workflow so it stops writing bad data
* Notify the client that we're aware and working on it
* Set a specific timeline for the permanent fix

#### Step 4: Verify Fix (5-10 min)

* [ ] Trigger the workflow manually with test data
* [ ] Confirm successful execution in n8n
* [ ] Check output is correct in the destination system
* [ ] Monitor for 3-5 subsequent runs to confirm stability
* [ ] If P1: monitor for 1 hour after fix
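The "3-5 subsequent runs" check can be expressed as a small predicate over recent execution results. This sketch assumes you have already collected the run statuses as strings, oldest first, with `"success"` marking a good run; how they are fetched from n8n is left out, and `is_stable` is a hypothetical helper.

```python
from typing import Sequence


def is_stable(statuses: Sequence[str], required_successes: int = 3) -> bool:
    """True if the most recent `required_successes` runs all succeeded."""
    if required_successes < 1:
        raise ValueError("required_successes must be at least 1")
    recent = statuses[-required_successes:]
    # Fewer runs than required means there is not enough evidence yet.
    return len(recent) == required_successes and all(s == "success" for s in recent)
```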

#### Step 5: Notify Client (if needed)

**P1/P2 -- Notify immediately for P1, same day for P2:**

```
Subject: Resolved: {Brief Issue Description}

Hi {Name},

We detected an issue with your {Workflow Name} workflow and have resolved it.

WHAT HAPPENED
{1-2 sentence explanation in plain language}

IMPACT
{What was affected -- e.g., "3 invoices from this morning were not processed"}

RESOLUTION
{What we did to fix it}

DATA RECOVERY
{If applicable: "We've reprocessed the 3 missed invoices -- all are now in your system"}

PREVENTION
{What we're doing to prevent this from happening again}

We're monitoring closely and will confirm everything is running smoothly.

Cheers,
{Your Name}
AI Solutions Engineer
The Entourage AI
```

**P3 -- Mention in the next Friday report, or send a brief email if the client noticed.**

**P4 -- No client notification needed.**

#### Step 6: Document the Incident (10 min)

Create: `clients/{CODE}/context/operations/incident-{DATE}-{brief-desc}.md`

```markdown
# Incident: {Brief Description}
Date: {DATE}
Severity: {P1/P2/P3/P4}
Duration: {Time from detection to resolution}

## Timeline
- {HH:MM} - Error detected via {source}
- {HH:MM} - Diagnosis started
- {HH:MM} - Root cause identified: {cause}
- {HH:MM} - Fix deployed
- {HH:MM} - Verified resolved

## Root Cause
{Detailed explanation}

## Resolution
{What was done}

## Data Impact
{Any data affected -- missed runs, incorrect outputs, etc.}

## Prevention
{Steps taken to prevent recurrence}
- [ ] {Action item 1}
- [ ] {Action item 2}
```
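To keep incident filenames consistent, the path pattern above can be generated rather than typed by hand. A minimal sketch: it assumes `{DATE}` is an ISO date (`YYYY-MM-DD`) and that `{brief-desc}` is a lowercase hyphenated slug, neither of which is specified above; `incident_path` is a hypothetical helper.

```python
import re
from datetime import date
from pathlib import Path


def incident_path(code: str, day: date, description: str) -> Path:
    """Build clients/{CODE}/context/operations/incident-{DATE}-{brief-desc}.md."""
    # Collapse anything that is not a-z or 0-9 into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    filename = f"incident-{day.isoformat()}-{slug}.md"
    return Path("clients") / code / "context" / "operations" / filename
```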

**If this is a recurring error type:** Create or update a learning file in `learnings/ops/` (see [Ongoing Operations](https://internal-docs.theentourageai.com/ai-accelerator/ongoing-operations) for the learning file template).

***

### Billing for Incident Response

| Cause                                            | Billing                                      |
| ------------------------------------------------ | -------------------------------------------- |
| Bug in our code                                  | **Free** -- does not count toward allocation |
| External system change                           | Billable -- uses allocation or $250/hr       |
| Client-caused (changed settings, revoked access) | Billable -- uses allocation or $250/hr       |

**Always clarify the cause before discussing billing.** If it's our bug, fix it and move on. If it's external, explain the cause and get approval before using billable hours.

***

### Escalation

| Situation                                                | Escalation Path                                       |
| -------------------------------------------------------- | ----------------------------------------------------- |
| P1 not resolved within 4 hours                           | Escalate to Ben for client communication              |
| Data loss confirmed                                      | Notify Ben immediately                                |
| Third-party service outage (no ETA)                      | Notify client, monitor vendor status page             |
| Security incident (credential leak, unauthorized access) | Stop all workflows, notify Ben, begin security review |
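The rows that involve notifying Ben can be summarized as one check. This is an illustrative sketch only: it assumes the 4-hour P1 window is measured from detection time, and `should_notify_ben` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta

P1_ESCALATION_AFTER = timedelta(hours=4)


def should_notify_ben(
    severity: str,
    detected_at: datetime,
    now: datetime,
    resolved: bool = False,
    data_loss: bool = False,
    security_incident: bool = False,
) -> bool:
    """Return True when the escalation table calls for notifying Ben."""
    # Confirmed data loss and security incidents escalate regardless of severity.
    if data_loss or security_incident:
        return True
    # An unresolved P1 escalates once it is past the 4-hour window.
    return severity == "P1" and not resolved and now - detected_at > P1_ESCALATION_AFTER
```

Third-party outages are deliberately not covered here, since that row routes to the client and the vendor status page rather than to Ben.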

***

### Verify

* [ ] Incident severity classified correctly
* [ ] Root cause identified
* [ ] Fix deployed and verified
* [ ] Client notified (if P1/P2, or a P3 the client noticed)
* [ ] Incident documented
* [ ] Prevention steps identified
* [ ] Learning file created/updated (if recurring)
* [ ] Billing classification noted (free vs. billable)