Datto RMM: monitoring, alerting, and runbooks (from alert to remediation)

Note: Menu labels can vary slightly by Datto RMM version/tenant, but the overall flow (Sites → Devices → Policies → Monitors → Alerts → Automation) stays the same.

Goal

Set up monitoring that is consistent and actionable:

Detect (Monitoring): CPU/RAM/Disk, Windows services, critical events, availability.
Notify (Alerting): priorities, routing, anti-noise, escalation.
Remediate (Runbooks / Quick Jobs): standardized actions with traceability.

Prerequisites

An account with permissions for Sites, Policies, Monitors, Alerts, Automation.
At least one test device with the Datto RMM agent installed.
A naming convention (example):
- Monitors: MON-<TYPE>-<WHAT>-<SEVERITY>
- Policies: POL-<SITE>-<ROLE>
- Quick Jobs: QJ-<OS>-<ACTION>
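
A convention is easier to keep when it can be checked mechanically. The sketch below tests a name against the three example patterns. The function name `Test-RmmName`, the allowed characters, and the `WARN`/`CRIT` severity codes are assumptions made for illustration; Datto RMM does not enforce any of this.

```powershell
# Hypothetical helper: checks a name against the example convention above.
# Character classes and severity codes are assumptions; adapt them to your rules.
function Test-RmmName {
    param(
        [Parameter(Mandatory)] [ValidateSet('Monitor', 'Policy', 'QuickJob')] [string] $Kind,
        [Parameter(Mandatory)] [string] $Name
    )
    $patterns = @{
        Monitor  = '^MON-[A-Za-z0-9]+-[A-Za-z0-9_.]+-(WARN|CRIT)$'  # MON-<TYPE>-<WHAT>-<SEVERITY>
        Policy   = '^POL-[A-Za-z0-9]+-[A-Za-z0-9]+$'                # POL-<SITE>-<ROLE>
        QuickJob = '^QJ-[A-Za-z0-9]+-[A-Za-z0-9-]+$'                # QJ-<OS>-<ACTION>
    }
    # -cmatch is case-sensitive, so lowercase prefixes are rejected.
    return $Name -cmatch $patterns[$Kind]
}
```

For example, `Test-RmmName -Kind QuickJob -Name 'QJ-WIN-Restart-MSSQLSERVER'` passes, while a free-form name such as `disk-monitor` does not.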

Step 1 — Structure Sites & Devices

Open Sites in the left menu.
Ensure each customer/entity has a dedicated Site.
Open a pilot site.
Go to Devices and select a pilot workstation/server.
Verify baseline data: OS, last reboot, agent version, AV status, patch status.

Good practices

Separate Servers and Workstations using filters / groups.
Use UDFs (custom fields) for: criticality, owner, maintenance window, escalation contact.

Step 2 — Create a baseline Policy

The idea: one “foundation” policy per OS/role.

Go to Policies.
Click New Policy.
Name it (e.g.) POL-BASE-WIN10.
Configure:
- Patch Management: patch window + controlled reboot rules.
- Monitoring: attach core monitors (see Step 3).
- Automation: attach standard jobs (see Step 5).
Save.

Step 3 — Create Monitors (detection)

3.1 Disk space (capacity)

Why: avoid “full disk” incidents.
Suggested thresholds (adapt to your environment):
- Warning: < 15% free
- Critical: < 10% free

Implementation

Go to Monitors → New Monitor.
Choose Disk Usage (or equivalent).
Target: C: (and key volumes on servers).
Configure Warning/Critical thresholds.
Customize the alert message to include % free, GB free, device, site.
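
To make the threshold logic and the alert text concrete, here is a minimal sketch of the same decision in PowerShell. It is not how the built-in Datto RMM disk monitor works internally; `Get-DiskAlert` and its parameters are hypothetical names, and the defaults simply mirror the suggested 15% / 10% thresholds.

```powershell
# Sketch: classify a volume against the suggested thresholds and build the alert text.
# Function and parameter names are illustrative, not part of Datto RMM.
function Get-DiskAlert {
    param(
        [Parameter(Mandatory)] [string] $Device,
        [Parameter(Mandatory)] [string] $Site,
        [Parameter(Mandatory)] [string] $Drive,
        [Parameter(Mandatory)] [double] $SizeBytes,
        [Parameter(Mandatory)] [double] $FreeBytes,
        [double] $WarningPercent = 15,
        [double] $CriticalPercent = 10
    )
    $freePct = [math]::Round(($FreeBytes / $SizeBytes) * 100, 1)
    $freeGB  = [math]::Round($FreeBytes / 1GB, 1)

    # Check the stricter threshold first so Critical wins over Warning.
    if ($freePct -lt $CriticalPercent)    { $severity = 'Critical' }
    elseif ($freePct -lt $WarningPercent) { $severity = 'Warning' }
    else                                  { $severity = 'OK' }

    [pscustomobject]@{
        Severity    = $severity
        FreePercent = $freePct
        FreeGB      = $freeGB
        Message     = "[$severity] $Device ($Site) drive $Drive has $freePct% free ($freeGB GB)"
    }
}
```

On a Windows device the inputs could come from `Get-CimInstance Win32_LogicalDisk` (its `Size` and `FreeSpace` properties). Keeping the classification in a separate function lets you test thresholds without filling a real disk.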

3.2 Critical Windows services

Examples: Spooler (print server), MSSQLSERVER, W3SVC (IIS), LanmanServer.

New Monitor → type Service.
Service name: MSSQLSERVER.
Condition: Not running.
(Optional) attach remediation via Automation/Quick Job (see Step 5).
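
If you want to check several critical services in one pass (for example from a script-based monitor), the "not running" condition can be sketched as below. `Find-ServiceProblem` is a hypothetical helper; it expects objects with `Name` and `Status` properties, such as the output of `Get-Service`.

```powershell
# Sketch: report which required services are missing or not running.
# $Services is expected to look like Get-Service output (Name and Status properties).
function Find-ServiceProblem {
    param(
        [Parameter(Mandatory)] [string[]] $Required,
        [Parameter(Mandatory)] [AllowEmptyCollection()] [object[]] $Services
    )
    foreach ($name in $Required) {
        $svc = $Services | Where-Object { $_.Name -eq $name } | Select-Object -First 1
        if (-not $svc) {
            # The service is not installed on this device at all.
            [pscustomobject]@{ Name = $name; State = 'NotInstalled' }
        }
        elseif ("$($svc.Status)" -ne 'Running') {
            [pscustomobject]@{ Name = $name; State = "$($svc.Status)" }
        }
    }
}
```

On a Windows device, a call could look like `Find-ServiceProblem -Required 'MSSQLSERVER', 'W3SVC' -Services (Get-Service)`. An empty result means every required service is running.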

3.3 Patch compliance

Create/enable a monitor related to Patch Status / Reboot required.
Raise a Warning when approved patches are still pending and a Critical when they are overdue.
Pair this with a scheduled patch window and clear reboot rules.
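
The pending / overdue split can be expressed as a small rule. In this sketch, "overdue" means the oldest approved patch has been pending for more than 14 days; that number, the function name `Get-PatchSeverity`, and its parameters are assumptions to replace with your own policy.

```powershell
# Sketch: map patch state to a severity.
# The 14-day "overdue" limit is an assumed example, not a Datto RMM default.
function Get-PatchSeverity {
    param(
        [Parameter(Mandatory)] [int] $PendingApproved,   # number of approved patches not yet installed
        [int] $OldestPendingDays = 0,                    # age of the oldest pending approved patch
        [int] $OverdueAfterDays = 14
    )
    if ($PendingApproved -le 0) { return 'OK' }
    if ($OldestPendingDays -gt $OverdueAfterDays) { return 'Critical' }
    return 'Warning'
}
```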

Step 4 — Alert routing and noise control

4.1 Severity & ownership

In your monitor, define the severity (Warning vs Critical).
Route alerts by:
- Site (customer)
- Role (server vs workstation)
- Category (security vs availability)
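
One way to reason about routing before configuring it in the console is a "first match wins" table. The sketch below is only a model of that idea: the site name `ACME`, the team names, and `Resolve-AlertRoute` are all hypothetical.

```powershell
# Hypothetical routing table: rules are checked top to bottom, '*' matches anything.
$routes = @(
    @{ Site = '*';    Role = '*';      Category = 'Security';     Team = 'secops' }
    @{ Site = 'ACME'; Role = 'Server'; Category = '*';            Team = 'acme-servers' }
    @{ Site = '*';    Role = 'Server'; Category = 'Availability'; Team = 'noc' }
    @{ Site = '*';    Role = '*';      Category = '*';            Team = 'servicedesk' }
)

function Resolve-AlertRoute {
    param(
        [Parameter(Mandatory)] [string] $Site,
        [Parameter(Mandatory)] [string] $Role,
        [Parameter(Mandatory)] [string] $Category,
        [Parameter(Mandatory)] [object[]] $Rules
    )
    foreach ($rule in $Rules) {
        $siteOk     = $rule.Site     -in @('*', $Site)
        $roleOk     = $rule.Role     -in @('*', $Role)
        $categoryOk = $rule.Category -in @('*', $Category)
        if ($siteOk -and $roleOk -and $categoryOk) { return $rule.Team }
    }
}
```

Putting the catch-all rule last guarantees that every alert has an owner, and putting security first means a security alert is never swallowed by a site-specific rule.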

4.2 Reduce alert fatigue

Use at least 3 layers:

Deduplication / cool-down: do not open 20 identical alerts for the same disk.
Time windows: avoid alerts during maintenance.
Escalation: level 1 (N1) handles the alert first; page the level 2 (N2) on-call only if nobody acknowledges it within X minutes.
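
The three layers can be modelled as three small checks. This is a sketch of the logic only, assuming a 60-minute cool-down and a 30-minute escalation delay as placeholder values; the function names are hypothetical, and in practice you would configure the equivalent behaviour in the monitor and alert settings.

```powershell
# Layer 1: cool-down. Suppress an alert if the same key (device + monitor) fired recently.
# $State is a hashtable the caller keeps between calls: key -> time of last notification.
function Test-ShouldNotify {
    param([hashtable] $State, [string] $Key, [datetime] $Now, [int] $CoolDownMinutes = 60)
    if ($State.ContainsKey($Key) -and ($Now - $State[$Key]).TotalMinutes -lt $CoolDownMinutes) {
        return $false
    }
    $State[$Key] = $Now
    return $true
}

# Layer 2: time window. True while $Now falls inside the daily maintenance window.
function Test-InMaintenance {
    param([datetime] $Now, [timespan] $Start, [timespan] $End)
    $t = $Now.TimeOfDay
    if ($Start -le $End) { return ($t -ge $Start -and $t -lt $End) }
    return ($t -ge $Start -or $t -lt $End)   # window crosses midnight, e.g. 22:00 to 02:00
}

# Layer 3: escalation. N1 owns the alert until it has gone unacknowledged for too long.
function Get-EscalationTarget {
    param([datetime] $RaisedAt, [datetime] $Now, [bool] $Acknowledged, [int] $EscalateAfterMinutes = 30)
    if (-not $Acknowledged -and ($Now - $RaisedAt).TotalMinutes -ge $EscalateAfterMinutes) {
        return 'N2'
    }
    return 'N1'
}
```

Keeping the cool-down state keyed by device and monitor is what prevents 20 identical alerts for the same disk while still letting a different disk on the same device alert immediately.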

Step 5 — Runbooks / Quick Jobs (remediation)

A runbook should be safe, repeatable, and logged.

5.1 Typical runbooks

Restart a service: Restart-Service MSSQLSERVER
Clear temporary files (disk remediation)
Force a policy or agent task update
Trigger Windows Update scan / report

5.2 Example: restart a service (Windows)

Go to Automation (or Quick Jobs).
Create a new job QJ-WIN-Restart-MSSQLSERVER.
Use PowerShell (example):

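# Restart MSSQLSERVER and report the resulting state.
# Note: -Force also stops dependent services (e.g. SQL Server Agent); verify they are running afterwards.
Restart-Service -Name "MSSQLSERVER" -Force -ErrorAction Stop
Start-Sleep -Seconds 10
$svc = Get-Service -Name "MSSQLSERVER"
$svc | Select-Object Status, Name
# A non-zero exit code marks the job as failed if the service did not come back.
if ($svc.Status -ne 'Running') { exit 1 }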

Configure logging/output capture.
Scope it to a test device first.
Attach the job as an auto-remediation for the service monitor.
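
A fixed 10-second sleep can be too short for a busy SQL Server and needlessly long for a small service. As an alternative, the sketch below polls until the service reaches the desired state or a timeout expires. `Wait-ServiceState` is a hypothetical helper, and the 60-second timeout is an assumed default; the state is read through a script block so the logic can be tested without a real service.

```powershell
# Sketch: poll a state-returning script block until it reports the desired state or times out.
function Wait-ServiceState {
    param(
        [Parameter(Mandatory)] [scriptblock] $GetState,  # e.g. { (Get-Service -Name 'MSSQLSERVER').Status }
        [string] $Desired = 'Running',
        [int] $TimeoutSeconds = 60,
        [int] $PollSeconds = 2
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        # Cast to string so both enum values and plain strings compare the same way.
        $state = "$(& $GetState)"
        if ($state -eq $Desired) { return $true }
        Start-Sleep -Seconds $PollSeconds
    } while ((Get-Date) -lt $deadline)
    return $false
}
```

In the job, this would follow the restart: call `Wait-ServiceState -GetState { (Get-Service -Name 'MSSQLSERVER').Status }` and `exit 1` when it returns `$false`, so a failed remediation shows up as a failed job rather than a silent success.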

Step 6 — Validation checklist

For each monitor/runbook, validate:

The monitor triggers as expected (stop the service on a test device, or temporarily raise the disk threshold so it fires).
The alert reaches the right channel/team.
The runbook executes and logs output.
The incident is closed with traceability (what ran, when, result).
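
For the traceability point, it helps if every runbook emits one structured line that answers "what ran, when, result". The sketch below shows one possible shape; the field names and `New-RunbookRecord` are assumptions, not a Datto RMM format.

```powershell
# Sketch: build a one-line JSON record of a runbook execution.
# Field names are illustrative; pick whatever your ticketing or log tooling expects.
function New-RunbookRecord {
    param(
        [Parameter(Mandatory)] [string] $Job,
        [Parameter(Mandatory)] [string] $Device,
        [Parameter(Mandatory)] [int] $ExitCode,
        [string] $Output = '',
        [datetime] $StartedAt = (Get-Date)
    )
    [pscustomobject]@{
        job       = $Job
        device    = $Device
        startedAt = $StartedAt.ToUniversalTime().ToString('o')   # ISO 8601, UTC
        result    = if ($ExitCode -eq 0) { 'success' } else { 'failed' }
        exitCode  = $ExitCode
        output    = $Output
    } | ConvertTo-Json -Compress
}
```

Writing this record to the job's standard output means it is captured alongside the rest of the job output and can be pasted into the incident when it is closed.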

Step 7 — Documentation (runbooks library)

Keep a short “operator-friendly” doc per runbook:

Goal, prerequisites, safety checks
How to run manually
Expected output / rollback
When to escalate
