The problem with undocumented IT procedures
Every IT support team handles the same incidents repeatedly. Password resets. Connectivity issues. Server backup failures. Cloud service outages. These are not rare edge cases. They are the daily rhythm of IT support, and they are predictable enough to document properly.
A documented approach to recurring IT incidents reduces resolution time, limits human error, and creates consistency across your support operation. Whether you are managing IT for a small business or handling support for multiple clients, a structured collection of documented procedures makes a measurable difference to how efficiently your team operates.
This guide covers what an effective IT support runbook library looks like in practice, the categories worth prioritising, how to maintain your documentation over time, and the tools that make it manageable without adding unnecessary overhead.
What is an IT support runbook library
A runbook library is a centralised collection of step-by-step procedures for handling recurring IT incidents and service requests. Each runbook documents a specific procedure: what to check first, what commands to run, what outcomes to verify, and when to escalate.
The goal is not to document every possible scenario. It is to document the procedures performed frequently enough, or with enough complexity, that written steps genuinely improve consistency and reduce errors. A well-maintained runbook library is one of the most practical investments you can make in IT operational efficiency.
If your team is small, this documentation also serves as a training resource. New team members can work through runbooks independently rather than requiring constant supervision. This directly supports a structured approach to IT onboarding for new staff and contractors, giving them a reliable reference during their first weeks.
What makes a runbook effective
Five characteristics of a good runbook
A runbook that nobody follows is not a runbook. It is a document that pretends to be one. Effective runbooks share five characteristics that distinguish them from vague, outdated, or unusable documentation.
- Actionable: every step is a specific action, not a vague principle. "Check the account status" is not a step. "Navigate to the user blade in Azure AD and confirm the Account enabled checkbox is selected" is a step.
- Environment-accurate: screenshots, commands, file paths, and menu locations match the actual environment your team works in. Generic screenshots erode trust in the documentation.
- Atomic steps: each step does one thing. If a step has multiple conditions or branching logic, decompose it into separate steps with clear decision points.
- Verifiable: each step has an observable outcome that confirms success. If you cannot verify that a step worked, you cannot verify that the procedure is complete.
- Maintained: the runbook is reviewed and updated when the underlying system changes. A runbook that nobody updates becomes actively misleading.
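Two of these characteristics, actionable and verifiable, lend themselves to a rough automated check. The following is a minimal sketch, assuming runbook steps are stored as structured records. The `Step` class, the `lint_steps` function, and the list of vague opening verbs are hypothetical illustrations, not part of any particular documentation tool.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str             # the single, specific action to perform
    verification: str = ""  # the observable outcome that confirms success


# Hypothetical list of vague opening verbs; tune it to your own house style.
VAGUE_OPENERS = ("check", "ensure", "review", "handle")


def lint_steps(steps: list[Step]) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    for number, step in enumerate(steps, start=1):
        words = step.action.split()
        if not step.verification.strip():
            warnings.append(f"step {number}: no verifiable outcome")
        # A short step that opens with a vague verb is rarely actionable.
        if words and words[0].lower() in VAGUE_OPENERS and len(words) < 6:
            warnings.append(f"step {number}: action may be too vague")
    return warnings
```

A check like this only catches the obvious cases. Whether a step matches your real environment still needs a human reviewer.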
Runbook versus playbook versus knowledge base
These terms appear frequently in IT documentation discussions and are sometimes used interchangeably. They serve distinct purposes, and confusing them leads to documentation that does not do its job.
A runbook is a step-by-step procedural document for a specific operational task. It answers "how do I do X?" A playbook is a decision-driven document that guides escalation and response to an incident type. It answers "what do I do when Y happens?" A knowledge base article is explanatory content that helps understanding but is not necessarily a procedure. It answers "why does Z work this way?"
Keeping these three document types separate makes each of them more useful. Your runbooks should be lean procedural references. Your playbooks should guide decision-making. Your knowledge base should support learning and understanding.
Essential runbook categories for IT support
Not every procedure deserves equal documentation effort. The highest-value runbooks cover procedures that are frequent, risky if performed incorrectly, or complex enough that steps are easy to forget under pressure. The following categories represent the most impactful areas to document for most IT support operations.
Authentication and identity incidents
Password resets and account lockouts account for a significant share of level 1 support volume in most organisations. Automating what can be automated and documenting what cannot is the foundation of a runbook library.
Identity incidents also carry security implications. A runbook in this category should include clear escalation criteria for situations that might indicate compromised credentials rather than simple user error.
Runbook: Password Reset for Domain-Joined Workstation (Remote)
PURPOSE
Reset a domain user password when the user cannot reset it themselves via self-service password reset (SSPR).
SCOPE
Remote reset via Microsoft 365 / Azure AD hybrid environment.
PREREQUISITES
Support agent has password reset permissions assigned in Azure AD.
TIME TO COMPLETE
5 to 8 minutes.
STEPS
1. Verify caller identity:
- Confirm full name and employee ID from HR records.
- Use a personal identification question if your organisation has implemented one.
- Log the caller identity in the ticket before proceeding.
2. Check account status in Microsoft Entra ID (formerly Azure AD):
- Navigate to the Entra admin centre, locate the user, and confirm the account exists.
- Note the User Principal Name (UPN).
- Confirm the account status shows as enabled.
3. Assess lockout status:
- If the account is locked (not just expired): check the last sign-in and reason for lockout.
- If the lockout appears anomalous with no recent legitimate sign-in: escalate to the security team before resetting.
- If the lockout is legitimate (user confirms they forgot their password): proceed to step 4.
4. Reset the password:
- Click Reset password in the Azure AD user blade.
- Leave "Require this user to change password on next sign-in" selected. Clear it only for managed service accounts or where the user explicitly cannot change their own password, and only after coordinating with their manager.
- Generate a temporary password following your company password policy format.
5. Communicate the password securely:
- Do not send the password via email or chat.
- Call the user directly on their registered mobile number.
- Read the temporary password aloud and confirm they can log in.
6. Verify successful login:
- Ask the user to attempt login at the Microsoft portal.
- Confirm they are prompted to change their password.
- Document the new password expiry date in the ticket.
7. Update the ticket:
- Resolution: password reset completed.
- Note: user confirmed successful login.
- Note: expiry policy applied.
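The decision point in step 3 is the part of this runbook most likely to be skipped under time pressure, so it can help to write it out as explicit logic. The following is a minimal sketch of that decision only. The function name, the parameter names, and the fail-safe default for cases the runbook does not cover are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    ESCALATE_TO_SECURITY = "escalate to the security team before resetting"
    PROCEED_TO_RESET = "proceed to step 4"


def triage_lockout(locked: bool,
                   looks_anomalous: bool,
                   recent_legitimate_sign_in: bool,
                   user_confirms_forgotten_password: bool) -> Action:
    """Mirror the step 3 decision: reset now, or involve security first."""
    if not locked:
        # Step 3 only applies to locked accounts; otherwise carry on.
        return Action.PROCEED_TO_RESET
    if looks_anomalous and not recent_legitimate_sign_in:
        return Action.ESCALATE_TO_SECURITY
    if user_confirms_forgotten_password:
        return Action.PROCEED_TO_RESET
    # Assumption: anything the runbook does not cover is escalated, not guessed.
    return Action.ESCALATE_TO_SECURITY
```

Defaulting to escalation for the uncovered case is a design choice. It costs the security team some time but avoids resetting a password on an account that may be compromised.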
Network and connectivity incidents
Remote work has increased the frequency of network connectivity troubleshooting. When a user works from home, the possible root causes extend beyond the office network to include home routers, ISP issues, and VPN configuration problems. A structured runbook prevents technicians from chasing the wrong problem.
Runbook: No Internet Connectivity (Remote Troubleshooting)
PURPOSE
Diagnose and resolve internet connectivity loss for a remote (WFH) user.
SCOPE
Remote troubleshooting via Teams or phone for a WFH Windows workstation.
PREREQUISITES
User has remote access software installed and functioning.
TIME TO COMPLETE
10 to 20 minutes depending on root cause.
STEPS
1. Confirm the scope of the issue:
- Ask the user if they can access any website in their browser.
- Ask if they can access internal resources such as VPN or shared drives.
- This determines whether the issue is full internet loss or partial (VPN-specific).
2. Check local network status:
- Guide the user to open Network and Internet settings from the system tray.
- Note the connection type: Wi-Fi or Ethernet.
- Check if the status shows "No internet" or "Connected, no internet".
3. Flush DNS and reset the network stack:
- Guide the user to open Command Prompt as Administrator.
- Run the following commands in sequence:
ipconfig /flushdns
ipconfig /registerdns
ipconfig /release
ipconfig /renew
netsh winsock reset
- Ask the user to restart the computer and reconnect.
4. Test with alternative DNS:
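- After restart, run the following, replacing "Wi-Fi" with the adapter name noted in step 2 if the user is on Ethernet: netsh interface ip set dns name="Wi-Fi" static 8.8.8.8
- Ask the user to test with: ping google.com
- If the ping succeeds: DNS was the issue. Consider setting static DNS to 8.8.8.8 and 8.8.4.4, but confirm first that internal names still resolve over the VPN, because public resolvers cannot answer for internal-only hostnames.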
- If the ping fails: escalate to your network team for ISP-level diagnostics.
ESCALATION CRITERIA
- Issue persists after Steps 3 and 4: escalate to the network team.
- User reports VPN is also down alongside full connectivity loss: escalate immediately.
- User is on the corporate LAN (not working from home): escalate immediately with topology details.
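The DNS command in step 4 depends on the adapter name recorded in step 2, which is easy to mistype when reading it out over the phone. The following is a minimal sketch that builds the command strings for a given adapter so the technician can paste them into chat or the ticket. The function names are hypothetical. The secondary-server and revert-to-DHCP command forms are not part of the runbook above, so treat them as assumptions to confirm on a test machine before relying on them.

```python
def _validate(interface: str) -> None:
    """Reject names that would produce a malformed netsh command."""
    if not interface or '"' in interface:
        raise ValueError("interface name must be non-empty and contain no quotes")


def dns_commands(interface: str,
                 primary: str = "8.8.8.8",
                 secondary: str = "8.8.4.4") -> list[str]:
    """Build the netsh commands that point one adapter at alternative DNS."""
    _validate(interface)
    return [
        f'netsh interface ip set dns name="{interface}" static {primary}',
        f'netsh interface ip add dns name="{interface}" addr={secondary} index=2',
    ]


def revert_command(interface: str) -> str:
    """Build the command that hands DNS back to DHCP once testing is done."""
    _validate(interface)
    return f'netsh interface ip set dns name="{interface}" source=dhcp'


if __name__ == "__main__":
    for line in dns_commands("Ethernet") + [revert_command("Ethernet")]:
        print(line)
```

Including the revert command in the same output makes it less likely that a test setting is left in place by accident.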
Endpoint malware and suspicious activity
This category carries the highest risk if handled incorrectly. A runbook for suspected malware must include clear escalation criteria and must make clear what your support team should not attempt without specialist involvement. Attempting manual malware removal on a domain-joined workstation can spread infection or destroy forensic evidence.
Runbook: Endpoint Suspicious Activity (Malware Suspected)
PURPOSE
Contain and investigate suspected malware on a managed endpoint.
SCOPE
Managed Windows endpoint with active Defender or a third-party EDR solution.
PREREQUISITES
Support engineer has endpoint management access and EDR console access.
WARNING
Do not attempt manual malware removal on a domain-joined workstation. Malware that has achieved local admin or domain credentials requires specialist handling.
STEPS
1. Verify the alert:
- Log into the EDR console (Microsoft Defender, CrowdStrike, SentinelOne, or equivalent).
- Locate the alert for the affected machine.
- Review the alert details including MITRE ATT&CK technique, process tree, and affected user.
- Determine whether the alert is confirmed malicious or a false positive.
2. Initial containment:
- If the alert is confirmed malicious: isolate the machine immediately from the EDR console.
- Do not power off the machine. Keeping it running preserves the contents of memory for forensic analysis.
- Document: machine name, username, alert ID, time of isolation, and analyst name.
3. Gather initial evidence:
- Capture running processes: tasklist /v > "%USERPROFILE%\Desktop\processes.txt"
- Capture network connections (requires an elevated Command Prompt): netstat -anob > "%USERPROFILE%\Desktop\netstat.txt"
- Capture scheduled tasks: schtasks /query /fo LIST /v > "%USERPROFILE%\Desktop\schtasks.txt"
- Screenshot the EDR alert details page.
4. Engage specialist escalation:
- Contact your security team or managed security provider immediately.
- Provide: machine name, username, alert details, and evidence package.
- Do not attempt remediation without security team authorisation.
- If the machine is domain-joined: assume domain credentials may be compromised and escalate for credential reset for the affected user account.
5. Post-containment:
- Once the security team has engaged: follow their guidance.
- Document all actions taken in your ticketing system with timestamps.
- Prepare indicators of compromise (IOCs) for the security team review.
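The containment details listed in step 2 are easy to record incompletely during a live incident. The following is a minimal sketch of a record that flags blank fields and produces a consistent ticket note. The class name, field names, and note wording are hypothetical and would need adapting to your ticketing system.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class ContainmentRecord:
    machine_name: str
    username: str
    alert_id: str
    isolated_at: datetime  # should be timezone-aware
    analyst: str

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that were left blank."""
        return [
            f.name for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]

    def ticket_note(self) -> str:
        """Render the record as a single line, with the time shown in UTC."""
        stamp = self.isolated_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        return (f"Isolated {self.machine_name} (user {self.username}) at {stamp}; "
                f"alert {self.alert_id}; analyst {self.analyst}.")
```

Recording the isolation time in UTC is a design choice that avoids ambiguity when the security team works in a different time zone.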
Backup and recovery
Backups are only as reliable as your verification process. A runbook that documents how to confirm a backup completed successfully and how to test a restore is essential for any business that depends on its data. This is one of those areas where you do not want to discover a failure at the moment you need to recover.
Runbook: Verify Backup Completion and Test Restore
PURPOSE
Confirm nightly backup completed successfully and validate restore capability for a critical server.
SCOPE
Veeam Backup & Replication protecting a physical, VMware, or Hyper-V Windows server.
PREREQUISITES
Backup operator role in Veeam console; read access to the server.
FREQUENCY
Monthly or as specified in your backup SLA.
STEPS
1. Review backup job status:
- Open the Veeam Backup and Replication console.
- Navigate to the Jobs section and locate the relevant backup job.
- Confirm the last run status shows "Success".
- Note the number of VMs processed, data size, and duration.
- If the status shows "Failed" or "Warning": document the error code and proceed to the troubleshooting runbook.
2. Verify backup file integrity:
- Right-click the last successful backup point and select Properties.
- Confirm the "Data is valid" status is shown.
- Note the restore point age. It should be within your SLA window, typically within 24 hours.
3. Test file-level restore:
- Right-click the backup and select Restore Guest Files, then Microsoft Windows.
- Select the latest restore point.
- Browse to a non-critical folder such as C:\Windows\Temp.
- Restore 3 to 5 test files to an alternate location such as C:\RestoreTest.
- Verify the files are readable and intact.
- Delete the test files after verification.
4. Document results:
- Update your monitoring system with the backup health status.
- In your ticketing system: note that backup verification was completed and files verified intact.
- If any anomalies are found: open an incident ticket with your findings.
ESCALATION
If a backup has failed or the restore test failed: open a priority incident immediately and engage your backup vendor support.
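The status and restore point age checks in steps 1 and 2 reduce to a simple comparison. The following is a minimal sketch of that logic, assuming the job status and restore point time have already been read from the console by hand. The function name and return strings are hypothetical, and the 24-hour default reflects the typical SLA window mentioned above rather than any product setting.

```python
from datetime import datetime, timedelta


def backup_health(last_status: str,
                  restore_point_time: datetime,
                  now: datetime,
                  sla_window: timedelta = timedelta(hours=24)) -> str:
    """Classify a backup job from its last status and restore point age."""
    if last_status.strip().lower() != "success":
        # "Failed" or "Warning": record the error and move to troubleshooting.
        return f"troubleshoot: last job status was {last_status}"
    age = now - restore_point_time
    if age > sla_window:
        return f"escalate: restore point is {age} old, outside the SLA window"
    return "ok"
```

Passing `now` as a parameter rather than reading the clock inside the function keeps the check easy to test and to replay against historical data.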
SaaS and cloud service outages
When a cloud service you rely on experiences an outage, your users will contact your support team first. A runbook in this category helps your team respond quickly, communicate effectively, and evaluate workarounds without wasting time.
Runbook: Cloud Service Degradation Response (Microsoft 365 / Azure)
PURPOSE
Respond to a reported or confirmed outage or degradation of a cloud service.
SCOPE
Microsoft 365, Azure, or another cloud service used by the organisation.
PREREQUISITES
Service health portal access via the Microsoft admin centre.
STEPS
1. Confirm the issue:
- Check the Microsoft Service Health Dashboard in the admin centre.
- Look for active incidents for the affected service such as Teams, Exchange, or SharePoint.
- If no incident is listed: check the scope reported by users (one user, one team, everyone).
2. Check for upstream confirmation:
- Check the vendor's official status page for the affected service.
- Search for incident reports from other organisations using the same service.
- Check third-party outage monitoring sites for user-reported patterns.
3. Communicate with affected users:
- If a vendor outage is confirmed: notify affected users via an unaffected channel. If Teams is down, use email or SMS.
- Set expectations by referencing the vendor's incident communication for an estimated resolution time.
- Document the start time of the incident in your ticketing system.
4. Evaluate workarounds:
- If email is affected: activate secondary email routing if available.
- If Teams is affected: provide dial-in conference bridge alternatives if your organisation has them.
- Document the workaround in your all-user communication.
5. Monitor resolution:
- Refresh the Service Health Dashboard every 15 minutes during an active incident.
- Update affected users when the vendor marks the incident as resolved.
- Close the incident ticket with resolution details: vendor incident reference number, start time, resolution time, and any actions your team took.
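Step 3 depends on knowing in advance which communication channels survive which outage. The following is a minimal sketch of that lookup. The channel names and the service each one depends on are hypothetical examples, so replace them with the channels your organisation actually uses.

```python
from typing import Optional

# Hypothetical mapping of each channel to the service it depends on.
# None means the channel does not depend on any monitored cloud service.
CHANNEL_DEPENDS_ON: dict[str, Optional[str]] = {
    "Teams message": "Teams",
    "Email": "Exchange",
    "SMS": None,
}


def unaffected_channels(affected_services: set[str]) -> list[str]:
    """Return the channels that do not rely on any affected service."""
    return [
        channel for channel, service in CHANNEL_DEPENDS_ON.items()
        if service not in affected_services
    ]
```

Keeping the mapping next to the runbook means nobody has to work out the dependency chain in the middle of an outage.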
Maintaining your runbook library over time
Why runbooks decay
Runbooks decay. A runbook that is not reviewed when the underlying system changes becomes actively dangerous. It trains a technician to perform steps that no longer reflect reality. Without a maintenance discipline, runbooks provide false confidence and can lead to errors rather than preventing them.
The solution is not to write perfect runbooks. It is to build a process that keeps them current. Documentation written once and never touched is worse than no documentation at all, because it creates an illusion of process where none exists.
When to update a runbook
Certain events should trigger a runbook review within a defined timeframe. Building these triggers into your change management and incident review processes keeps your documentation alive.
- System change: any change to the system, whether a patch, upgrade, or configuration change, requires a review of all affected runbooks within 48 hours of the change going live.
- Incident review: if an incident was handled poorly or a runbook contributed to an error, the runbook must be reviewed and updated before the incident is formally closed.
- Scheduled review: all runbooks should be reviewed at minimum annually. High-risk procedures should be reviewed quarterly.
- Tool change: when a tool is replaced, the associated runbooks must be rewritten or retired. Do not leave orphaned references to tools that no longer exist in your environment.
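The time-based triggers above can be turned into a due date for each runbook. The following is a minimal sketch under stated assumptions: quarterly is approximated as 91 days and annually as 365 days, and the function and parameter names are hypothetical. Incident reviews and tool changes are event-driven, so they are left to your change management process.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed intervals: annual for standard runbooks, quarterly for high-risk ones.
REVIEW_INTERVAL = {
    "standard": timedelta(days=365),
    "high": timedelta(days=91),
}


def review_due(last_reviewed: datetime,
               risk: str = "standard",
               last_system_change: Optional[datetime] = None) -> datetime:
    """Return the date by which the runbook must next be reviewed."""
    due = last_reviewed + REVIEW_INTERVAL[risk]
    # A system change after the last review pulls the deadline in to 48 hours.
    if last_system_change is not None and last_system_change > last_reviewed:
        due = min(due, last_system_change + timedelta(hours=48))
    return due
```

Comparing the result against today's date gives a simple overdue list that can be reviewed as part of the routine operational cadence.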
Runbook ownership
Each runbook should have a named owner. This is typically the IT team member most familiar with the procedure. The owner's responsibility is to keep the runbook current and to be the escalation point when the procedure needs updating.
Without named ownership, runbooks become orphaned. They sit in a shared folder with no clear responsible person and gradually drift out of sync with reality. Assigning ownership creates accountability and ensures somebody actually cares whether the documentation is accurate.
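Orphaned runbooks are easy to detect if each runbook records its owner somewhere machine-readable. The following is a minimal sketch, assuming you can export a mapping of runbook titles to owner names and a list of current team members. The function name and data shapes are hypothetical.

```python
def orphaned(runbook_owners: dict[str, str], current_staff: set[str]) -> list[str]:
    """Return titles whose owner is blank or no longer on the team."""
    return sorted(
        title for title, owner in runbook_owners.items()
        if owner.strip() not in current_staff
    )
```

Running a check like this whenever someone joins or leaves the team catches ownership gaps before the documentation starts to drift.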
Tools for managing a runbook library
The tool you use matters less than the discipline you apply to maintaining the content. That said, some tools make it easier to keep runbooks current, searchable, and accessible to your team.
- IT Glue: purpose-built for IT documentation and runbooks, with strong integrations into professional services automation tools used by managed service providers.
- Confluence: well-suited for larger teams with strong search functionality, templates, and permission controls. Works well if your organisation already uses the Atlassian ecosystem.
- Notion: flexible and fast to use. Excellent for small teams, and the free tier is sufficient for smaller workspaces. The downside is that access controls are less granular than in enterprise tools.
- SharePoint or OneDrive: a practical option for organisations already using Microsoft 365. Requires discipline to maintain folder structure and naming conventions consistently.
- Plain text files or Markdown: viable for very small teams if naming conventions and folder structure are applied consistently. This approach requires the least overhead but the most discipline.
For smaller operations or solo practitioners, starting with a simple shared folder structure using clear naming conventions is often the right move. Overly complex tooling can become a barrier to actually writing and maintaining the documentation. As your operation grows, you can migrate to a purpose-built documentation platform.
Connecting runbook documentation to your broader IT operation
A runbook library does not exist in isolation. It is part of a larger system of IT documentation that includes network diagrams, asset registers, onboarding procedures, and knowledge base articles.
If you are building out your IT documentation practices, it is worth considering how runbooks fit alongside your IT documentation strategy. Documentation that nobody reads is documentation that has failed, regardless of how technically accurate it is. Writing documentation that people actually use requires thinking about format, access, and maintenance alongside content.
For IT support teams managing multiple clients, the discipline of maintaining a runbook library also connects to how you structure your IT support contracts. When procedures are documented, it is easier to define what is included in a support agreement, what response times are realistic, and where the boundary sits between your responsibilities and the client's.
Common mistakes when building a runbook library
Several patterns consistently undermine runbook initiatives. Avoiding these from the start will save significant frustration and prevent your documentation effort from stalling.
- Documenting too much too soon: teams that try to document every procedure end up with a large volume of low-quality runbooks that nobody maintains. Start with the five to ten procedures that consume the most support time or carry the most risk if performed incorrectly.
- No ownership assigned: without named owners, runbooks drift out of date. Every runbook needs a person responsible for its accuracy.
- Copying vendor documentation verbatim: generic runbooks that repeat what vendors already document are not useful. Your runbooks should cover your specific environment, your naming conventions, your escalation paths, and your organisation's particular quirks.
- Treating documentation as a one-time project: documentation is an ongoing operational requirement. If you treat it as a project with an end date, the maintenance will lapse the moment the project closes.
- No review process: without scheduled reviews, runbooks will not stay current. Build reviews into your routine operational cadence.
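Choosing the first five to ten procedures is easier with a rough ranking than with a debate. The following is a minimal sketch that scores candidates by monthly support time weighted by risk. The formula, the 1 to 5 risk scale, and all names are hypothetical, and the inputs should come from your own ticket data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tickets_per_month: int   # how often the procedure is performed
    minutes_per_ticket: int  # typical handling time
    risk: int                # 1 (low) to 5 (high) if performed incorrectly


def priority(candidate: Candidate) -> int:
    """Monthly support minutes weighted by risk (a hypothetical formula)."""
    return candidate.tickets_per_month * candidate.minutes_per_ticket * candidate.risk


def shortlist(candidates: list[Candidate], limit: int = 10) -> list[str]:
    """Return the names of the highest-priority procedures, best first."""
    ranked = sorted(candidates, key=priority, reverse=True)
    return [c.name for c in ranked[:limit]]
```

The exact weighting matters less than having an agreed, repeatable way to decide what gets documented first.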
What to do next
A runbook library is only as valuable as the discipline applied to maintaining it and the culture that treats it as an operational necessity rather than a compliance checkbox. The compound effect over 12 months is significant: faster resolution times, consistent quality regardless of which technician handles the ticket, and a training resource that reduces ramp time for new team members.
An organisation with 20 well-maintained runbooks covering the right procedures will consistently outperform one with 200 outdated documents nobody trusts.
If you want help building or reviewing your IT documentation, you can get in touch with details of your current setup, the procedures that take most of your support time, and where you think documentation gaps are causing problems.