IT Operations Procedures Library

This section is visible to isms-it-staff and isms-security via Scroll Content Manager. It contains the operational procedures that IT Operations staff use to run, maintain, and evidence the infrastructure supporting the ISMS.

Library contents

OP-01 · Backup and Recovery
OP-02 · Certificate Management
OP-03 · Logging and SIEM Management
OP-04 · Infrastructure Change Management
OP-05 · BCM and DR Testing

OP-01 · Backup and Recovery

Purpose and scope

This procedure governs the backup, retention, and recovery of all CUI-scope data and systems. It implements NIST 800-171 control 3.8.9 (protect confidentiality of backup CUI), ISO 27001 Annex A 8.13 (information backup), and the organisation's business continuity obligations for CUI-scope systems.

Scope: all systems, applications, and data classified at Restricted or above. CUI-scope backup jobs are subject to the stricter requirements throughout this procedure. Non-CUI systems follow the same procedure but may use relaxed encryption and retention settings where noted.

Evidence generated: EV-D27 (daily backup logs), EV-D28 (quarterly restoration test records).

Backup architecture — the 3-2-1 standard

Three copies of all CUI data must exist at all times, held on at least two different storage types, with at least one copy offsite:

Copy 1 — Production (primary):
  Location: production storage (SAN / NAS / cloud-native storage)
  Encryption: at-rest encryption via KMS (AES-256, HSM-backed keys)
  Purpose: active working copy

Copy 2 — Local backup (secondary):
  Location: on-premises backup appliance in a different room from the 
            server — must NOT be in the same fire zone as production storage
  Encryption: AES-256, client-side before write — key in PAM vault
  Retention: 90 days rolling
  Purpose: fast local recovery for common failure scenarios

Copy 3 — Offsite backup (tertiary):
  Location: geographically separate cloud storage (UK or EEA region) 
            OR physical encrypted media in approved offsite vault
  Encryption: AES-256 client-side before upload — key NEVER held by 
              cloud provider; key in PAM vault (separate key from Copy 2)
  Retention: 12 months (full), extended to 7 years for archive tier
  Purpose: recovery from site-level disaster; satisfies DFARS 
           and DEFSTAN data preservation requirements

Verification:
  All three copies confirmed at daily job completion check (EV-D27)
  If Copy 3 is unavailable for >24 hours: escalate to IT Manager
  If Copy 3 is unavailable for >72 hours: escalate to CISO — 
    may require interim risk acceptance documentation
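
The escalation thresholds above can be expressed as a small check. This is a minimal sketch, not part of any backup product: the function name is hypothetical, and it assumes that "unavailable" is measured from the time of the last confirmed Copy 3 completion.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the Verification rules above.
IT_MANAGER_AFTER = timedelta(hours=24)
CISO_AFTER = timedelta(hours=72)


def offsite_escalation(last_good_copy: datetime, now: datetime) -> str:
    """Return who to escalate to, given the last confirmed Copy 3 time.

    Assumption: the outage is measured from the last confirmed offsite copy.
    """
    outage = now - last_good_copy
    if outage > CISO_AFTER:
        return "CISO"
    if outage > IT_MANAGER_AFTER:
        return "IT Manager"
    return "none"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(offsite_escalation(now - timedelta(hours=30), now))
```

The comparison is strict (>), matching the ">24 hours" and ">72 hours" wording, so an outage of exactly 24 hours does not yet escalate.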

Backup jobs — configuration and schedule

Production data backup jobs

JOB: CUI-FileServer-Daily
  Source: \\[fileserver]\CUI-Share (all subdirectories)
  Schedule: Daily 01:00 UTC
  Type: Incremental (Monday–Saturday), Full (Sunday)
  Retention:
    Daily incrementals: 90 days
    Weekly fulls: 12 months
    Monthly fulls (last Sunday of month): 7 years
  Encryption: AES-256, client-side
  Backup software: [product name — e.g. Veeam / Commvault / Azure Backup]
  Key: stored in PAM vault under safe "Backup-Keys-CUI"
  Target:
    Primary: [local backup appliance name/IP]
    Copy job: [cloud storage account / bucket — UK region]
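
The retention tiers for CUI-FileServer-Daily depend on which day the job ran. The sketch below shows one way to work out the tier for a given run date. It is illustrative only: the function names are hypothetical, and the backup product's own retention policy remains the authoritative control.

```python
import calendar
from datetime import date


def last_sunday_of_month(year: int, month: int) -> date:
    """Return the date of the last Sunday in the given month."""
    last_day = calendar.monthrange(year, month)[1]
    d = date(year, month, last_day)
    # date.weekday(): Monday=0 ... Sunday=6, so step back to the Sunday.
    return d.replace(day=last_day - ((d.weekday() + 1) % 7))


def retention_class(run_date: date) -> tuple[str, str]:
    """Map a run date to (backup type, retention) per the job definition."""
    if run_date.weekday() != 6:  # Monday to Saturday: incremental
        return ("daily incremental", "90 days")
    if run_date == last_sunday_of_month(run_date.year, run_date.month):
        return ("monthly full", "7 years")
    return ("weekly full", "12 months")


if __name__ == "__main__":
    print(retention_class(date(2025, 1, 26)))
```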

JOB: CUI-DatabaseServer-Daily
  Source: [database server] — all CUI-scope databases
  Schedule: Daily 00:30 UTC (before file server to avoid contention)
  Type: Full database backup + transaction log backup every 15 minutes
  Retention:
    Full backups: 90 days local, 12 months cloud
    Transaction logs: 30 days (enables point-in-time recovery)
  Encryption: database-native TDE on the local backup; backup-tool encryption
              (AES-256, client-side) applied before the cloud copy
  Target: same as CUI-FileServer-Daily
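
A 15-minute transaction log schedule only supports point-in-time recovery if no log backups are missing. The sketch below flags timing gaps in a list of log backup completion times. It is a rough proxy under stated assumptions: the function name and the 5-minute tolerance are hypothetical, and a true chain check compares log sequence numbers in the database platform, which this does not do.

```python
from datetime import datetime, timedelta


def log_chain_gaps(log_backup_times, interval=timedelta(minutes=15),
                   tolerance=timedelta(minutes=5)):
    """Return (previous, next) pairs where consecutive log backups are
    further apart than the scheduled interval plus a tolerance."""
    ordered = sorted(log_backup_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > interval + tolerance
    ]


if __name__ == "__main__":
    start = datetime(2025, 1, 10, 0, 0)
    times = [start + timedelta(minutes=15 * i) for i in (0, 1, 2, 6)]
    for earlier, later in log_chain_gaps(times):
        print(f"gap between {earlier} and {later}")
```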

JOB: CUI-VMs-Daily
  Source: all VMs in [hypervisor cluster] tagged CUI-scope
  Schedule: Daily 02:00 UTC
  Type: Changed block tracking incremental; weekly full image backup
  Retention: 90 days (image); weekly fulls 12 months
  Encryption: AES-256 client-side
  Target: same as above

JOB: SaaS-M365-Weekly
  Source: Exchange Online mailboxes, SharePoint sites, OneDrive 
          (CUI-labelled content only via label filter)
  Schedule: Weekly Sunday 03:00 UTC
  Type: Full export via approved M365 backup tool
  Retention: 12 months
  Encryption: client-side via backup tool before cloud storage
  Note: M365 native retention policies are NOT a substitute for 
        this backup job — they do not provide point-in-time 
        recovery or immutable protection
  Target: dedicated backup cloud account (separate from production M365)

System image and configuration backups

JOB: NetworkDeviceConfig-Daily
  Source: all CUI-scope firewalls, routers, switches
  Schedule: Daily 04:00 UTC
  Type: automated configuration export via vendor API
  Tool: [Oxidized / RANCID / vendor NMS — specify deployed product]
  Storage: Git repository in configuration management platform (GitLab/ADO)
           (configuration repo has its own backup to cloud storage)
  Diff alerts: SIEM alert if configuration diff detected between daily exports
               (unexpected change indicator)
  Retention: full git history (perpetual)

JOB: CloudInfraState-Daily
  Source: Terraform state files; AWS CloudFormation stacks; 
          Azure ARM deployment states
  Schedule: Daily 05:00 UTC
  Type: state file export to versioned S3 bucket / Azure Blob with 
        versioning enabled
  Retention: 12 months (via object versioning lifecycle policy)

JOB: PKI-CertAuthority-Weekly
  Source: internal CA database; issued certificate records; CRL
  Schedule: Weekly Sunday 05:30 UTC
  Type: full CA database export
  Encryption: AES-256 — key held by CISO only (separate from standard backup keys)
  Storage: encrypted backup appliance + separate encrypted media in CISO's safe
  Retention: permanent (CA compromise history is a legal record)
  Note: CA backup tested separately from standard backup restoration test

Backup monitoring — daily operations (generates EV-D27)

Morning checks — daily by IT Operations (08:30 each working day)

Step 1 — Log into backup management console
  URL/path: [backup console URL]
  Authentication: PAM-mediated (standard account → PAM → backup console admin)

Step 2 — Review overnight job results
  Navigate to: Jobs → Last 24 hours → Sort by status

  Expected status per job:
    CUI-FileServer-Daily: Success
    CUI-DatabaseServer-Daily: Success + check transaction log chain is intact
    CUI-VMs-Daily: Success
    NetworkDeviceConfig-Daily: Success (check git commits in config repo)

  Acceptable statuses:
    Success: all data backed up; no errors
    Success with warnings: job completed but non-critical warnings present
      → Read warning details; document in EV-D27 note field; 
        resolve warnings within 2 business days if recurring

  Unacceptable statuses — act immediately:
    Failed: job did not complete
    Missed: job did not start
    → See Backup Failure Response below

Step 3 — Verify offsite copy completion
  Navigate to: Replication jobs OR check cloud storage console
  Confirm: last successful copy to cloud within past 24 hours
  If last cloud copy >24 hours ago: escalate to IT Manager

Step 4 — Check storage utilisation
  Backup appliance: confirm <80% utilised
  Cloud storage: check growth trend
  At 80%: create ITSM ticket for storage expansion — complete within 5 days
  At 90%: emergency escalation — backup jobs at risk of failure

Step 5 — Document results in EV-D27
  EV-D27 is maintained as a running log. Daily entry format:

  Date: [YYYY-MM-DD]
  Engineer: [name]
  Job results:
    CUI-FileServer-Daily: [Success / Warning / Failed]
    CUI-DatabaseServer-Daily: [Success / Warning / Failed]
    CUI-VMs-Daily: [Success / Warning / Failed]
    NetworkDeviceConfig-Daily: [Success / Warning / Failed]
    SaaS-M365-Weekly: [Success / Warning / Failed / Not scheduled today]
  Offsite copy: Confirmed [YYYY-MM-DD HH:MM UTC]
  Storage utilisation: [appliance %] / [cloud %]
  Issues: [None / description of any issue and action taken]

  File: EV-D · BCM → Backup Logs → [current month]
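
Where the daily entry is produced by a script rather than typed by hand, the format above can be rendered as follows. This is a sketch only: the function and argument names are hypothetical, and it assumes job results are supplied as a mapping of job name to status.

```python
from datetime import date, datetime

DAILY_JOBS = [
    "CUI-FileServer-Daily",
    "CUI-DatabaseServer-Daily",
    "CUI-VMs-Daily",
    "NetworkDeviceConfig-Daily",
]
WEEKLY_JOBS = ["SaaS-M365-Weekly"]


def render_ev_d27(day: date, engineer: str, results: dict, offsite_utc: datetime,
                  appliance_pct: int, cloud_pct: int, issues: str = "None") -> str:
    """Render one EV-D27 daily entry in the format shown above."""
    missing = [job for job in DAILY_JOBS if job not in results]
    if missing:
        # A daily job with no recorded result must not be logged silently.
        raise ValueError(f"no result recorded for: {', '.join(missing)}")
    lines = [f"Date: {day.isoformat()}", f"Engineer: {engineer}", "Job results:"]
    for job in DAILY_JOBS:
        lines.append(f"  {job}: {results[job]}")
    for job in WEEKLY_JOBS:
        lines.append(f"  {job}: {results.get(job, 'Not scheduled today')}")
    lines.append(f"Offsite copy: Confirmed {offsite_utc:%Y-%m-%d %H:%M} UTC")
    lines.append(f"Storage utilisation: {appliance_pct}% / {cloud_pct}%")
    lines.append(f"Issues: {issues}")
    return "\n".join(lines)
```

Raising an error when a daily job has no result is a design choice in this sketch: it stops an incomplete entry from being filed as evidence.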

SIEM integration for backup monitoring

Configure the following alerts in the SIEM to provide continuous backup monitoring 
between morning checks:

Alert: Backup job failure
  Source: backup management console → syslog to SIEM
  Trigger: any job status = Failed or Missed
  Severity: High
  Response SLA: 2 hours (within business hours), 4 hours (out of hours)
  On-call: IT Operations on-call is notified for out-of-hours failures

Alert: Backup console authentication anomaly
  Source: backup console audit log → SIEM
  Trigger: login failure on backup console OR login from unexpected account
  Severity: High
  Rationale: backup infrastructure is a high-value target; 
             ransomware groups specifically target backup consoles

Alert: Cloud storage access from unexpected source
  Source: cloud storage access logs → SIEM
  Trigger: read or write operation on backup storage bucket/container 
           from any IP not in the approved backup agent IP list
  Severity: Critical
  Rationale: cloud backup bucket access by an unknown entity may indicate 
             ransomware exfiltration or attacker reconnaissance
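
The trigger logic for the cloud storage alert is an allowlist comparison. The sketch below shows that comparison in isolation. The address ranges are placeholders, not the real approved list, and the function name is hypothetical; in practice the rule is written in the SIEM's own query language.

```python
import ipaddress

# Placeholder ranges only; replace with the approved backup agent IP list.
APPROVED_BACKUP_AGENTS = ["10.20.30.0/28", "192.0.2.15/32"]


def is_unexpected_source(source_ip: str, approved=APPROVED_BACKUP_AGENTS) -> bool:
    """Return True when source_ip is outside every approved network."""
    addr = ipaddress.ip_address(source_ip)
    return not any(addr in ipaddress.ip_network(net) for net in approved)


if __name__ == "__main__":
    for ip in ("10.20.30.5", "203.0.113.9"):
        print(ip, "ALERT" if is_unexpected_source(ip) else "ok")
```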

Backup failure response procedure

When a CUI-scope backup job fails or is missed:

Severity assessment:
  Single job, single day, isolated failure → Significant (P3)
  Multiple jobs failing → Major (P2)
  Multiple jobs failing + no successful offsite copy for >24 hours → Critical (P1)
  Evidence of ransomware or attacker activity in backup infrastructure → Critical (P1)
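
The severity matrix above can be encoded so that alert handling and the EV-D27 log use the same classification. This is a hedged sketch with a hypothetical function name. It assumes the number of failed jobs and the hours since the last successful offsite copy are already known, and it treats any single-job failure as P3; a single job that fails repeatedly is not covered by the matrix and is left to the engineer's judgement.

```python
def backup_failure_severity(failed_jobs: int, hours_since_offsite_copy: float,
                            attack_evidence: bool = False) -> str:
    """Classify a backup failure as P1, P2, P3 or 'none' per the matrix above."""
    if attack_evidence:
        return "P1"  # ransomware or attacker activity is always Critical
    if failed_jobs < 1:
        return "none"
    if failed_jobs > 1 and hours_since_offsite_copy > 24:
        return "P1"  # multiple failures and a stale offsite copy
    if failed_jobs > 1:
        return "P2"
    return "P3"


if __name__ == "__main__":
    print(backup_failure_severity(failed_jobs=3, hours_since_offsite_copy=30))
```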

P3 response (within 2 hours during business hours):
  1. Check backup agent connectivity on source server
     ping [source server] from backup appliance
     Test-Connection -ComputerName [source] -Count 4

  2. Check backup agent service status on source server
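     Get-Service -Name [backup agent service name] -ComputerName [source]
     (-ComputerName is available in Windows PowerShell 5.1 only; in PowerShell 7+ use
      Invoke-Command -ComputerName [source] { Get-Service -Name [backup agent service name] })
     If stopped: Start-Service or trigger via backup console → retry job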

  3. Check available space on source and target
     If space issue: raise a storage expansion ITSM ticket

  4. Check backup agent logs on source server:
     Event Viewer → Applications and Services Logs → [backup agent log]
     Look for: permissions errors, connectivity timeouts, VSS errors

  5. Retry the failed job manually
     Monitor to completion

  6. If retry fails: escalate to IT Manager

  7. Document: what failed, what was investigated, resolution, retry result
     Add note to EV-D27 daily log entry

P1/P2 response:
  Immediately notify CISO
  Assess whether backup data integrity is intact (is existing backup data accessible?)
  If ransomware suspected: invoke AT-IR incident response
  If backup data appears intact but jobs are failing: 
    implement manual backup via alternative method until automated system is restored
  Document as incident in EV-D12

Restoration testing procedure (generates EV-D28)

Quarterly restoration test

Restoration testing is mandatory quarterly. The test must restore actual CUI data from the offsite (cloud) backup — not the local backup — to confirm the most critical recovery path works.

Test schedule:
  Q1: January — restore file server data (sample of CUI files)
  Q2: April — restore database (point-in-time recovery test)
  Q3: July — restore VM (CUI application server from image backup)
  Q4: October — restore from cloud (full offsite restoration exercise)

Pre-test checklist:
  [ ] Confirm test environment is available and isolated from production
      (Restoration target must NOT be a production system)
  [ ] Confirm PAM access to backup console for the test engineer
  [ ] Confirm IT Manager is available for sign-off after test
  [ ] Confirm CISO is aware test is occurring (notification, not approval)
  [ ] Record start date/time in EV-D28

Test environment requirements:
  A separate environment must be available for restoration:
    For file restoration: a non-production file share or test VM file system
    For database: a non-production SQL/Postgres instance
    For VM: a non-production hypervisor host or cloud sandbox subscription
  Under no circumstances restore over production data during a test
  Cloud sandbox account (separate from production) should be maintained 
  specifically for DR and restoration testing

Quarterly restoration test — step by step

RESTORATION TEST: File server CUI data (Q1 example)

Step 1 — Select test data
  Choose a sample folder from the CUI file server backup:
    Minimum size: 1 GB
    Must contain actual CUI-classified content (not test files)
    Must have been backed up in the target backup run being tested
  Record: folder path, approximate size, backup job date being tested

Step 2 — Identify the backup version to restore from
  Target: the most recent successful offsite (cloud) backup
  Navigate to: backup console → cloud backup → [job name] → restore point
  Record: backup run date/time, backup type (incremental/full chain)

Step 3 — Initiate restoration to test environment
  In backup console:
    Select: cloud restore (not local restore — the offsite path is what's being tested)
    Target: test server / test file path (NOT production)
    Encryption key: retrieve the offsite DEK from PAM vault safe "Backup-Keys-CUI-Offsite"
      (access is limited to IT Manager + CISO; see Backup key management)
    Start restoration and monitor progress

  Note start time and expected duration
  If restoration takes >150% of expected duration: investigate before proceeding

Step 4 — Verify data completeness
  After restoration completes:
    File count check:
      Source (from backup metadata): [N] files, [X] GB
      Restored: [N] files, [X] GB
      Match: Yes / No — if No, identify missing files and investigate

    Hash verification (spot check):
      Select 5 random files from the restored set
      Compare SHA-256 hash of restored file against backup metadata hash:
        Get-FileHash [restored file path] -Algorithm SHA256
      All 5 must match their backup metadata hashes
      If any hash mismatch: data integrity failure — escalate to CISO immediately
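
For larger samples, or on a Linux test host, the spot check can be scripted. The sketch below assumes the backup metadata hashes have been exported to a mapping of relative path to SHA-256 hex digest; that export format and the function names are hypothetical and depend on the backup product.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restored files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def spot_check(restore_root: Path, manifest: dict[str, str],
               sample_size: int = 5, seed=None) -> list[str]:
    """Return the sampled relative paths that are missing or whose hash
    differs from the manifest. An empty list means the spot check passed."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(manifest), min(sample_size, len(manifest)))
    mismatches = []
    for rel in sample:
        target = restore_root / rel
        if not target.is_file() or sha256_of(target) != manifest[rel].lower():
            mismatches.append(rel)
    return mismatches
```

Any non-empty result corresponds to the data integrity failure case above and follows the same escalation.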

Step 5 — Verify data is usable
  Open restored CUI documents in appropriate application
  Confirm: documents open correctly, content is readable, no corruption visible
  For databases: execute a sample query against the restored database
  For VMs: boot the restored VM image and confirm services start

Step 6 — Verify decryption
  The restoration test also confirms that backup encryption is functioning:
  If the data was restored and is readable, the decryption succeeded
  Additionally: attempt to access the raw backup file on cloud storage 
  without the decryption key — confirm the raw file is unreadable binary

Step 7 — Clean up
  Delete all restored data from the test environment after verification
  Confirm deletion: directory listing confirms empty
  Log deletion in EV-D28

Step 8 — Complete EV-D28 record

  EV-D28 Quarterly Backup Restoration Test Record — [YYYY-QQ]

  Test date: [date]
  Test type: [File / Database / VM / Full offsite]
  Backup source: [job name]
  Backup run tested: [date/time of backup run]
  Restore-from location: Cloud / Local (must be Cloud for all quarterly tests)
  Restoration target: [test server / environment name]

  Restoration start time: [HH:MM UTC]
  Restoration completion time: [HH:MM UTC]
  Total duration: [minutes]

  Completeness check:
    Expected files/GB: [N files, X GB]
    Restored files/GB: [N files, X GB]
    Match: Yes / No — [if No: investigation outcome]

  Hash verification (5 files):
    File 1: [filename] — Hash match: Yes / No
    File 2: [filename] — Hash match: Yes / No
    File 3: [filename] — Hash match: Yes / No
    File 4: [filename] — Hash match: Yes / No
    File 5: [filename] — Hash match: Yes / No

  Data usability confirmed: Yes / No — [notes]
  Decryption verification: Pass / Fail
  Raw data unreadable without key: Confirmed / Not tested

  Test environment cleaned up: Yes — [date/time]

  Issues found: None / [description of any issues and resolution]

  Tested by: [engineer name and role]
  IT Manager review and sign-off: _________________ Date: _________

  File at: EV-D · BCM → Restoration Tests → [YYYY-QQ]

Backup key management

Backup encryption key hierarchy:

Master key (HSM-backed):
  Stored in: HSM (hardware security module) — cloud KMS with FIPS 140-2 Level 3 HSM
  Access: CISO + IT Manager only (dual control for key operations)
  Purpose: wraps all data encryption keys (DEKs)
  Rotation: annual — requires planned process with CISO approval

Data encryption keys (DEKs) — one per backup tier:
  Copy 2 (local backup) DEK:
    Stored in: PAM vault, safe "Backup-Keys-CUI"
    Access: IT Operations (backup admin role in PAM)
    Rotation: annual (aligned with master key rotation)

  Copy 3 (offsite/cloud) DEK:
    Stored in: PAM vault, safe "Backup-Keys-CUI-Offsite"
    Access: IT Manager + CISO only (higher-value: offsite key controls most 
            sensitive recovery path)
    Rotation: annual

  Archive DEK (7-year retention tier):
    Stored in: PAM vault + physical encrypted backup in CISO's safe
    Access: CISO + IT Manager (joint access for archive restoration)
    Rotation: 3 years (less frequent to reduce re-encryption burden on archive tier)

Key recovery procedure (if PAM is unavailable):
  Emergency decryption keys are stored in a sealed, CISO-signed physical envelope
  in the fire safe alongside the break-glass credentials (see FC-03)
  Envelope opened only with CISO + IT Manager present
  After the key has been used, it is resealed immediately in a new CISO-signed envelope

Annual key rotation procedure:
  1. Generate new DEK in KMS
  2. Re-encrypt backup data with new DEK (most backup platforms support in-place re-encryption)
  3. Verify backup jobs complete successfully with new key
  4. Run a restoration test using the new key
  5. Update PAM vault with new key; revoke the old key only after confirming that
     no retained backup still depends on it
  6. Record in the key management log: old key ID, new key ID, rotation date, engineer

OP-02 · Certificate Management

Purpose and scope

This procedure governs the lifecycle of all X.509 certificates used by CUI-scope systems — from request and issuance through monitoring, renewal, and revocation. It implements NIST 800-171 3.13.8 (protect CUI in transit), 3.13.10 (key management), and ISO 27001 Annex A 8.24 (use of cryptography).

Evidence generated: EV-D30 (certificate and key inventory).

Note on browser deadlines: in April 2025 the CA/Browser Forum adopted Ballot SC-081v3 (proposed by Apple and supported by the major browser vendors), which reduces the maximum validity of public TLS certificates in stages: 200 days from March 2026, 100 days from March 2027, and 47 days from March 2029. Certificate management must increasingly be automated; manual renewal of short-lived certificates is not operationally sustainable. ACME automation is mandatory for all public-facing certificates.

Certificate inventory (EV-D30)

The certificate inventory is the authoritative register of all X.509 certificates in use on CUI-scope systems. It is reviewed monthly and triggers renewal workflows at 60-day and 30-day thresholds.

EV-D30 Certificate Inventory — maintained as a live spreadsheet or 
dedicated certificate management platform (e.g. Venafi / Certbot tracking / 
HashiCorp Vault PKI)

Required fields per certificate entry:

Field                   Description
──────────────────────────────────────────────────────────────────────
Certificate ID          Unique internal reference (CERT-YYYY-NNN)
Common Name (CN)        Primary domain name or system identifier
Subject Alternative     All SANs — enumerate all; wildcard certs must 
Names (SANs)            list the wildcard explicitly
Certificate type        TLS-public / TLS-internal / Code signing / 
                        S/MIME / Device / SSH host / CA
Issuing CA              CA name (Let's Encrypt / DigiCert / internal CA 
                        name + tier)
Serial number           Certificate serial from the CA
SHA-256 fingerprint     Uniquely identifies this specific cert instance
Issue date              Date certificate was issued
Expiry date             Date certificate expires — the critical tracking field
Days until expiry       Calculated field — drives alert thresholds
Key algorithm           RSA-2048 / RSA-3072 / ECDSA P-256 / P-384
Key storage             Where the private key is stored (KMS / HSM / 
                        local — with justification for local)
System(s) installed on  All systems where this cert is deployed
Auto-renewal            Yes (ACME) / No (manual) — with renewal method
Renewal owner           Named role responsible for renewal
Last checked            Date inventory entry was last verified
Status                  Active / Expiring soon / Expired / Revoked
FIPS validated          Yes / No: key generated and held in a FIPS-validated
                        cryptographic module (required Yes for CUI-scope certs)

Monthly inventory review process

First Monday of each month — IT Operations reviews EV-D30:

Filter 1: Expiry within 60 days → Renewal-amber alert
  Action: Confirm renewal is in progress or scheduled
  If auto-renewal (ACME): verify automation is working (check renewal logs)
  If manual: create ITSM renewal ticket with deadline date

Filter 2: Expiry within 30 days → Renewal-red alert
  Action: Escalate to IT Manager; confirm renewal is on track
  If renewal is not in progress: treat as high-priority ITSM ticket
  Notify service owner that service may be disrupted if not renewed

Filter 3: Expiry within 7 days → Emergency
  Action: Notify CISO; implement emergency renewal; prepare for 
          potential service impact
  Emergency renewal procedure: see Emergency Renewal section below

Filter 4: Already expired → Incident
  An expired certificate on a production CUI-scope system is a 
  security incident (the system may be operating insecurely or 
  be inaccessible)
  Create incident record in EV-D12; notify CISO; replace immediately

Filter 5: SHA-1 or MD5 certificates → Immediate action
  Any certificate using SHA-1 or MD5 signature algorithm must be 
  replaced immediately regardless of expiry date
  These algorithms are deprecated — a finding in any assessment

Update inventory: after each review, update the "Last checked" field 
and update status for any changes
File updated EV-D30 in: EV-D · Cryptography → Certificate Inventory → [YYYY-MM]
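
If EV-D30 is kept as a spreadsheet export, the five filters can be applied programmatically. The sketch below is illustrative: the function name is hypothetical, expiry is compared at whole-day granularity, and the signature algorithm is assumed to be recorded as a short name such as "sha256" or "sha1".

```python
from datetime import date

WEAK_SIGNATURES = {"sha1", "md5"}


def review_findings(expiry: date, today: date, signature_hash: str = "sha256") -> list[str]:
    """Return the review filters that a certificate matches (possibly none)."""
    findings = []
    days_left = (expiry - today).days
    if days_left < 0:
        findings.append("Incident: already expired")
    elif days_left <= 7:
        findings.append("Emergency: expiry within 7 days")
    elif days_left <= 30:
        findings.append("Renewal-red: expiry within 30 days")
    elif days_left <= 60:
        findings.append("Renewal-amber: expiry within 60 days")
    # Filter 5 applies regardless of the expiry date.
    if signature_hash.lower().replace("-", "") in WEAK_SIGNATURES:
        findings.append("Immediate action: SHA-1 or MD5 signature")
    return findings


if __name__ == "__main__":
    print(review_findings(date(2025, 4, 1), date(2025, 3, 3)))
```

Only the most urgent expiry band is reported for each certificate, so a certificate 5 days from expiry appears as Emergency rather than in all three expiry filters.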

Certificate types and issuance procedures

Public-facing TLS certificates — ACME automation (mandatory)

All certificates for internet-accessible endpoints must use ACME automation.
Manual renewal for public-facing certificates is not permitted — 
the operational risk of missed renewal at 90-day (and shorter future) 
validity is too high.

Recommended ACME client: Certbot (for Linux) / win-acme (for Windows) / 
                          Caddy (if used as reverse proxy) / cert-manager (Kubernetes)

CA: Let's Encrypt (DV certificates, free, widely trusted) for standard web services
    DigiCert / Sectigo (OV/EV certificates, required for specific contract contexts)
    Note: Let's Encrypt certificates are domain-validated — they prove domain control
    but do not verify the organisation's identity. For services where organisation 
    identity is contractually required (some government portals), use OV certificates.

ACME setup — Linux with Certbot:
  Install: apt install certbot python3-certbot-nginx
  Initial certificate:
    certbot --nginx -d [domain.com] -d [www.domain.com] \
            --email [it-ops@organisation.com] \
            --agree-tos --no-eff-email
  Auto-renewal timer: certbot creates a systemd timer automatically
    Verify: systemctl status certbot.timer
    Timer fires twice daily — attempts renewal when cert has <30 days remaining

  Verify auto-renewal is working:
    certbot renew --dry-run
    Expected: "Simulating renewal of an existing certificate" — success

  SIEM alert for renewal failure:
    Monitor /var/log/letsencrypt/letsencrypt.log for "renewal failed"
    Ship log to SIEM via rsyslog
    Create SIEM alert: renewal failure → High severity → IT Operations alert

ACME setup — Windows with win-acme:
  Download: github.com/win-acme/win-acme
  Run: wacs.exe --source iis --host [domain.com] --installation iis
  Creates Windows Task Scheduler task for auto-renewal
  Verify: run task manually, confirm no errors

  Hook for SIEM notification:
    wacs.exe supports notification scripts on renewal success/failure
    Configure failure notification to send syslog event to SIEM

Certificate deployment post-renewal:
  Some services require explicit reloading after cert renewal:
    nginx: sudo nginx -s reload  (add as --deploy-hook in certbot)
    Apache: sudo apachectl graceful
    IIS: automatically picks up cert if bound correctly — verify binding

  Verify deployment:
    openssl s_client -connect [domain.com]:443 -servername [domain.com] \
            </dev/null 2>/dev/null | openssl x509 -noout -dates
    Confirm: notAfter is the newly issued certificate's expiry
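
To feed the deployment check into monitoring, the output of openssl x509 -noout -dates can be parsed for the remaining validity. The sketch below assumes that output has already been captured as text. The function names are hypothetical; ssl.cert_time_to_seconds is the Python standard library helper for this date format.

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(openssl_dates_output: str) -> datetime:
    """Extract notAfter from 'openssl x509 -noout -dates' output as UTC."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            value = line.split("=", 1)[1].strip()
            # OpenSSL prints e.g. "Jun  1 12:00:00 2025 GMT".
            seconds = ssl.cert_time_to_seconds(value)
            return datetime.fromtimestamp(seconds, tz=timezone.utc)
    raise ValueError("no notAfter line found")


def days_remaining(not_after: datetime, now: datetime) -> int:
    """Whole days of validity left (negative once the cert has expired)."""
    return (not_after - now).days


if __name__ == "__main__":
    sample = "notBefore=Mar  3 12:00:00 2025 GMT\nnotAfter=Jun  1 12:00:00 2025 GMT\n"
    expiry = parse_not_after(sample)
    print(expiry.isoformat(), days_remaining(expiry, datetime.now(timezone.utc)))
```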

Internal TLS certificates — internal CA

For internal services (management interfaces, internal APIs, internal 
monitoring systems) that don't require public CA validation:

Internal CA hierarchy:
  Root CA (offline — air-gapped):
    Validity: 10 years
    Key: RSA-4096 stored on FIPS 140-2 HSM
    Location: physically secured (CISO's custody)
    Used only to sign subordinate CAs — never issues end-entity certs directly

  Issuing CA (online — internal network):
    Validity: 5 years
    Key: RSA-3072 or ECDSA P-384, HSM-backed if available; KMS minimum
    Signed by: Root CA (planned offline signing ceremony; the Root CA stays air-gapped)
    Issues: end-entity certificates for internal services
    CRL/OCSP: must be reachable from all internal systems that validate certs
    OCSP URL: http://[internal-ocsp-server]/ocsp
    CRL URL: http://[internal-crl-server]/crl/issuing.crl

Requesting an internal certificate:
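  1. Generate key pair on the target system (key never leaves the system):
     # Linux (OpenSSL 1.1.1 or later for -addext):
     openssl genrsa -out [service].key 3072
     openssl req -new -key [service].key -out [service].csr \
             -subj "/CN=[service.internal.domain]/O=[Org Name]/C=GB" \
             -addext "subjectAltName=DNS:[service.internal.domain]"

     # Windows (certreq; New-SelfSignedCertificate creates a self-signed
     # certificate, not a CSR, so it is not used here).
     # Create [service].inf containing:
     #   [NewRequest]
     #   Subject = "CN=[service.internal.domain],O=[Org Name],C=GB"
     #   KeyLength = 3072
     #   Exportable = FALSE
     #   MachineKeySet = TRUE
     #   RequestType = PKCS10
     #   [Extensions]
     #   2.5.29.17 = "{text}"
     #   _continue_ = "dns=[service.internal.domain]&"
     certreq -new [service].inf [service].csr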

  2. Submit CSR to internal CA:
     # If using Microsoft AD CS:
     certreq -submit -config "[CA-server]\[CA-name]" [service].csr [service].cer

     # If using CFSSL / Step CA / Vault PKI:
     [tool-specific command]

  3. Install signed certificate on target system

  4. Register in EV-D30 certificate inventory:
     Add entry with all required fields
     Set renewal reminder based on validity period

Internal cert validity periods:
  Standard services: 365 days (1 year)
  High-churn environments (frequently rebuilt): 90 days with ACME automation
  Root CA: 3650 days (10 years)
  Issuing CA: 1825 days (5 years)
  SSH host certificates (from SSH CA): 365 days
  Short-lived SSH user certs (from PAM): 8 hours

Code signing certificates

Code signing is used to sign scripts and binaries deployed internally,
satisfying the software restriction policies in AT-CM.

Certificate type: Code signing certificate from internal CA
                  OR from commercial CA (if external distribution required)
Key storage: offline HSM (USB hardware token — YubiKey or nShield)
Custodian: CISO (primary) + IT Manager (secondary)
Key usage: digitalSignature only (not key encipherment)
Validity: 3 years
Archive requirements: keep all code signing certs permanently —
                      code signed with a cert must be verifiable for 
                      the life of the signed code

Signing procedure:
  # Windows (PowerShell — requires signing cert in cert store):
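  # Select a single certificate; filter by thumbprint if several are present
  $cert = Get-ChildItem -Path Cert:\CurrentUser\My -CodeSigningCert | Select-Object -First 1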
  Set-AuthenticodeSignature -FilePath [script.ps1] -Certificate $cert

  # Verify signature:
  Get-AuthenticodeSignature [script.ps1]
  Expected: Status = Valid, SignerCertificate = [org CN]

  # Linux binary signing (via gpg or sigstore):
  [process depends on toolchain — document specific procedure for each 
   signed software type]

Timestamp countersignature:
  Always use a trusted timestamp authority when signing code:
    -TimestampServer http://timestamp.digicert.com
  Reason: timestamped signatures remain valid after the signing cert expires

Certificate revocation procedure

When to revoke a certificate immediately (within 4 hours):
  - Private key is compromised or suspected compromised
  - Certificate was issued in error (wrong CN, wrong organisation)
  - System the certificate is installed on has been decommissioned
  - Security incident where private key may have been exposed

Revocation process:

For Let's Encrypt / ACME certificates:
  certbot revoke --cert-path /etc/letsencrypt/live/[domain]/cert.pem \
                 --reason keyCompromise
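  Let's Encrypt normally processes the revocation within minutes, but client
  revocation checking is inconsistent (many browsers rely on vendor-distributed
  revocation lists rather than live OCSP queries), so always replace the
  certificate and its key as well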

For internal CA certificates:
  1. Access the CA management interface
  2. Find the certificate by serial number
  3. Revoke with reason code (keyCompromise / affiliationChanged / 
     superseded / cessationOfOperation)
  4. Update CRL immediately: force CRL publication rather than waiting 
     for scheduled publication
  5. Verify: openssl crl -in [CRL file] -noout -text | grep -i -A1 [serial number]
     (add -inform DER if the CRL is DER-encoded)
     Confirm the serial appears in the CRL with the correct revocation date

For device certificates (MDM-enrolled):
  Intune: Devices → [device] → Revoke certificate
  Jamf: Computers → [computer] → Management → Revoke certificate
  The revoked certificate will be removed and re-issued on next MDM check-in

After revocation:
  Update EV-D30: change status to "Revoked", add revocation date and reason
  If key was compromised: log as security incident in EV-D12
  Issue replacement certificate immediately (do not leave service without valid cert)

Emergency certificate renewal

When a certificate expiry is discovered with less than 7 days remaining 
and the standard renewal process cannot complete in time:

Step 1 — Assess impact
  Is this certificate currently causing service disruption (expiry in the past)?
  Which systems and users are affected?
  Is this a CUI-scope service?

Step 2 — Notify
  Immediately notify IT Manager
  If CUI service is disrupted: notify CISO within 1 hour
  If public-facing service causing customer impact: notify via incident process

Step 3 — Emergency issuance
  For ACME-managed certs that have failed auto-renewal:
    certbot renew --force-renewal --cert-name [certname]
    Investigate why auto-renewal failed after emergency renewal is complete

  For manually managed certs expiring within 24 hours:
    Use 90-day Let's Encrypt cert as emergency replacement while 
    longer-validity cert is properly procured and installed

  For internal CA certs:
    Use emergency CA operation procedure: issue a 90-day emergency cert from
    the Issuing CA, install it, then issue a proper 365-day cert
    via normal process within 7 days

Step 4 — Install and verify
  Install replacement certificate
  Verify service is accessible:
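    curl -I https://[service-url] → should complete the TLS handshake with no
      certificate error and return the service's normal status (typically 200)
    openssl s_client -connect [service]:443 </dev/null 2>/dev/null |
    openssl x509 -noout -dates → confirm new expiry date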

Step 5 — Post-incident review
  Identify why the certificate expiry was not caught by the inventory process
  Create corrective action in EV-A03
  Confirm EV-D30 is updated with the new certificate details
  Update auto-renewal automation if the failure was automation-related

OP-03 · Logging and SIEM Management

Purpose and scope

This procedure governs the operation, maintenance, and health monitoring of the SIEM platform and the log collection infrastructure. It implements NIST 800-171 controls 3.3.1 through 3.3.9 (Audit and Accountability family) and ISO 27001 Annex A 8.15 (logging), 8.16 (monitoring), and 8.17 (clock synchronisation).

Evidence generated: EV-F01 (monthly log review), EV-F06 (SIEM health report).

SIEM platform operational overview

SIEM platform: [product name — e.g. Microsoft Sentinel / Splunk / IBM QRadar / 
                Elastic SIEM — specify deployed product]
SIEM location: [cloud-hosted / on-premises]
SIEM admin console: [URL or access path]
SIEM admin access: PAM-mediated — privileged account checkout required
                   Maximum 2 SIEM admin accounts: CISO + IT Manager
SIEM analyst access: [role-based access — see RBAC configuration below]

Retention tiers:
  Hot (searchable online): 90 days
  Warm (retrievable within 24 hours): days 91–365
  Archive (retrievable within 72 hours): days 366–1095 (36 months total)

Retention enforcement: automated tiering configured in SIEM platform
  Do not manually delete log data
  Do not modify retention settings without CISO approval and RFC
  Legal hold: certain log segments can be set to legal hold — 
              exempts them from normal retention lifecycle
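
The tier boundaries above can be expressed as a small lookup. This is an illustrative sketch only: the function names are hypothetical, and the real tiering is enforced by the SIEM platform's own lifecycle configuration, not by a script.

```python
def retention_tier(age_days: int) -> str:
    """Map the age of a log segment (in days) to its retention tier."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days <= 90:
        return "hot"        # searchable online
    if age_days <= 365:
        return "warm"       # retrievable within 24 hours
    if age_days <= 1095:
        return "archive"    # retrievable within 72 hours
    return "beyond-retention"


def eligible_for_deletion(age_days: int, legal_hold: bool = False) -> bool:
    """A segment leaves the lifecycle only after 1095 days and only if
    it is not under legal hold."""
    return retention_tier(age_days) == "beyond-retention" and not legal_hold
```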

Log source management

Adding a new log source

Every new CUI-scope system must be added to the SIEM as a log source within 5 business days of deployment. Failure to add a system is a gap against NIST 800-171 3.3.1.

Step 1 — Identify the log source type and forwarding method
  Reference: AT-AU Section 3 (log source inventory) for required event categories
  Determine forwarding method:
    Windows → Windows Event Forwarding (WEF) to Windows Event Collector (WEC)
             OR direct SIEM agent installation
    Linux → rsyslog + audisp-syslog → SIEM syslog listener
    Network devices → syslog to SIEM syslog listener (UDP 514 or TCP 514/6514)
    Cloud → native connector (Sentinel Data Connector / Splunk Add-on / etc.)
    SaaS (M365, AWS) → platform-native SIEM integration

Step 2 — Configure the source system for log forwarding

  Windows via WEF:
    On source system (GPO or local policy):
      Computer Configuration → Administrative Templates → Windows Components →
      Event Forwarding → Configure target Subscription Manager:
        Server=http://[WEC server FQDN]:5985/wsman/SubscriptionManager/WEC,Refresh=60

    On WEC server:
      wecutil cs [subscription-name].xml  (create subscription)
      Verify source computers are reporting: wecutil gr [subscription-name]
      (runtime status), then confirm events appear in the Forwarded Events log

    WEF Subscription XML minimum content:
      <QueryList>
        <Query Id="0">
          <Select Path="Security">*</Select>
          <Select Path="System">*[System[(EventID=7045)]]</Select>
          <Select Path="Microsoft-Windows-PowerShell/Operational">
            *[System[(EventID=4104)]]
          </Select>
        </Query>
      </QueryList>

  Linux via rsyslog:
    Edit /etc/rsyslog.d/00-siem.conf:
      *.* @@[SIEM-IP]:514    # TCP (recommended)
      # Or: *.* @[SIEM-IP]:514   # UDP

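    Configure audisp to forward auditd events via syslog:
      Edit /etc/audisp/plugins.d/syslog.conf
      (audit 3.x and later: /etc/audit/plugins.d/syslog.conf, where the plugin
      is typically path = /sbin/audisp-syslog with type = always):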
        active = yes
        direction = out
        path = builtin_syslog
        type = builtin
        args = LOG_INFO
        format = string

    Restart: systemctl restart rsyslog auditd
      (RHEL-family systems refuse a systemctl restart of auditd;
      use: service auditd restart)

    Test: logger "Test SIEM forwarding from [hostname]"
    Verify test message appears in SIEM within 60 seconds

  Network device (syslog):
    Configure syslog on the device (vendor-specific commands):
      Example (Cisco IOS):
        logging host [SIEM-IP] transport tcp port 514
        logging trap informational
        logging on
        service timestamps log datetime msec
        ntp server [NTP-IP]  (ensure timestamps are synchronised)

    Verify: trigger a known event (interface flap, authentication) 
            and confirm event appears in SIEM

Step 3 — Create the log source entry in SIEM
  In the SIEM admin console:
    Add new data source / log source
    Configure parser/normalisation (use standard parsers where available)
    Set expected event volume (used for health monitoring)
    Configure field extraction mappings (IP, user, event type, etc.)

Step 4 — Verify receipt and parsing
  After configuration:
    Confirm events are arriving: SIEM search for source=[new-system]
    Confirm events are parsed correctly: key fields (user, event ID, timestamp) 
    are populated — not raw strings
    If events arrive but are not parsed: check parser configuration; 
    may need custom field extraction rule

Step 5 — Create health monitoring entry
  In SIEM health monitoring configuration:
    Add the new source to the "expected sources" list
    Set the "silent alert" threshold: 60 minutes for Critical sources;
    4 hours for High sources; 24 hours for Medium sources

  Reference the log source tier from AT-AU Section 3:
    Identity systems (Entra ID, AD) → Critical → 60-minute silence alert
    Network boundary (firewall, IDS) → Critical → 60-minute silence alert
    Endpoints → High → 4-hour silence alert
    Applications → Medium/High per sensitivity
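
The silence thresholds above amount to a per-tier timeout on the last event received. A minimal sketch of that check follows, assuming the last-seen timestamp for each source can be exported from the SIEM; the data shape and the function name silent_sources are hypothetical.

```python
from datetime import datetime, timedelta

# Thresholds taken from the procedure text above.
SILENCE_THRESHOLDS = {
    "Critical": timedelta(minutes=60),
    "High": timedelta(hours=4),
    "Medium": timedelta(hours=24),
}


def silent_sources(last_seen: dict[str, tuple[str, datetime]],
                   now: datetime) -> list[str]:
    """Return the names of sources that have been silent for longer than
    the threshold of their tier.

    `last_seen` maps source name -> (tier, timestamp of last event).
    """
    return sorted(
        name
        for name, (tier, timestamp) in last_seen.items()
        if now - timestamp > SILENCE_THRESHOLDS[tier]
    )
```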

Step 6 — Update the log source inventory in AT-AU Section 3
  Add the new source to the table with:
    Source name and type
    Required event categories
    Forwarding mechanism
    Expected daily event volume
    Priority tier

Step 7 — Update EV-D19 (SIEM configuration baseline)
  Add the new source to the baseline document
  Record: date added, system name, forwarding method, SIEM data source name

Removing a decommissioned log source

When a CUI-scope system is decommissioned:
  1. Decommission the log source in SIEM admin console
     (retain historical data — only remove the active collection)
  2. Remove from the expected sources health monitoring list
  3. Update AT-AU Section 3 log source inventory: mark as decommissioned with date
  4. Update EV-D19 baseline
  5. Retain historical log data for the full retention period
     (a decommissioned system's logs remain part of the audit trail)

NTP synchronisation management

SIEM log correlation depends on all sources having synchronised clocks. On a Windows domain, clock drift beyond the default Kerberos tolerance of 5 minutes causes authentication to fail, which automatically enforces coarse clock sync. Smaller drift (seconds to minutes) still corrupts log correlation and creates ambiguity in investigation timelines.

NTP hierarchy:

Upstream public time sources (typically Stratum 1 or 2; internal clients
must not contact these directly):
    time.google.com
    uk.pool.ntp.org

Organisational NTP server (the only host that contacts the upstream sources):
  Server: [internal NTP server hostname / IP]
  Implementation: ntpd or chrony on a dedicated or shared Linux server
  Authentication: NTP symmetric key or NTPsec (prevents NTP spoofing)

  chrony configuration (/etc/chrony.conf):
    # time.google.com smears leap seconds; Google advises against mixing it
    # with non-smearing sources such as pool.ntp.org, so confirm this pairing
    server time.google.com iburst prefer
    server uk.pool.ntp.org iburst
    allow [internal network CIDR]     # Allow internal clients
    local stratum 10                   # Keep serving time (as stratum 10) if upstream sync is lost
    keyfile /etc/chrony.keys
    log measurements statistics tracking

CUI-scope systems (all sync from the internal NTP server, one stratum below it):
  Windows: configure via GPO
    Computer Configuration → Administrative Templates → System →
    Windows Time Service → Time Providers → Configure Windows NTP Client:
      NtpServer: [internal-ntp-server],0x9
      Type: NTP

  Linux: chrony or ntpd pointing to internal NTP server:
    server [internal-ntp-server] iburst
    Verify: chronyc tracking
    Acceptable offset: <0.1 seconds; Alert threshold: >60 seconds

  Network devices: vendor-specific NTP configuration (see BL-NET baseline)

Monitoring for clock drift:
  SIEM monitor: if any source timestamp differs from SIEM server clock by 
  >60 seconds, generate High alert

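  Windows domain controller check:
    w32tm /query /status /verbose /computer:[DC-hostname]
    Key fields: "Source", "Last Successful Sync Time", and "Phase Offset"
    Measure the offset against the internal NTP server directly:
      w32tm /stripchart /computer:[internal-ntp-server] /samples:5 /dataonly
    Offset should be <1 second for domain members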

  Linux check:
    chronyc tracking | grep "RMS offset"
    chronyc sources -v  (shows all peers and their offsets)

  The SIEM server itself must be synced:
    Its timestamp is the reference for all correlation
    SIEM server clock offset should be <100 milliseconds from NTP source
    Monitor the SIEM server's NTP status as part of EV-F06 monthly health check
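
The Linux check can be automated by parsing the output of chronyc tracking. The sketch below assumes the usual "System time : N seconds fast/slow of NTP time" line is present in the captured output. The "ok" and "alert" bands use the thresholds stated above; the intermediate "investigate" band is an assumption added for illustration.

```python
import re

ACCEPTABLE_OFFSET_S = 0.1   # procedure: acceptable offset is below 0.1 seconds
ALERT_OFFSET_S = 60.0       # procedure: alert when offset exceeds 60 seconds


def system_offset_seconds(tracking_output: str) -> float:
    """Extract the absolute 'System time' offset from `chronyc tracking` output."""
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds",
        tracking_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'System time' line found")
    return float(match.group(1))


def classify_offset(offset_s: float) -> str:
    """Classify an absolute clock offset against the procedure's thresholds."""
    if offset_s > ALERT_OFFSET_S:
        return "alert"
    if offset_s < ACCEPTABLE_OFFSET_S:
        return "ok"
    return "investigate"   # assumed middle band, not defined in the procedure
```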

SIEM correlation rule management

Correlation rules (also called detection rules, analytics, or alert rules 
depending on the SIEM product) require lifecycle management:

Rule library location:
  [SIEM platform] → Analytics → Active Rules
  Rules are also version-controlled in the configuration management 
  repository (GitLab/ADO) — SIEM rule definitions exported as YAML/JSON
  and committed to the repo, allowing change tracking and rollback

Rule categories for CUI-scope monitoring:

Category 1 — Credential and identity threats:
  MFA approval from new country/device combination (impossible travel)
  MFA fatigue: >5 MFA push notifications in 5 minutes to same account
  Password spray: >10 failed logins across >10 accounts in 5 minutes from one IP
  Admin account login outside business hours without prior notification
  Legacy authentication success (should never succeed — CA policy blocks it)
  Break-glass account sign-in (any use is a Critical alert)

Category 2 — Endpoint threats:
  Windows Event Log cleared (Event ID 1102) — attacker evidence removal
  PowerShell script execution with suspicious keywords (Invoke-Expression,
    DownloadString, EncodedCommand) — Event ID 4104
  New service installed (Event ID 7045) — common malware persistence
  Process spawning anomaly (Word/Excel spawning cmd.exe or PowerShell)
  USB storage device connected (vendor-specific EDR event)
  AV threat detected and NOT remediated within 30 minutes

Category 3 — Network threats:
  Data transfer volume anomaly: source sending >5× their 30-day baseline 
    to an external destination
  Connection to an IP or domain flagged by the threat intel feed (for example,
    infrastructure linked to exploitation of CISA KEV vulnerabilities)
  DNS query to DGA-pattern domain (high-entropy domain names)
  Firewall rule modification outside maintenance window
  New outbound port opened that was not previously used

Category 4 — Audit infrastructure threats:
  SIEM log source silent for >60 minutes (Critical sources)
  Auditd service stopped on any CUI-scope Linux server
  SIEM configuration change (any change to retention, sources, or rule config)
  Log volume from a source drops >50% from 7-day baseline (possible suppression)

Category 5 — CUI access anomalies:
  CUI file share access after 22:00 or before 06:00
  CUI file share access from account not normally accessing CUI
  Mass download from CUI file share (>500 files in 10 minutes by one account)
  CUI file share access during an active leaver process (account in 90-day hold)
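
As an illustration of how one of these rules translates into logic, the password spray rule from Category 1 (more than 10 failed logins across more than 10 accounts in 5 minutes from one IP) can be sketched as a sliding window per source IP. This is not the SIEM's rule syntax; the event tuple shape and the function name password_spray_ips are assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
FAILURE_THRESHOLD = 10   # rule fires on MORE than 10 failed logins ...
ACCOUNT_THRESHOLD = 10   # ... across MORE than 10 distinct accounts


def password_spray_ips(failures) -> set[str]:
    """Return source IPs matching the password spray rule.

    `failures` is an iterable of (timestamp, source_ip, account) tuples,
    one per failed login.
    """
    by_ip = defaultdict(list)
    for timestamp, ip, account in failures:
        by_ip[ip].append((timestamp, account))

    flagged = set()
    for ip, events in by_ip.items():
        events.sort()
        window = deque()
        for timestamp, account in events:
            window.append((timestamp, account))
            # Drop events that fell out of the 5-minute window.
            while timestamp - window[0][0] > WINDOW:
                window.popleft()
            accounts = {name for _, name in window}
            if len(window) > FAILURE_THRESHOLD and len(accounts) > ACCOUNT_THRESHOLD:
                flagged.add(ip)
                break
    return flagged
```

Counting distinct accounts separately from total failures is what distinguishes a spray from a brute-force attempt against a single account.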

Rule update process:
  New rules require CISO approval and a Normal RFC (AT-CM change process)
  Changes to existing rules: Normal RFC
  Emergency rule creation (during an active incident): Emergency RFC, 
    retrospective CISO approval within 24 hours
  Rule deletion: Normal RFC; rule is archived (exported to git), not deleted,
    so it can be re-activated if needed

  All rule changes are logged in the SIEM admin audit trail
  SIEM admin audit trail itself is monitored for unexpected changes

Monthly log review procedure (generates EV-F01)

The monthly log review is conducted by the Security Analyst in the first week of each month, covering the preceding calendar month.

Target completion: by the 7th of each month
Duration: approximately 2 hours
Conducted by: Security Analyst (primary), CISO (review and sign-off)

STEP 1 — SIEM health check (15 minutes)

  A. Open SIEM health dashboard
     Confirm all expected log sources show as active
     Review any sources that went silent in the past month:
       Source name, silence duration, resolution

  B. Verify log volume is within expected ranges
     Sources with significantly lower volume than the 30-day baseline:
       Investigate — low volume may indicate suppression or forwarding issue
     Sources with significantly higher volume than baseline:
       Note — may be legitimate (security incident, patch activity) or 
       may indicate log flooding / misconfiguration

  C. Check storage utilisation trends
     Hot tier: [X]% used — trend from last month
     Warm tier: [X]% used
     Action if >80%: create ITSM ticket for capacity expansion
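
The volume comparison in item B can be made repeatable with a simple ratio against the baseline. In this sketch the 0.5 lower bound mirrors the 50% drop used by the Category 4 correlation rule, and the 2.0 upper bound is purely an assumption, since the procedure only says "significantly". The function name volume_status is hypothetical.

```python
from statistics import mean


def volume_status(baseline_daily_counts: list[int], current_daily_count: int,
                  low_ratio: float = 0.5, high_ratio: float = 2.0) -> str:
    """Compare a source's current daily event count with its baseline average.

    Returns "low", "high", "normal", or "no-baseline" when the baseline
    is empty or all zeros.
    """
    if not baseline_daily_counts:
        return "no-baseline"
    baseline = mean(baseline_daily_counts)
    if baseline == 0:
        return "no-baseline"
    ratio = current_daily_count / baseline
    if ratio < low_ratio:
        return "low"     # possible suppression or forwarding issue
    if ratio > high_ratio:
        return "high"    # possible incident, patch activity, or log flooding
    return "normal"
```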

STEP 2 — Alert queue review (30–60 minutes)

  Open SIEM alert queue: filter to past month
  Work through all alerts not yet closed from the previous period

  For each alert:
    Triage: True Positive / False Positive / Under Investigation
    True Positive: link to incident record in EV-D12
    False Positive: document why and consider rule tuning
    Under Investigation: assign to analyst with target closure date

  Summary metrics to record:
    Total alerts: [N]
    True positives: [N] → incidents raised: [N]
    False positives: [N] → rule tuning actions: [N]
    Still under investigation: [N]

STEP 3 — Privileged account activity review (20 minutes)

  SIEM search: (source=AD OR source=EntraID)
               AND (username like "adm-%" OR role="GlobalAdmin")
               AND timeframe=[last month]

  Review for:
    Admin logins outside business hours not corresponding to a change window
    Admin logins from unexpected locations
    Admin actions outside their normal system scope
    Any use of break-glass accounts (should be zero — any use = escalate to CISO)

  Document: "Privileged activity reviewed — [N] events — [Clean / Anomalies found]"
  If anomalies found: describe each and document investigation outcome

STEP 4 — CUI system access review (20 minutes)

  SIEM search: (source=[CUI-fileserver] OR source=[CUI-database])
               AND timeframe=[last month]

  Review for:
    After-hours access events (outside 07:00–20:00 local time)
    Access by accounts not normally accessing CUI (new first-time accesses)
    Large file access events (>100 files accessed in one session)
    Access denied events (repeated failures may indicate reconnaissance)

  Document: "CUI access reviewed — [N] events — [Clean / Anomalies found]"

STEP 5 — Network boundary review (15 minutes)

  Firewall summary: total permitted / denied connections (monthly totals)
  IDS/IPS: alert count by severity — Critical: [N], High: [N], Medium: [N]
  DNS anomalies: NXDOMAIN spike count vs previous month
  Web proxy: blocked malicious categories count

  Flag for investigation:
    IDS Critical alerts that were not escalated to the incident process
    DNS NXDOMAIN count >200% of prior month baseline

  Document: "Network boundary reviewed — [summary stats] — [Clean / Anomalies]"

STEP 6 — Authentication anomaly review (15 minutes)

  SIEM search: authentication failures > [threshold] for any account
  Failed login volume: total by system
  Account lockout events: list accounts locked more than once in the month
  MFA anomalies: denied notifications (MFA fatigue indicators)
  Successful logins from new countries or devices

  Document: "Authentication reviewed — [summary] — [Clean / Anomalies]"

STEP 7 — Sign off EV-F01

  EV-F01 Monthly SIEM Log Review Record — [YYYY-MM]

  Review period: [first date] to [last date of month]
  Reviewed by: [Security Analyst name]
  Review date: [date completed]

  1. SIEM Health
     Log sources active: [N of N expected]
     Log sources with gaps: [list or "None"]
     Storage: Hot [%] / Warm [%]
     Issues: [None / description]

  2. Alert Queue
     Total alerts in period: [N]
     True positives: [N] — Incidents raised: [list incident IDs or "None"]
     False positives: [N] — Rule tuning actions: [list or "None"]

  3. Privileged Account Activity
     Events reviewed: [N]
     Assessment: Clean / Anomalies found
     Anomalies: [description or "None"]

  4. CUI Access Activity
     Events reviewed: [N]
     Assessment: Clean / Anomalies found
     Anomalies: [description or "None"]

  5. Network Boundary
     Assessment: Clean / Anomalies found
     IDS/IPS summary: Critical [N], High [N], Medium [N]

  6. Authentication
     Assessment: Clean / Anomalies found
     Lockouts: [accounts locked more than once]

  Overall assessment: Normal / Anomalies identified (all investigated) / 
                      Incidents opened (see above)

  Security Analyst sign-off: _________________ Date: _________
  CISO review: _________________ Date: _________

  File at: EV-F · Continuous Monitoring → Log Reviews → [YYYY-MM]

Monthly SIEM health report (generates EV-F06)

EV-F06 is distinct from EV-F01. EV-F01 records what was found in the logs. EV-F06 records whether the audit infrastructure itself is functioning correctly.

EV-F06 SIEM Health Report — [YYYY-MM]

Produced by: IT Manager
Target completion: by the 10th of each month

Section 1 — Log source inventory status
  For each source in the log source inventory (AT-AU Section 3):

  | Source name | Type | Expected daily vol | Actual daily vol | Last event | Status |

  Status values:
    Active: events received within expected window
    Degraded: events received but volume significantly below expected
    Gap: no events received for [N] hours/days during the month
    (If Gap: note gap duration and resolution)

Section 2 — Retention verification
  Sample retrieval test — monthly:
    Randomly select a date between 9 and 11 months ago
    Search for events from that date: source=[sample source] 
    timeframe=[that specific date]
    Confirm events are retrievable:
      Test date: [date]
      Events found: Yes / No
      Retrieval time: [seconds]
      Result: Pass / Fail
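
To keep the sample date genuinely random, not chosen by habit, the selection can be scripted. This sketch interprets "between 9 and 11 months ago" as calendar months counted back from today, which is an assumption. Both function names are illustrative.

```python
import random
from datetime import date, timedelta


def months_before(day: date, months: int) -> date:
    """Return the same day-of-month `months` earlier, clamped to month end."""
    index = day.year * 12 + (day.month - 1) - months
    year, month = divmod(index, 12)
    month += 1
    # Find the last day of the target month (e.g. 31 March -> 28 February).
    next_month_first = date(year + month // 12, month % 12 + 1, 1)
    last_day = (next_month_first - timedelta(days=1)).day
    return date(year, month, min(day.day, last_day))


def pick_retention_test_date(today: date, rng: random.Random) -> date:
    """Pick a uniformly random date between 11 and 9 months before `today`."""
    earliest = months_before(today, 11)
    latest = months_before(today, 9)
    span_days = (latest - earliest).days
    return earliest + timedelta(days=rng.randint(0, span_days))
```

Passing a seeded random.Random makes the choice reproducible if the test date later needs to be evidenced.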

Section 3 — Storage capacity
  Hot tier: [X]% of [total capacity]
  Warm tier: [X]% of [total capacity]
  Growth trend: [±X]% month-over-month
  Projected time to 80% (hot tier): [N] months at current growth rate
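
The projection to 80% can be calculated directly from the current utilisation and the growth trend. The sketch below assumes compound month-over-month growth; if growth is closer to linear in practice, a linear projection would be more appropriate. The function name is hypothetical.

```python
import math


def months_to_threshold(current_pct: float, monthly_growth_pct: float,
                        threshold_pct: float = 80.0) -> float:
    """Estimate months until utilisation reaches `threshold_pct`.

    Assumes utilisation grows by `monthly_growth_pct` percent of its current
    value each month. Returns 0.0 if already at or above the threshold and
    math.inf if utilisation is flat or shrinking.
    """
    if current_pct <= 0:
        raise ValueError("current_pct must be positive")
    if current_pct >= threshold_pct:
        return 0.0
    if monthly_growth_pct <= 0:
        return math.inf
    growth_factor = 1 + monthly_growth_pct / 100
    return math.log(threshold_pct / current_pct) / math.log(growth_factor)
```

For example, a hot tier at 50% growing 10% per month reaches 80% in a little under five months under this assumption.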

Section 4 — Tamper evidence check
  SIEM admin audit trail review:
    Configuration changes in the past month: [N]
    Each change: [date, what changed, who made the change]
    Unexpected changes: None / [description and investigation]

  Log deletion events: [N — expected: 0, unless automated retention lifecycle]
  Log integrity: [All hashes verified / Issues found]

  SIEM immutability status: Confirmed active / Degraded / Failed

Section 5 — SIEM admin role holders
  Current SIEM administrators: [Name 1 — CISO] / [Name 2 — IT Manager]
  Change from prior month: None / [description of any change]
  (Maximum 2 admin role holders — any change requires CISO approval)

Section 6 — NTP synchronisation status
  SIEM server clock offset from NTP: [X ms] — [Pass if <100ms]
  Any clock sync failures reported by SIEM during month: None / [description]
  AD/DC NTP offset sample:
    [DC-01]: [X ms] — Pass/Fail
    [DC-02]: [X ms] — Pass/Fail

Section 7 — Log integrity verification
  Monthly hash verification of archived log segments:
    Sample period: [month N-12]
    Hashes verified: [N segments]
    Hash match: Yes — all segments verified / No — [describe failure]
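
Where archived segments are accessible as files alongside a manifest of expected hashes, the verification can be sketched as below. SHA-256 and the manifest format (segment file name mapped to hex digest) are assumptions; the actual integrity mechanism depends on the deployed SIEM product.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_segments(manifest: dict[str, str], directory: Path) -> dict[str, bool]:
    """Check each archived segment against its expected hash.

    Returns segment name -> True if the file exists and its hash matches.
    A missing file counts as a failure.
    """
    results = {}
    for name, expected in manifest.items():
        path = directory / name
        results[name] = path.is_file() and sha256_file(path) == expected.lower()
    return results
```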

Section 8 — Sign-off
  IT Manager review: _________________ Date: _________
  CISO review: _________________ Date: _________

  File at: EV-F · Continuous Monitoring → SIEM Health → [YYYY-MM]

OP-04 · Infrastructure Change Management

Purpose and scope

This procedure implements the change management process for all CUI-scope infrastructure changes. It implements NIST 800-171 controls 3.4.3 (track, review, approve, and log changes), 3.4.4 (security impact analysis), and 3.4.5 (access restrictions for change). It aligns with ISO 27001 Annex A 8.32 (change management).

Evidence generated: EV-D21 (change management records).

Change categories — decision guide

Before creating any RFC, determine the correct change category:

STANDARD (pre-approved — no CAB review required):
  Pre-defined, low-risk, well-understood changes with documented 
  implementation steps. The change type itself has been approved by the 
  CAB — individual instances do not need separate review.

  Examples:
    OS patching within the defined SLA window via MDM/WSUS
    Certificate renewal via ACME automation (no manual steps)
    LAPS password rotation (fully automated)
    Helpdesk password resets for standard user accounts
    Adding a user to an approved, pre-defined access group

  Standard change library: maintained in ITSM as a catalogue
  If a proposed change is similar to but not exactly a standard change type:
    → Use Normal category. Do not force-fit into Standard.

  Evidence: ITSM ticket is created automatically for tracking; 
            no RFC number needed; logged in EV-D21 as standard change

NORMAL (individual CAB review — 48-hour minimum review period):
  Changes that require individual assessment because their specific 
  characteristics, scope, or risk cannot be fully pre-defined.

  Examples:
    New GPO setting or modification to existing GPO on CUI-scope OU
    New MDM configuration profile deployment
    Firewall rule addition or modification
    New software deployment to CUI-scope endpoints
    Database schema change on CUI-scope database
    Network segmentation change
    New API integration or external system connection
    Baseline update following CIS Benchmark revision

  CAB composition: IT Manager (chair) + CISO + Network Engineer + 
                   rotating business representative
  CAB meeting: weekly (Wednesday 10:00) or on-demand for urgent items
  Approval requirement: CAB chair sign-off minimum; CISO for CUI-impacting changes
  Post-implementation review: required within 48 hours

MAJOR (CAB + CISO + senior management notification):
  Changes with potential to materially affect the security posture, 
  the CUI system boundary, or the SSP description.

  Examples:
    New platform added to CUI scope (new OS type, new cloud service)
    Identity provider migration or significant configuration change
    Firewall architecture change (new DMZ zone, perimeter redesign)
    CUI data flow change (new source or destination of CUI)
    Change that requires SSP system boundary update
    Decommissioning a CUI-scope system and redistributing its function

  Approval: CAB + CISO + CEO/COO notification
  Review period: 5 business days minimum
  SSP update: required within 30 days if SSP boundary or control description changes
  Rollback plan: mandatory; must be tested in staging before production

EMERGENCY (immediate; document within 24 hours):
  Changes required to prevent or respond to an active security incident 
  or critical service outage where the risk of delay exceeds the risk 
  of bypassing the normal approval process.

  Trigger criteria (at least one must apply):
    Active security incident requiring immediate containment
    CUI-scope service outage affecting contract delivery obligations
    Active ransomware spread that change will contain

  Approval: verbal approval from the on-call IT Manager and CISO
  Implementation: proceed immediately; document the change and the verbal 
    approval within 24 hours
  Retrospective review: mandatory at next CAB meeting
  Evidence: emergency change record in EV-D21 within 24 hours of implementation
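
The review periods in the decision guide can be turned into an earliest permitted implementation time. This is a simplified sketch: it treats the 48-hour Normal review as elapsed hours, counts Monday to Friday as business days, and ignores public holidays and the CAB meeting calendar, all of which are assumptions. The function names are illustrative.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance `start` by `days` business days (Monday to Friday)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:   # Monday=0 .. Friday=4
            remaining -= 1
    return current


def earliest_implementation(category: str, submitted: datetime) -> datetime:
    """Return the earliest time a change may be implemented after submission."""
    category = category.upper()
    if category in ("STANDARD", "EMERGENCY"):
        return submitted                       # pre-approved or immediate
    if category == "NORMAL":
        return submitted + timedelta(hours=48)  # 48-hour minimum review
    if category == "MAJOR":
        return add_business_days(submitted, 5)  # 5 business days minimum
    raise ValueError(f"unknown change category: {category}")
```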

RFC creation and content requirements

All Normal and Major changes require a Request for Change (RFC) 
created in the ITSM platform before implementation.

RFC mandatory fields:

1. RFC title
   Short descriptive title: "[System] — [Type of change] — [Date]"
   Example: "PRODDB01 — SQL Server 2022 Security Patch — 2024-03-15"

2. Change description
   What is being changed, specifically:
     - Which system(s)
     - What component (OS / application / configuration / network)
     - What is the current state
     - What will be the new state after the change

3. Business justification
   Why is this change needed now?
   What is the risk of not making this change?

4. Security Impact Assessment (SIA)
   Complete the SIA template embedded in the RFC form.
   For configuration changes, include:

   a) CUI scope impact: Does this change affect a system that stores, 
      processes, or transmits CUI? If Yes: which CUI categories?

   b) Security controls affected: Which NIST 800-171 controls does 
      this change affect (add, remove, or modify)?
      Reference the specific AT-[family] page if relevant.

   c) New attack surface: What new connectivity, service, account, 
      or data flow does this change introduce?

   d) Failure risk: What happens if the change fails mid-implementation?
      Is partial completion more dangerous than either original state 
      or target state?

   e) Dependencies: What upstream and downstream systems could be 
      affected by this change?

   f) Testing: Has this been tested in a non-production environment?
      If Yes: describe test environment and outcomes.
      If No: justify why testing is not feasible.

   g) Post-change monitoring: What should be monitored in the 24 hours 
      after implementation to detect adverse effects?

5. Implementation plan
   Step-by-step implementation instructions specific enough that a 
   different IT Operations engineer could execute the change:
   - Commands to run
   - Configuration values to set
   - Verification steps after each significant step
   - Expected output at each step

6. Rollback plan
   How will the change be reversed if it fails or causes adverse effects?
   - Rollback trigger criteria (what conditions warrant rollback?)
   - Rollback steps (specific commands/actions)
   - Rollback verification (how do you confirm rollback was successful?)
   - Rollback decision maker (who authorises rollback?)

   For Major changes: rollback must be tested in staging before production

7. Maintenance window
   When will the change be implemented?
   What is the maximum duration? (if exceeded → rollback trigger)
   Is a service outage expected? If yes: who has been notified?

8. Post-implementation review
   When will the post-implementation review occur? (within 48 hours for Normal)
   What will be checked? (link to verification steps from implementation plan)
   Who will conduct it?
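
A completeness check on the eight mandatory fields could be run before an RFC is submitted to the CAB. The sketch below uses a hypothetical RFC data class; the field names are illustrative and do not reflect the ITSM platform's actual schema.

```python
from dataclasses import dataclass

MANDATORY_FIELDS = (
    "title",
    "change_description",
    "business_justification",
    "security_impact_assessment",
    "implementation_plan",
    "rollback_plan",
    "maintenance_window",
    "post_implementation_review",
)


@dataclass
class RFC:
    title: str = ""
    change_description: str = ""
    business_justification: str = ""
    security_impact_assessment: str = ""
    implementation_plan: str = ""
    rollback_plan: str = ""
    maintenance_window: str = ""
    post_implementation_review: str = ""
    category: str = "Normal"
    rollback_tested_in_staging: bool = False


def missing_fields(rfc: RFC) -> list[str]:
    """Return the names of mandatory fields that are empty or unmet."""
    problems = [name for name in MANDATORY_FIELDS
                if not getattr(rfc, name).strip()]
    # Major changes additionally require a rollback tested in staging.
    if rfc.category == "Major" and not rfc.rollback_tested_in_staging:
        problems.append("rollback_tested_in_staging")
    return problems
```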

CAB meeting procedure

CAB meetings: every Wednesday 10:00–11:00 (or on-demand for urgent items)
Chair: IT Manager
Attendees: CISO, Network Engineer, rotating business representative,
           change requestors for items on the agenda

Agenda for each CAB meeting:

1. Review of emergency changes since last CAB (5 minutes)
   - Any emergency changes implemented since last meeting
   - Confirm retrospective documentation is complete
   - Assess whether emergency change revealed a process gap

2. Post-implementation reviews from previous week (10 minutes)
   - Any Normal changes implemented last week
   - Were there unexpected effects?
   - Is there a follow-up ITSM ticket if issues were found?

3. Upcoming changes — review and approval (main agenda item):
   For each RFC submitted for this CAB:
     Requestor presents: what, why, SIA summary, rollback plan
     CAB questions and challenge
     Decision: Approved / Approved with conditions / Deferred / Rejected
     Conditions (if applicable): record specific conditions in RFC

4. Forward look — Major changes planned (5 minutes)
   Awareness item for upcoming Major changes requiring 5-day review

CAB minutes:
   Documented in ITSM as a linked note to each RFC reviewed
   Separately filed as a CAB meeting record in EV-D21
   Minutes must show: attendees, each RFC reviewed, decision, conditions

Change implementation and evidence (generates EV-D21)

For every Normal or Major change:

Pre-implementation checklist (complete within 1 hour before start):
  [ ] RFC status: Approved (not Draft or Under Review)
  [ ] Maintenance window confirmed with all affected stakeholders
  [ ] Rollback plan reviewed and rollback steps accessible
  [ ] Backup/snapshot of affected system taken (where technically feasible):
      VM: snapshot via hypervisor before change
      Config file: git commit or manual backup of current config
      Database schema: schema export before change
  [ ] SIEM alert enhancement: notify SIEM team to monitor for 
      change-related anomalies during and after implementation window

During implementation:
  Log each step as it is executed in the RFC implementation log field
  Record: timestamp, step taken, observed result, pass/fail
  If unexpected result: pause, assess, consult rollback criteria

  Do NOT proceed through unexpected results without assessing whether 
  rollback should be triggered. The definition of rollback trigger 
  criteria in the RFC is there specifically for this moment.

Post-implementation verification:
  Execute the verification steps from the RFC
  For each verification check:
    Expected result: [from RFC]
    Actual result: [observed]
    Match: Yes / No

  If any verification check fails:
    Assess: is this a critical failure (rollback) or a known side-effect (accept)?
    If rollback: execute rollback plan; document what failed and why
    If accept: document why the failure is acceptable and any follow-up needed

Complete the EV-D21 change record:

  EV-D21 Change Management Record — RFC-[YYYY-NNN]

  RFC title: [from RFC]
  Change category: Standard / Normal / Major / Emergency
  Requestor: [name]
  Implemented by: [engineer name]

  CAB approval: [approver names and date] / Standard change (pre-approved) / 
                Emergency (verbal approval: [name + date])

  Maintenance window: [start date/time] to [end date/time]
  Actual implementation: [start] to [end]

  SIA reference: [SIA completed: Yes / Not required for Standard]
  CUI scope affected: Yes / No — [if Yes: which CUI categories]

  Implementation outcome:
    [ ] Completed as planned
    [ ] Completed with minor deviations (document below)
    [ ] Partially completed — follow-up required (document below)
    [ ] Rolled back (document below)

  Deviations/issues: [description or "None"]

  Rollback executed: Yes / No
  If Yes: rollback trigger, rollback steps taken, result

  Post-implementation verification:
    [Verification check 1]: Expected [X] / Actual [X] — Pass/Fail
    [Verification check 2]: Expected [X] / Actual [X] — Pass/Fail
    [...]

  Follow-up ITSM tickets raised: [list or "None"]

  SSP update required: Yes (within 30 days) / No

  IT Manager review: _________________ Date: _________

  File at: EV-D · Config Management → Change Log → [YYYY]

OP-05 · BCM and DR Testing Procedures

Purpose and scope

This procedure governs the testing of the organisation's Business Continuity and Disaster Recovery capabilities. It implements ISO 27001 Annex A 5.29 (information security during disruption), Annex A 5.30 (ICT readiness for business continuity), and NIST 800-171 3.6.3 (test the organisational incident response capability in the context of DR-scale incidents).

Evidence generated: BCM exercise records, DR test records (filed under EV-A and linked from AT-IR EV-D15).

BCM/DR test programme — annual schedule

The following tests are conducted each year. The programme escalates 
in complexity from desk-based to full technical exercise:

Q1 — BCP tabletop exercise
  Format: facilitated discussion (2 hours)
  Scenario: complete loss of primary office (fire / flood / denial of access)
  Participants: all department heads, CISO, IT Manager, HR Manager
  Focus: manual procedures, communication cascade, staff welfare, 
         decision-making without IT systems for first 4 hours

Q2 — IT failover test (technical — partial)
  Format: planned, controlled failover test during a maintenance window
  Scenario: primary server room is unavailable — activate cloud DR environment
  Participants: IT Operations team only (business not involved)
  Focus: can we fail over to cloud infrastructure, and what RTO is actually achieved?
  Systems tested: DNS failover, identity platform (Entra ID — already cloud),
                  file server failover, email continuity

Q3 — Full DR exercise (technical + business)
  Format: half-day exercise (business working from DR environment)
  Scenario: primary infrastructure unavailable — staff working on DR systems
  Participants: IT Operations + representative staff from each department
  Focus: can the business function on DR infrastructure, and 
         what breaks that we did not expect?

Q4 — Cyber incident tabletop (combined IR + BCM)
  Format: facilitated scenario exercise (3 hours)
  Scenario: ransomware attack affecting CUI-scope systems
  Participants: full IRT + senior management (Executive Sponsor invited)
  Focus: incident response + business continuity simultaneously; 
         external reporting decisions under time pressure
  This exercise also satisfies AT-IR 3.6.3 (test IR capability)

Q1 — BCP tabletop exercise procedure

Pre-exercise preparation (CISO — 2 weeks before):

  Scenario selection:
    Select a realistic scenario relevant to the organisation's location and risk profile
    Common scenarios:
      Fire in building — staff cannot access premises for 5 days
      Major flood — building inaccessible for 2 weeks; some equipment damaged
      Extended power failure — 48-hour outage in the building and surrounding area
      Key person unavailability — CISO and IT Manager both unavailable for 10 days
      Pandemic or public health emergency — all staff working remotely for 30 days

    The scenario should NOT be the same as the previous year's tabletop

  Inject preparation:
    Prepare 8–10 "injects" (new developments that emerge during the exercise)
    Example injects for office fire scenario:
      T+30 min: Fire service confirms building inaccessible for at least 72 hours
      T+2 hours: IT Manager calls to say their laptop was left in the building
      T+4 hours: Customer calls asking for update on a time-sensitive contract deliverable
      T+24 hours: Insurance company asks for a list of affected assets
      T+48 hours: Contract authority (MOD or US government) asks for a formal situation report
      T+72 hours: Fire service extends access denial — now estimating 7 days minimum

    Injects should force decisions, not just provide information

  Participant briefing (1 week before):
    Send pre-reading: current BCP summary (key decisions, contact lists, priorities)
    Confirm attendance — all department heads required
    Set expectations: this is a learning exercise, not a test of performance

Exercise facilitation:

  Introduction (10 minutes):
    CISO explains the exercise format
    Ground rules:
      Treat it as real — make the decisions you would actually make
      No mobiles / email during exercise (forces reliance on BCP procedures)
      All decisions documented by designated scribe
    Set the scene: read the initial scenario brief

  Exercise phases (90 minutes):
    Phase 1 — Initial response (T+0 to T+4 hours simulated time):
      How do we account for all staff?
      How do we communicate with staff who don't know what happened?
      Who contacts the customer? What do we tell them?
      Where do people work? Do we have remote working capability for everyone?

    Phase 2 — Short-term continuity (T+4 hours to T+48 hours):
      What systems do we absolutely need? In what order?
      Are all staff able to work remotely with current equipment?
      How do we handle staff without company laptops?
      What are our contractual obligations we must meet in the next 48 hours?

    Phase 3 — Extended disruption (T+48 hours to T+7 days):
      Do we need temporary premises? 
      Do we need to notify regulators or contracting authorities?
      What does week 2 look like if the building remains inaccessible?
      How do we handle payroll if it falls due during the disruption?

  Inject delivery: CISO introduces each inject at appropriate times

  Debrief (20 minutes):
    What went well?
    What decisions were hardest to make and why?
    What information did you need that you could not quickly find?
    What BCP documentation changes would have helped?
    What actions should we take before the next exercise?
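
The simulated-time phases and inject timings used during facilitation can be sketched as a small lookup, which may help the facilitator decide in which phase each inject is delivered. This is a minimal illustration only: the function names are hypothetical, and the treatment of boundary times (an inject at exactly T+4 hours falling into Phase 2) is an assumption that the procedure does not specify.

```python
from datetime import timedelta

# Simulated-time phase boundaries from the exercise phases above.
# Assumption: each phase is half-open [start, end), so an inject timed at
# exactly T+4 hours is delivered in Phase 2, not Phase 1.
PHASES = [
    ("Phase 1", timedelta(hours=0), timedelta(hours=4)),
    ("Phase 2", timedelta(hours=4), timedelta(hours=48)),
    ("Phase 3", timedelta(hours=48), timedelta(days=7)),
]


def phase_for(offset):
    """Return the phase name that a simulated-time offset falls into."""
    for name, start, end in PHASES:
        if start <= offset < end:
            return name
    raise ValueError(f"offset {offset} is outside the exercise timeline")


def plan_injects(injects):
    """Sort (offset, text) pairs by simulated time and label each with its phase."""
    return [(phase_for(offset), offset, text) for offset, text in sorted(injects)]


# Example injects from the office fire scenario (text abbreviated).
injects = [
    (timedelta(hours=72), "Fire service extends access denial"),
    (timedelta(minutes=30), "Building inaccessible for at least 72 hours"),
    (timedelta(hours=4), "Customer asks for update on deliverable"),
    (timedelta(hours=24), "Insurer asks for list of affected assets"),
]

for phase, offset, text in plan_injects(injects):
    print(f"{phase}  T+{offset}  {text}")
```

Sorting the injects before the exercise makes it easier to check that each phase has at least one inject that forces a decision.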

Post-exercise (within 10 business days):
  CISO produces exercise report:
    Scenario description and objectives
    Participant list
    Timeline of decisions made during exercise
    Findings — strengths: what worked
    Findings — gaps: what did not work or was unclear
    Action items: specific, owned, dated

  Action items entered in EV-A04 (corrective action register)
  BCP document updated where gaps were identified
  Exercise report filed and linked from AT-IR EV-D15

Q2 — IT failover test procedure

Pre-test planning

Test objective: verify that CUI-scope systems can fail over to the 
cloud DR environment and that recovery time meets the RTO target.

RTO targets (from BCP):
  Identity / authentication (Entra ID): already cloud — continuous availability
  Email (Exchange Online): already cloud — continuous availability
  CUI file server: RTO target [4 hours] — time from declared DR to 
                   staff able to access CUI files via DR environment
  CUI database: RTO target [8 hours]
  Development environment: RTO target [24 hours] (not critical path)

DR environment description:
  Cloud subscription: [DR Azure subscription / AWS account — specify]
  This is a separate subscription from production
  Pre-provisioned with: [list pre-provisioned resources — VMs, storage, networking]
  DNS: [failover DNS configuration — describe how DNS is switched]
  VPN: [failover VPN endpoint — pre-configured or needs activation?]

Test scope:
  Systems tested this quarter: CUI file server + DNS failover
  Systems excluded from test: database (tested in Q3 full exercise)

Maintenance window:
  Duration: [4 hours minimum — allow for test + rollback]
  Schedule: Saturday morning 06:00–10:00 (minimise business impact)
  Notification: IT Manager notifies CISO and department heads 1 week before

  Users affected: all staff who access CUI files
  User communication: "Planned IT maintenance Saturday 06:00–10:00 — 
                       CUI file server unavailable during this period"

Failover test — step by step

PRE-TEST (Friday evening before test, T-1 day):
  [ ] Confirm last backup to cloud (offsite Copy 3) completed successfully
  [ ] Confirm DR environment is powered on and reachable:
      ping [DR-jumphost-IP] from IT Operations workstation
  [ ] Confirm DR environment DNS records are pre-configured:
      nslookup [fileserver.internal] [DR-DNS-IP] → should resolve to DR IP
  [ ] Confirm test accounts have access to DR environment:
      Test login to DR jump host via PAM
  [ ] Send test start notification to CISO

TEST EXECUTION (06:00 Saturday):

  T+00:00 — Initiate failover
    Step 1: Update DNS to point fileserver.internal to DR IP
      [DNS change commands or console steps — specify for deployed DNS]
    Step 2: Verify DNS propagation:
      On 3 different clients: nslookup fileserver.internal
      Expected: DR IP address returned
      Time for propagation to complete: [expected time — depends on TTL]

  T+00:30 — Verify file access via DR environment
    Log into DR environment via PAM
    Confirm: CUI file share is accessible
    Confirm: file listing matches last known production state
    Confirm: a test file can be read successfully
    Confirm: a test file can be written (if write access is required in DR)
    Record: time from DNS change to confirmed file access = [actual RTO]

  T+01:00 — Stress test (optional — if time permits)
    Have 3 test users simultaneously access files via DR environment
    Confirm performance is acceptable (not necessarily production-equivalent)

  T+01:30 — Initiate rollback to production
    Step 1: Revert DNS to production IP
    Step 2: Verify DNS propagation back to production:
      nslookup fileserver.internal → production IP returned
    Step 3: Verify production file server is accessible:
      Test file read from production server
    Step 4: Check whether any data written in the DR environment needs 
            to be synced back to production (nothing to sync if DR was 
            read-only during the test; if write access was tested, 
            sync those changes back)

  T+02:00 — Test complete
    Verify: all users pointing to production server
    Verify: DR environment can be powered down / returned to standby
    Notify: CISO that the test is complete, with the outcome
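
The DNS verification used in both the failover and the rollback (nslookup on 3 different clients) amounts to checking that every client now receives the expected address. A minimal sketch of that check follows. The client names and IP addresses are hypothetical placeholders, and `resolve_locally` uses the operating system resolver through `socket.gethostbyname`, which may consult local caches or a hosts file and so is not a strict equivalent of `nslookup` against a named DNS server.

```python
import socket


def resolve_locally(hostname):
    """Resolve a hostname using the resolver of the machine running this script."""
    return socket.gethostbyname(hostname)


def clients_not_switched(answers, expected_ip):
    """Return the clients whose DNS answer is not yet the expected IP.

    `answers` maps a client name to the IP address that client resolved
    for the file server name.
    """
    return sorted(client for client, ip in answers.items() if ip != expected_ip)


# Hypothetical answers collected from three clients after the DNS change.
answers = {
    "client-a": "10.20.0.15",  # hypothetical DR IP
    "client-b": "10.10.0.15",  # hypothetical production IP (not yet switched)
    "client-c": "10.20.0.15",
}
print(clients_not_switched(answers, expected_ip="10.20.0.15"))  # ['client-b']
```

An empty list means propagation is complete for the sampled clients. For the rollback, the same check is run with the production IP as the expected value.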

POST-TEST (within 48 hours):
  Document test record:
    Test date and time
    Systems tested
    Actual RTO achieved vs target: [X minutes] vs [target minutes]
    Verification steps and results (pass/fail for each)
    Issues encountered
    Rollback success: Yes / No
    Recommendations: what could improve RTO? what was harder than expected?

  Action items → EV-A04
  Test record filed under BCM exercise records
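
The test record fields above can be captured in a small structure that derives the achieved recovery time and the list of failed verification steps. This is a sketch under stated assumptions: the class name, field names, timestamps, and step labels are illustrative, and the 240-minute target simply mirrors the bracketed [4 hours] placeholder, which must be replaced with the value from the BCP.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FailoverTestRecord:
    systems_tested: list
    failover_started: datetime   # T+00:00, DNS change initiated
    access_confirmed: datetime   # file access confirmed via DR environment
    rto_target_minutes: int      # placeholder; take the real value from the BCP
    rollback_success: bool
    steps: dict = field(default_factory=dict)  # step description -> passed?

    @property
    def actual_minutes(self):
        """Minutes from failover start to confirmed file access."""
        delta = self.access_confirmed - self.failover_started
        return delta.total_seconds() / 60

    @property
    def rto_met(self):
        return self.actual_minutes <= self.rto_target_minutes

    @property
    def failed_steps(self):
        return [name for name, passed in self.steps.items() if not passed]


# Illustrative values only.
record = FailoverTestRecord(
    systems_tested=["CUI file server", "DNS failover"],
    failover_started=datetime(2025, 3, 1, 6, 0),
    access_confirmed=datetime(2025, 3, 1, 6, 27),
    rto_target_minutes=240,
    rollback_success=True,
    steps={
        "CUI file share accessible": True,
        "file listing matches production": True,
        "test file read": True,
        "test file written": False,
    },
)
print(record.actual_minutes, record.rto_met, record.failed_steps)
```

Each entry in `failed_steps` would normally become an issue in the test record and, where follow-up is needed, an action item in EV-A04.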

Q3 — Full DR exercise procedure

Exercise design

The full DR exercise involves actual staff working on DR infrastructure 
for a defined period (target: 2 hours of productive work on DR systems).

This is more than an IT test — it validates that the business can actually 
function, not just that IT systems are reachable.

Participants:
  IT Operations: full team — manages the failover
  Business participants: 2–3 representatives from each department
    (select staff who regularly use CUI-scope systems)
  Observers: CISO, IT Manager, HR Manager

Scope of systems in DR during exercise:
  CUI file server: in DR — staff access files from DR environment
  Email: not failed over (already cloud — continuous)
  Intranet / Confluence: [failed over or continuous — specify]
  CRM / business application: [specify if in DR scope]
  VPN: DR VPN endpoint active — staff connect via DR VPN

Exercise scenario:
  "Primary data centre is unavailable due to a power infrastructure failure.
   Recovery time for primary site is estimated at 6 hours.
   We are now operating on our DR environment.
   Your task for the next 2 hours is to continue your normal work 
   activities using the DR environment."

This is intentionally mundane. The goal is to discover what breaks 
during normal work on DR systems, not to test crisis response.

Exercise execution

T-2 weeks: notify participants; provide pre-reading on DR access procedures
T-1 week: confirm DR environment is in expected state
T-1 day: confirm all participants have tested DR access credentials 
          (if using DR-specific credentials rather than production SSO)

T+00:00 — CISO declares exercise start
  IT Operations initiates failover:
    DNS failover to DR (same procedure as Q2 test)
    VPN failover to DR endpoint
    Any application-level failover steps

  IT Operations monitors for failover completion:
    Confirm all target systems accessible in DR
    Record: time from declaration to DR ready = [actual RTO]

T+00:30 — Business participants begin work
  Staff connect via DR VPN
  Staff access CUI file server via DR environment
  Staff conduct normal work activities

  IT Operations monitors:
    Support tickets / instant messages from participants (expect some!)
    SIEM monitoring (DR environment should still forward logs)
    DR system performance

T+00:30 to T+02:30 — Active exercise
  CISO and IT Manager observe and document:
    Which functions work without issues?
    Which functions are degraded?
    Which functions are completely unavailable?
    What workarounds are staff using (and are those workarounds acceptable)?
    What questions are staff asking that indicate the DR procedures are unclear?

T+02:30 — CISO declares exercise end; initiate failback to production
  IT Operations conducts planned failback:
    Sync any data created in DR environment back to production
    Revert DNS to production
    Verify production systems are accessible
    Confirm DR environment returned to standby
  Record: time from declaration to production restored

T+03:00 — Hot debrief (30 minutes with all participants)
  Questions:
    What did you try to do that did not work?
    What was your biggest frustration when working in the DR environment?
    What would have helped you work more effectively?
    Were the DR access instructions clear?
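
The T+ offsets in the execution timeline can be converted into a wall-clock run sheet once the start time is known. The sketch below assumes a hypothetical 09:00 start. The milestone labels paraphrase the timeline above, and the function name is illustrative.

```python
from datetime import datetime, timedelta

# Milestones and offsets taken from the exercise execution timeline above.
MILESTONES = [
    ("Exercise start declared; failover initiated", timedelta(0)),
    ("Business participants begin work", timedelta(minutes=30)),
    ("Exercise end declared; failback initiated", timedelta(hours=2, minutes=30)),
    ("Hot debrief", timedelta(hours=3)),
]


def run_sheet(start):
    """Return one 'HH:MM  label' line per milestone for a given start time."""
    return [f"{(start + offset):%H:%M}  {label}" for label, offset in MILESTONES]


# Hypothetical start time; the procedure does not fix one for the Q3 exercise.
for line in run_sheet(datetime(2025, 9, 16, 9, 0)):
    print(line)
```

With a 09:00 start, participants begin work at 09:30, failback starts at 11:30, and the hot debrief begins at 12:00, which fits the half-day format.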

Exercise report (within 10 business days):
  Objective vs achievement: [target RTO: X] vs [actual: Y]
  Functions fully available in DR: [list]
  Functions degraded in DR: [list with specific issues]
  Functions unavailable in DR: [list — each needs a plan or a risk acceptance]
  Staff feedback summary: [key themes]
  Findings and action items: [specific improvements with owners and dates]
  DR documentation gaps identified: [list]

  Action items → EV-A04
  DR runbook updated where gaps identified
  Report filed and presented at next management review
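
The report's rule that every function unavailable in DR needs either a plan or a risk acceptance can be checked mechanically before the report is filed. The following is a minimal sketch: the function names, status labels, and example entries are hypothetical and are not taken from an actual exercise.

```python
from dataclasses import dataclass
from typing import Optional

STATUSES = ("available", "degraded", "unavailable")


@dataclass
class FunctionResult:
    name: str
    status: str                        # one of STATUSES
    issue: str = ""                    # specific issue, for degraded functions
    plan: Optional[str] = None         # remediation plan, if any
    risk_acceptance: Optional[str] = None  # reference to an accepted risk, if any


def group_by_status(results):
    """Group function names under the three report headings."""
    groups = {status: [] for status in STATUSES}
    for result in results:
        groups[result.status].append(result.name)
    return groups


def missing_disposition(results):
    """Names of unavailable functions with neither a plan nor a risk acceptance."""
    return [
        r.name
        for r in results
        if r.status == "unavailable" and not (r.plan or r.risk_acceptance)
    ]


# Hypothetical observations from an exercise.
results = [
    FunctionResult("CUI file access", "available"),
    FunctionResult("Printing", "degraded", issue="no DR print server"),
    FunctionResult("CRM", "unavailable", plan="add CRM to DR scope"),
    FunctionResult("Legacy reporting tool", "unavailable"),
]
print(group_by_status(results))
print(missing_disposition(results))  # ['Legacy reporting tool']
```

Any name returned by `missing_disposition` indicates a report entry that still needs an owner decision before the report goes to management review.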

Q4 — Combined cyber incident and BCM tabletop

This exercise satisfies both AT-IR 3.6.3 (test IR capability) and the 
Q4 slot of the BCM/DR test programme. See AT-IR EV-D15 for the exercise 
record template.

Additional BCM-specific injects for the Q4 exercise:

Alongside the cyber incident scenario (ransomware / significant breach),
include injects that force BCM decisions:

  T+6 hours: "The ransomware has encrypted the primary file server. 
               IT has confirmed that backup is intact but restoration 
               will take 12–16 hours. Should we fail over to DR 
               in the interim?"

  T+8 hours: "Two engineers are ill — we have reduced IT capacity 
               at exactly the moment we need maximum capacity. 
               Who can we call in? What is our minimum viable IT team?"

  T+12 hours: "The contracting authority (MOD / US DoD) is asking 
                whether our ability to fulfil contract obligations 
                is affected. What do we tell them?"

  T+24 hours: "The incident has been contained but systems are not 
                yet restored. Tomorrow is the payroll processing date. 
                The payroll system is in the affected environment. 
                What is our contingency?"

These injects force the IRT to think about business continuity 
simultaneously with incident response — which is the reality of a 
major cyber incident.

BCM-specific findings from Q4 exercise are documented in the exercise 
report alongside IR-specific findings. BCM action items → EV-A04.
BCP document updated based on findings.

Post-exercise follow-up — closing the loop

Every BCM/DR exercise must produce improvement, not just documentation.
The following loop ensures exercises drive actual change:

Within 24 hours of exercise:
  Hot debrief notes captured (can be rough notes)
  Critical findings escalated immediately (don't wait for the formal report)

Within 10 business days of exercise:
  Formal exercise report produced by CISO
  Action items entered in EV-A04 with named owners and due dates
  Report presented at next available management meeting

Within 30 days of exercise:
  BCP/DR documentation updated for any gaps identified
  Training updates for any procedure that was poorly followed
  Technical improvements scheduled (as Normal changes via RFC)

Before next exercise of the same type (12 months):
  All action items from the previous exercise should be closed or 
  have a documented, CISO-approved extended timeline
  If the same gap appears in two consecutive exercises, it is 
  escalated to the CISO as a persistent compliance risk
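
The follow-up deadlines and the repeat-gap rule above can be sketched as two small helpers. This is an illustration under stated assumptions: business days are taken to be Monday to Friday with public holidays ignored, the function names are hypothetical, and the gap identifiers are invented labels that assume findings are recorded with consistent names from one exercise to the next.

```python
from datetime import date, datetime, timedelta


def add_business_days(start, days):
    """Advance `days` working days (Mon-Fri) from `start`; holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def follow_up_deadlines(exercise_end):
    """Deadlines for the three follow-up stages, from the exercise end time."""
    day = exercise_end.date()
    return {
        "hot debrief notes captured": exercise_end + timedelta(hours=24),
        "formal report and EV-A04 entries": add_business_days(day, 10),
        "documentation and training updates": day + timedelta(days=30),
    }


def recurring_gaps(previous, current):
    """Gaps present in two consecutive exercises of the same type."""
    return previous & current


deadlines = follow_up_deadlines(datetime(2025, 3, 4, 12, 0))
for stage, due in deadlines.items():
    print(f"{stage}: {due}")

# Hypothetical gap labels from two consecutive Q1 tabletops.
print(recurring_gaps({"contact-list-stale", "no-remote-laptops"},
                     {"contact-list-stale", "payroll-contingency"}))
```

Any gap returned by `recurring_gaps` is a candidate for escalation as a persistent compliance risk.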

Evidence filing:
  Exercise reports: EV-D · BCM → Exercises → [YYYY] → [Q1/Q2/Q3/Q4]
  Action items: EV-A · Management System → Corrective Actions (EV-A04)
  DR test technical records: EV-D · BCM → DR Tests → [YYYY-Q#]
  Q4 exercise record: also filed as EV-D15 (AT-IR evidence)
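
The filing locations above follow a regular pattern, so they can be generated from the exercise date to keep naming consistent. The sketch below assumes that quarters are calendar quarters (January to March is Q1). If the programme runs on a different financial year, the `quarter` function would need adjusting. The function names are hypothetical.

```python
from datetime import date


def quarter(d):
    """Calendar quarter (1-4) for a date. Assumption: Q1 is January to March."""
    return (d.month - 1) // 3 + 1


def exercise_report_path(d):
    """Filing location for an exercise report, following the pattern above."""
    return f"EV-D · BCM → Exercises → {d.year} → Q{quarter(d)}"


def dr_test_record_path(d):
    """Filing location for a DR test technical record."""
    return f"EV-D · BCM → DR Tests → {d.year}-Q{quarter(d)}"


test_date = date(2025, 5, 17)
print(exercise_report_path(test_date))
print(dr_test_record_path(test_date))
```

The Q4 exercise record is additionally filed as EV-D15, which this sketch does not cover.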

Evidence summary — all IT Operations procedures

Evidence ID	Procedure	What it is	Frequency	Owner	Location
EV-D27	OP-01 Backup	Daily backup job completion log	Daily	IT Operations	EV-D → BCM → Backup Logs
EV-D28	OP-01 Backup	Quarterly restoration test record	Quarterly	IT Manager	EV-D → BCM → Restoration Tests → [YYYY-QQ]
EV-D30	OP-02 Certs	Certificate and key inventory	Monthly review	IT Operations	EV-D → Cryptography → Certificate Inventory
EV-F01	OP-03 SIEM	Monthly SIEM log review	Monthly	Security Analyst	EV-F → Continuous Monitoring → Log Reviews
EV-F06	OP-03 SIEM	Monthly SIEM health report	Monthly	IT Manager	EV-F → Continuous Monitoring → SIEM Health
EV-D21	OP-04 Change	Change management records (RFCs)	Per change	IT Manager	EV-D → Config Management → Change Log
BCM exercise records	OP-05 BCM	Quarterly BCM/DR exercise reports	Quarterly	CISO	EV-D → BCM → Exercises → [YYYY]
EV-D15	OP-05 BCM	Q4 combined exercise (also AT-IR evidence)	Annual	CISO	EV-D → Incident Response → Exercise Records

IT Operations Procedures Library — Owner: IT Manager. Evidence cross-references: AT-MP (backup encryption), AT-SC-ENC (certificate requirements), AT-AU (SIEM), AT-CM (change management), AT-IR (BCM/DR testing). Questions: IT Manager or CISO.