How do I mitigate the fallout of business downtime caused by a wrongly applied security patch that resulted from inconsistent terminology?

Mar 28, 2025 - 19:39

I am a manager in the cybersecurity division where I work. I have been in management for about 6 years.

Recently, our business suffered unexpected downtime due to a misapplied security patch that crashed some systems. The patch was misapplied because of inconsistent and loose terminology: the same component was being referred to by different but common colloquial terms. Details:

  • This event happened in a virtualized environment with multiple physical and virtual machines.

  • A certain patch to remediate a vulnerability (CVE) needed to be installed.

  • The original instructions said the patch was to be installed "on individual hosts", meaning the hypervisors and physical machines where the VMs live, not "host" in the generic sense of a computer. Our instructions never used the word "guest", only "VM".

  • The security engineer installing the patch was unsure what "hosts" meant here: a loose reference to computers and VMs as a whole, or specifically the physical machines hosting the VMs.

  • As PROD work can only be done at night, no additional support was available. The next day, internal customers were upset that critical systems had crashed.

It's common where I work to refer to a computer by words like "host", "instance", and "machine". In a non-virtualized environment, a question such as "can you ping the host to check uptime" is easy to understand, but when language is loose in a virtualized environment, confusion arises.

In the past, we had a near miss in which an engineer said "patch to be deployed to test environment" without saying whether that meant QA, UAT, STAGING, etc.

Questions

  1. How should I respond to business partners to explain the RCA of the incident, namely inconsistent terminology?

  2. How can I coach people on consistent usage and a common understanding of terms without being heavy-handed or too prescriptive?