A metric is a quantitative measurement that can be interpreted in the context of a series of previous or equivalent measurements. Metrics are necessary to show how security activity contributes directly to security goals, to measure how changes in a process contribute to security goals, to detect significant anomalies in processes, and to inform decisions to fix or improve processes. Good management metrics are said to be S.M.A.R.T:
- Specific: The metric is relevant to the process being measured.
- Measurable: Metric measurement is feasible with reasonable cost.
- Actionable: It is possible to act on the process to improve the metric.
- Relevant: Improvements in the metric meaningfully enhance the contribution of the process towards the goals of the management system.
- Timely: The metric measurement is fast enough to be used effectively.
Metrics are fully defined by the following items:
- Name of the metric;
- Description of what is measured;
- How the metric is measured;
- How often the measurement is taken;
- How the thresholds are calculated;
- Range of values considered normal for the metric;
- Best possible value of the metric;
- Units of measurement.
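The items above can be captured in a single record per metric. The following is a minimal sketch, not a prescribed format: the class name, the field names and the example values (including the normal range) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MetricDefinition:
    """One record per metric; field names are illustrative, not from the text."""
    name: str
    description: str                    # what is measured
    method: str                         # how the metric is measured
    frequency: str                      # how often the measurement is taken
    threshold_method: str               # how the thresholds are calculated
    normal_range: Tuple[float, float]   # range of values considered normal
    best_value: float                   # best possible value
    unit: str                           # units of measurement

    def is_normal(self, value: float) -> bool:
        """True when a reading falls inside the normal range."""
        low, high = self.normal_range
        return low <= value <= high


# Hypothetical example; every value below is made up.
viruses_cleaned = MetricDefinition(
    name="Viruses cleaned",
    description="Number of viruses cleaned by the antivirus process",
    method="Count of 'cleaned' events in the antivirus console log",
    frequency="Weekly, Sunday 9:00pm",
    threshold_method="Mean plus/minus two standard deviations of past readings",
    normal_range=(0.0, 60.0),
    best_value=0.0,
    unit="viruses per week",
)

print(viruses_cleaned.is_normal(40))
print(viruses_cleaned.is_normal(500))
```

Keeping the definition next to the readings makes it harder to quote a value without its unit, period and thresholds.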
Security Metrics are difficult to come by
Unfortunately, it is not easy to find metrics for security goals like security, trust and confidence. The main reason is that security goals are “negative deliverables”. The absence of incidents for an extended period of time leads us to think that we are safe. If you live in a town where neither you nor anyone you know has ever been robbed, you feel safe. Incidents prevented can’t be measured in the same way as a positive deliverable, like the temperature of a room.
Metrics for goals are not just difficult to find; they are not very useful for security management. The reason for this is the indirect relationship between security activity and security goals. Intuitively, most managers think that there is a direct link between what we do (which produces results or outputs) and what we want to achieve (the most important things: our goals). This belief is supported by real-life experiences like making a sandwich. You buy the ingredients, go home, arrange them, perhaps toast them, and voilà: a warm sandwich ready to eat. The output sought (the sandwich) and the goal (eating a home-made sandwich) match beautifully.
Unfortunately, the link is not always direct. A good example is research. There is no direct relationship between the goals (discoveries) and the activity (experiments, publication). You can run hundreds of experiments and still not discover a cure for cancer. The same thing happens with security: the goals (trust, confidence, security) and the activity (controls, processes) are not directly linked.
When there is a direct link between activity and goal, like the temperature in a pot and the heat applied to that pot, we know what decision to take if we want the temperature to drop: stop applying heat. But how will we make a network safer: by adding filtering rules (more accurate filtering) or by summarising them (less complexity)? We don’t know. If a process produces dropped packets, more or fewer dropped packets won’t necessarily make the network more or less secure, just as a change in the firewall rules won’t necessarily make the network safer or otherwise.
The disconnect present in information security between goals and activity prevents goal metrics from being useful for management, as you can never tell if you are closer to your goals because of decisions recently taken on the security processes.
Goal metric examples:
- Instances of secret information disclosed per year. What can you do to prevent people with legitimate access from disclosing that information?
- Use of system by unauthorised users per month. What can you do to prevent people from letting other users use their accounts?
- Customer reports of misuse of personal data to the Data Protection Agency. Even if you are compliant, what can you do to prevent a customer from filing a report?
- Risk reduction per year of 10%. As risk depends on internal and external factors, what can you do to actually modify risk?
- Prevent 99% of incidents. How do you know how many incidents didn’t happen?
Actually useful security metrics
If metrics for goals are difficult to get and are not very useful, what is a security manager to do? Measuring process outputs can be the answer. Measuring outputs is not only possible but very useful, as outputs contribute directly or indirectly to achieving security, trust and confidence. Using output metrics you can:
- Measure how changes in a process contribute to outputs;
- Detect significant anomalies in processes;
- Inform decisions to fix or improve the process.
There are seven basic types of process output metrics:
- Activity: The number of outputs produced in a time period;
- Scope: The proportion of the environment or system that is protected by the process. For example, AV could be installed in only 50% of user PCs;
- Update: The time since the last update or refresh of process outputs.
- Availability: The time since a process has performed as expected upon demand (up time), the frequency and duration of interruptions, and the time interval between interruptions.
- Efficiency / Return on security investment (ROSI): Ratio of losses averted to the cost of the investment in the process. This metric measures the success of a process in comparison to the resources used.
- Efficacy / Benchmark: Ratio of outputs produced in comparison to the theoretical maximum. Measuring efficacy of a process implies the comparison against a baseline.
- Load: Ratio of available resources in actual use, like CPU load, repositories capacity, bandwidth, licenses and overtime hours per employee.
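Several of these metric types are simple ratios or time differences. The sketch below shows how four of them might be computed; the function names, dates and input figures are assumptions chosen for illustration, apart from the 50% antivirus coverage, which echoes the Scope example above.

```python
from datetime import date


def scope(protected: int, total: int) -> float:
    """Scope: proportion of the environment protected by the process."""
    return protected / total


def update_age_days(last_update: date, today: date) -> int:
    """Update: time (in days) since the last refresh of process outputs."""
    return (today - last_update).days


def efficacy(outputs: float, theoretical_max: float) -> float:
    """Efficacy: outputs produced relative to the theoretical maximum."""
    return outputs / theoretical_max


def load(in_use: float, available: float) -> float:
    """Load: ratio of available resources in actual use."""
    return in_use / available


# Hypothetical readings.
av_scope = scope(protected=50, total=100)                          # AV on half of the PCs
patch_age = update_age_days(date(2024, 1, 1), date(2024, 1, 31))   # days since last patch
firewall_load = load(in_use=600, available=800)                    # e.g. Mbit/s of bandwidth

print(f"scope={av_scope:.0%} update age={patch_age}d load={firewall_load:.0%}")
```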
Examples of use of these metrics:
- Activity: Measuring the number of new user accounts created per week, a sudden drop could lead to detecting that the new administrator is lazy, or that users have started sharing user accounts, so they are not requesting them any more.
- Scope: In an organization with a large number of third party connections, measuring the number of connections with third parties protected by a firewall could lead to a management decision not to create more unprotected connections.
- Update: Measuring the update level of the servers in a DMZ could lead to investigating the root cause if the time since the last update goes above a certain threshold.
- Availability: Measuring the availability of a customer service portal could lead to rethinking the High Availability Architecture used.
- Efficiency / Return on security investment (ROSI): Measuring the cost per seat of the Single Sign On systems of two companies being merged could lead to choosing one system over the other.
- Efficacy / Benchmark: Measuring the backup speed of two different backup systems could lead to choosing one over the other.
- Load: Measuring and projecting the peak load of a firewall could lead to taking the decision to upgrade pre-emptively.
There is an important issue to tackle when using output metrics, which I call the Comfort Zone. When there are too many false positives, the metric is quickly dismissed, as it is not possible to investigate every single warning. On the other hand, when the metric never triggers a warning, there is a feeling that the metric is not working or providing value. The Comfort Zone (not too many false positives, pseudo-periodic warnings) can be achieved using an old tool from Quality Management, the control chart. There are some rules used in Quality Management to tell a warning (a condition that should be investigated) from normal statistical variation (the Western Electric, Donald J. Wheeler's and Nelson rules), but for security management the best practice is to adjust the multiple of the standard deviation that defines the range of normal values for the metric until we achieve the Comfort Zone: pseudo-periodic warnings without too many false positives.
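Adjusting the multiple of the standard deviation can be sketched as a small search. This is one possible reading of the idea, under stated assumptions: the candidate multipliers, the weekly readings and the rule "at least one warning, but no more than the team can investigate" are all hypothetical choices, not part of the original method.

```python
from statistics import mean, stdev


def count_warnings(history, k):
    """Number of readings outside mean +/- k sample standard deviations."""
    centre = mean(history)
    spread = stdev(history)
    return sum(1 for x in history if abs(x - centre) > k * spread)


def tune_multiplier(history, max_warnings, candidates=(1.0, 1.5, 2.0, 2.5, 3.0)):
    """Return the smallest candidate k that yields between 1 and max_warnings
    warnings over the history, or None if no candidate fits."""
    for k in candidates:
        if 1 <= count_warnings(history, k) <= max_warnings:
            return k
    return None


# Hypothetical weekly readings with one spike and one sudden drop.
weekly = [12, 15, 11, 14, 13, 40, 12, 16, 13, 14, 2, 15]

for k in (1.0, 2.0, 3.0):
    print(k, count_warnings(weekly, k))

print(tune_multiplier(weekly, max_warnings=1))
```

With this data a multiplier of 1.0 flags both the spike and the drop, while 3.0 flags nothing at all, which is the "metric never warns" end of the scale.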
Using Security Management Metrics
There are five steps in the use of metrics: measurement, representation, interpretation, investigation and diagnosis.
Measurement: The measurement of the current value of the metric is periodic and normally refers to a window, for example: “9:00pm Sunday reading of the number of viruses cleaned in the week since the last reading”. Measurements from different sources and different periods need to be normalized before integration in a single metric.
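As an illustration of normalization, readings taken over windows of different lengths can be converted to a common rate before they are combined. The sketch below assumes that a per-week rate is the common unit and that readings from different sources are simply added; the source names and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    source: str         # where the reading comes from (hypothetical names below)
    count: int          # events counted in the window
    window_days: float  # length of the measurement window in days


def per_week(reading: Reading) -> float:
    """Scale a count taken over any window to a per-week rate."""
    return reading.count * 7 / reading.window_days


def combined_weekly_rate(readings) -> float:
    """Integrate normalized readings into a single metric by summing them."""
    return sum(per_week(r) for r in readings)


readings = [
    Reading("gateway antivirus", count=70, window_days=7),  # weekly reading
    Reading("desktop antivirus", count=10, window_days=1),  # daily reading
]

print(combined_weekly_rate(readings))
```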
Interpretation: The meaning of a measured value is evaluated comparing the value of a measurement with a threshold, a comparable measurement, or a target. Normal values (those within thresholds) are estimated from historic or comparable data. The results of interpretation are:
- Anomaly: When the measurement is beyond acceptable thresholds.
- Success: When the measurement compares favourably with the target.
- Trend: General direction of successive measurements relative to the target.
- Benchmark: Relative position of the measurement or the trend with peers.
Incidents or poor performance take process metrics outside normal thresholds. Shewhart-Deming control charts are useful to indicate whether the metric value is within the normal range, as values within the arithmetic mean plus or minus twice the standard deviation make up about 95.4% of the values of a normally distributed population. Fluctuations within the “normal” range would not normally be investigated.
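The coverage figure for a normal distribution can be checked numerically. For a normally distributed population, the fraction of values within k standard deviations of the mean is erf(k / sqrt(2)); the short sketch below evaluates it for a few multiples (the function name is an assumption).

```python
import math


def coverage(k: float) -> float:
    """Fraction of a normal population within mean +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"mean +/- {k} sigma covers {coverage(k):.2%}")
```

This only holds if the metric is roughly normally distributed; counts with many zeros or heavy tails will trigger warnings more or less often than these percentages suggest.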
Investigation: The investigation of abnormal measurements ideally ends with the identification of either a common cause (for example, changes in the environment or results of management decisions) or a special cause (error, attack, accident) for the current value of the metric.
Representation: Proper visualisation of the metric is key for reliable interpretation. Metrics representation will vary depending on the type of comparison and distribution of a resource. Bar charts, pie charts and line charts are most commonly used. Colours may help to highlight the meaning of a metric, such as the green-amber-red (equivalent to on-track, at risk and alert) traffic-light scale. Units, the period represented, and the period used to calculate the thresholds must always be given for the metric to be clearly understood. Rolling averages may be used to help identify trends.
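Two of the representation aids mentioned above, rolling averages and the traffic-light scale, are easy to sketch. The window size, the thresholds and the assumption that higher values are worse are illustrative choices only.

```python
def rolling_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def traffic_light(value, amber_at, red_at):
    """Map a value to green/amber/red, assuming higher values are worse."""
    if value >= red_at:
        return "red"      # alert
    if value >= amber_at:
        return "amber"    # at risk
    return "green"        # on track


# Hypothetical weekly readings, smoothed over three weeks.
weekly = [1, 2, 3, 4, 5]
smoothed = rolling_average(weekly, window=3)
print(smoothed)
print([traffic_light(v, amber_at=3, red_at=4) for v in smoothed])
```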
Diagnosis: Managers should use the results of the previous steps to diagnose the situation, analyse alternatives and their consequences and make business decisions.
- Fault in Plan-Do-Check-Act cycle leading to repetitive failures in a process -> Fix the process.
- Weakness resulting from lack of transparency, partitioning, supervision, rotation or separation of responsibilities (TPSRSR) -> Fix the assignment of responsibilities.
- Technology failure to perform as expected -> Change / adapt technology.
- Inadequate resources -> Increase resources or adjust security targets.
- Security target too high -> Revise the security target if the effect on the business would be acceptable.
- Incompetence, dereliction of duty -> Take disciplinary action.
- Inadequate training -> Institute immediate and/or long-term training of personnel.
- Change in the environment -> Make improvements to adapt the process to the new conditions.
- Previous management decision -> Check if the results of the decision were sought or unintended.
- Error -> Fix the cause of the error.
- Attack -> Evaluate whether the protection against the attack can be improved.
- Accident -> Evaluate whether the protection against the accident can be improved.
What management practices become possible?
A side effect of an Information Security Management System (ISMS) lacking useful security metrics is that security management becomes centred on activities like Risk Assessment and Audit. Risk Assessment considers assets, threats, vulnerabilities and impacts to get a picture of security and prioritise design and improvements, while Audit checks the compliance of the actual information security management system with the documented management system, with an externally defined management system, or with an external regulation. Risk Assessment and Audit are valuable, but there are more useful security management activities, like monitoring, testing, design & improvement and optimisation, that become possible with output metrics. These activities can be described as follows:
- Monitor: Use metrics to watch process outputs, detect abnormal conditions and assess the effect of changes in the process.
- Test: Check if inputs to the process produce the expected outputs.
- Improvement: Make changes in the process to make it more suitable for its purpose, or to reduce its use of resources.
- Planning: Organise and forecast the amount, assignment and milestones of tasks, resources, budget, deliverables and performance of a process.
- Assessment: Determine how well the process matches the organisation's needs and compliance goals expressed as security objectives; how changes in the environment or management decisions change the quality, performance and use of resources of the process; whether bottlenecks or single points of failure exist; where the points of diminishing returns lie; how processes benchmark against other process instances and other organisations; and what the trends in quality, performance and efficiency are.
- Benefits realisation: Show how achieving security objectives contributes to achieving business objectives, measure the value of the process for the organisation, or justify the use of resources.
While audits can be performed without metrics, monitoring, testing, planning, improvement and benefits realisation are not feasible without them.
What needs to be done?
S.M.A.R.T security managers need metrics that actually help them perform management activities.
While it is not necessary to drop goal metrics altogether, the day-to-day focus of information security management should be on security monitoring, testing, design & improvement and optimization using output metrics. These are the metrics that show what the effect of management decisions is, whether things are getting worse or better, whether processes work as designed, and whether changes outside our direct control are causing abnormal conditions in security processes. All these activities are perfectly feasible using output metrics and control charts.
The information security industry recognizes both the necessity and the difficulty of carrying out a quantitative evaluation of ROSI, return on security investment.
The main reason for investing in security measures is to avoid the cost of accidents, errors and attacks. Direct costs of an incident may include lost revenues, damages and property loss, or direct economic loss. The total cost can be considered to be the direct cost plus the cost of restoring the system to its original state before the incident. Some incidents can cost information, fines, or even human lives.
The indirect cost of an incident may include damage to a company’s public image, loss of client and shareholder confidence, cash-flow problems, breaches of contract and other legal liabilities, failure to meet social and moral obligations, and other costs.
What do we know intuitively about the risk and cost of security measures? First, the relationship between the factors that affect risk - such as window of opportunity, value of the asset and its value to the attacker, combined assets, number of incidents and their cost, etc. - is quite complex. We also know that when measures are implemented to reduce risk, the ease of using and managing systems also decreases, generating an indirect cost of the security measures.
How do we go from this intuitive understanding to quantitative information? There is some accumulated knowledge of the relationship between investment in security measures and their results. First, there is the Mayfield paradox, according to which the cost of both universal access to a system and absolutely restricted access is infinite, with more acceptable costs corresponding to the intermediate cases.
An empirical study was also done by the CERT at Carnegie Mellon University, which states that the greater the expenditure on security measures, the smaller the effect of the measures on security. This means that after a reasonable investment has been made in security measures, doubling security spending will not make the system twice as secure.
The study that is most easily found on the Internet on this subject cites the formulas created during the implementation of an intrusion detection system by a team from the University of Idaho.
E: prevented losses
T: total cost of security measures
ROSI = R - ALE; since this requires ALE = R - E + T, it follows that ROSI = E - T
The problem with this formula is that E is merely an estimate, and even more so if the measure involved is an IDS, which simply collects information on intrusions, so there is no cause-effect relationship between detecting an intrusion and preventing an incident. Dressing up this type of estimate in mathematical formulas is like combining magic with physics.
What problems do we face in calculating return on investment of security measures? The most important is the lack of concrete data, followed closely by a series of commonly accepted suppositions and half-truths, such as that risk always decreases as investment increases, and that the return on the investment is positive for all levels of investment.
Nobody invests in security measures to make money; they invest in them because they have no choice. The return on investment is calculated to show that a security investment is worthwhile, to select the best security measures for a given budget, and to determine whether the budget allocated to security is sufficient to fulfill the business objectives, not to demonstrate that companies make money from the investment.
In general, and also from the point of view of return on investment, there are two types of security measures: measures to reduce vulnerability and measures to reduce impact.
- Measures that reduce vulnerability barely reduce the impact when an incident does occur. These measures protect against a narrow range of threats. They are normally known as Preventive Measures. Some of these measures are firewalls, padlocks, and access control measures. One example of the narrowness of the protection range is the use of firewalls, which protect against access to unauthorized ports and addresses, but not against the spread of worms or spam.
- Measures that reduce impact do very little to reduce vulnerability; they limit the damage if an incident does occur. These measures protect against a broad range of threats and are commonly known as Corrective Measures. Examples of these measures include RAID disks, backup copies, and redundant communication links. One example of the breadth of protection is the use of backups, which do not prevent incidents, but do protect against actual information loss in the case of all types of physical and logical failures.
The profitability of both types of measures is different, as the rest of the article will show.
Preventive or Vulnerability-Reduction Measures
A reduction in vulnerability translates into a reduction in the number of incidents. Security measures that reduce vulnerability are therefore profitable when they prevent incidents for a value that is higher than the total cost of the measure during that investment period.
The following formula can be used:
ROSI = CTprevented / TCP
CT = Cost of Threat = Number of Incidents * Per Incident Cost.
TCP = Total Cost of Protection
When ROSI > 1, the security measure is profitable.
Several approximations can be used to calculate the prevented cost. One takes the prevented cost to be the difference between the cost of the threat in a period of time before and in an equivalent period after the implementation of the security measure.
CTprevented = ( CTbefore – CTafter)
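The formulas above translate directly into a few lines of code. This is a minimal sketch of that calculation; the function names and the incident figures in the example are hypothetical.

```python
def cost_of_threat(incidents: float, cost_per_incident: float) -> float:
    """CT = number of incidents * per-incident cost."""
    return incidents * cost_per_incident


def rosi_preventive(ct_before: float, ct_after: float, tcp: float) -> float:
    """ROSI = CTprevented / TCP, with CTprevented = CTbefore - CTafter."""
    return (ct_before - ct_after) / tcp


# Hypothetical figures: 10 incidents at 1000 each before the measure,
# 4 incidents after it, and a total cost of protection of 3000.
before = cost_of_threat(10, 1000)
after = cost_of_threat(4, 1000)
result = rosi_preventive(before, after, tcp=3000)

print(result, "profitable" if result > 1 else "not profitable")
```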
Calculating the cost of the threat as the number of incidents multiplied by the cost of each incident is an alternative to the traditional calculation of the incident probability multiplied by the incident cost, provided that the number of incidents in the investment period is more than 1. To calculate a probability mathematically, the number of favorable cases and the number of possible cases must be known. Organizations rarely have information on the possible cases of incidents, only on the “favorable” ones (those that actually occurred). It is impossible to calculate the probability without this information. However, it is relatively simple to determine the number of incidents that occur within a period of time and their cost.
For a known probability to be predictive, it is also necessary to have a large enough number of cases, and conditions must also remain the same. Taking into account the complexity of the behavior of attackers and the organizations that use information systems, it would be foolish to assume that conditions will remain constant. Calculating the cost of a threat using probability information is therefore unreliable in real conditions.
One significant advantage of calculating the cost of a threat as the product of the number of incidents and their unit cost is that this combines the cost of the incidents, the probability, and the total assets (since the number of incidents partly depends on the quantity of the total assets) into a single formula. To make a profitability calculation like this, real information on the incidents and their cost is required, and gathering this information generates an indirect cost of an organization’s security management. If this information is not available, the cost of the threats will have to be estimated to calculate the ROSI, but the value of the calculation result will be low as the estimate can always be changed to generate any desired result.
The profitability of a vulnerability reduction measure depends on the environment. For example, in an environment in which many incidents occur, a security measure will be more profitable than in the case of another environment in which they do not occur. While using a personal firewall on a PC connected to the Internet twenty-four hours a day may be profitable, using one on a private network not connected to the Internet would not. Investing in a reinforced door would be profitable in many regions of Colombia, but in certain rural areas of Canada, this investment would be a waste of money.
Sample profitability calculation:
- Two laptops out of a total of 50 are stolen in a year.
- The replacement cost of a laptop is 1800 euros.
- The following year, the company has 75 laptops.
- The laptops are protected with 60€ locks.
- The following year only one laptop is stolen.
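ROSI = ( CTbefore – ( CTafter + TCP ) ) / TCP
ROSI = ( ( 1800+Vi )*3 - ( ( 1800+Vi )*1 + 75*60 ) ) / ( 75*60 )
(Vi is taken here to be the value of the information on a laptop. The number of incidents before the measure is adjusted for the increase in the number of targets: 2 thefts among 50 laptops correspond to 3 among 75. Note that in this example the cost of the locks is also subtracted in the numerator, so ROSI > 1 means that the net saving exceeds the cost of the locks.)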
If the information on a laptop was worth nothing (Vi=0), the security measure would not be profitable (ROSI < 1). In this example, the 60€ locks are profitable when the information on a laptop is worth more than 2700€, or when, based on historical information, the locks can be expected to prevent the theft of more than 5 laptops in the year in question.
Using this type of analysis, we could:
- Use locks only on laptops with valuable information.
- Calculate the maximum price of locks for all laptops (24€ when Vi=0).
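The laptop example can be reproduced with a short script. The sketch follows the sample formula as written, in which the cost of the locks is subtracted in the numerator as well as used in the denominator; the names are assumptions, and `info_value` stands for Vi, read here as the value of the information on a laptop.

```python
LAPTOPS = 75
LOCK_PRICE = 60.0
REPLACEMENT = 1800.0
THEFTS_BEFORE = 2 * LAPTOPS / 50   # 2 thefts among 50 laptops, scaled to 75
THEFTS_AFTER = 1


def rosi(info_value: float = 0.0, lock_price: float = LOCK_PRICE) -> float:
    """ROSI of the locks, following the sample formula term by term."""
    tcp = LAPTOPS * lock_price
    loss = REPLACEMENT + info_value   # loss per stolen laptop
    return (loss * THEFTS_BEFORE - (loss * THEFTS_AFTER + tcp)) / tcp


def break_even_info_value() -> float:
    """Value of Vi at which ROSI is exactly 1 with 60-euro locks."""
    prevented = THEFTS_BEFORE - THEFTS_AFTER
    return 2 * LAPTOPS * LOCK_PRICE / prevented - REPLACEMENT


def max_lock_price(info_value: float = 0.0) -> float:
    """Highest lock price at which ROSI still reaches 1."""
    prevented = THEFTS_BEFORE - THEFTS_AFTER
    return (REPLACEMENT + info_value) * prevented / (2 * LAPTOPS)


print(rosi())                    # Vi = 0: below 1, not profitable
print(break_even_info_value())   # Vi at which the locks break even
print(max_lock_price())          # maximum lock price when Vi = 0
```

Under this formula the break-even point is reached when the information on each laptop is worth 2700€, or when the locks cost 24€ each and the information is worth nothing.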
Corrective or Impact-Reduction Measures
Since impact-reduction measures do not prevent incidents, the previous calculation cannot be applied. In the best case scenario, these measures are never used, while when there are two incidents which could result in the destruction of the protected assets, they are apparently worth twice the value of the assets. Now then, who would spend twice the value of an asset on security measures? Profitability of corrective measures cannot be measured. These measures are like insurance policies; they put a limit on the maximum loss suffered in the case of an incident.
What is important in the case of impact-reduction measures is the protection that you get for your money. The effectiveness of this protection can be measured, for example depending on the recovery time after an incident. Depending on their effectiveness, there are measures that range from backup copies (with some added cost) to fully redundant systems (which cost more than double).
One interesting alternative to calculating the ROSI of a specific security measure is to measure the ROSI of a set of measures (including detection, prevention, and impact reduction) that protect an asset. In this case, the total cost of protection (TCP) is calculated as the sum of the cost of all of the security measures, while the effort to obtain the information on the cost of the threats is practically identical.
Budget, cost, and selection of measures
The security budget should be at most equal to the annual loss expectancy (ALE) caused by attacks, errors, and accidents in information systems for a tax year. Otherwise, the measures are guaranteed not to be profitable. The graph below shows the expected losses as the area under the curve. To clarify the graph, it represents a company with enormous expected losses, of almost 25% of the value of the company. In the case of an actual company, legibility of the graph could be improved using logarithmic scales.
An evaluation of the cost of a security measure must take into account both the direct costs of the hardware, software, and implementation, as well as the indirect costs, which could include control of the measure by evaluating incidents, ethical hacking (attack simulation), audits, incident simulation, forensic analysis, and code audits.
Security measures are often chosen based on fear, uncertainty and doubt, or out of paranoia, to keep up with trends, or simply at random. However, the calculation of the profitability of security measures can help to select the best measures for a particular budget. Part of the budget must be allocated to the protection of critical assets using impact-reduction measures, and part to the protection of all of the assets using vulnerability-reduction measures and incident and intrusion detection measures.
The main conclusions that can be drawn from all of this are that:
- To guarantee maximum effectiveness of an investment, it is necessary, and possible if the supporting data is available, to calculate the return on the investment of vulnerability-reduction measures.
- In order to make real calculations, real information is needed regarding the cost of the incidents for a company or in comparable companies in the same sector.
- Both incidents and security measures have indirect and direct costs that have to be taken into account when calculating profitability.