Blog: Accuracy of AI system outputs and performance measures
This article was first published on the Information Commissioner’s Office blog site
Accuracy is one of the key principles of data protection. It requires organisations to take all reasonable steps to make sure the personal data they process is not “incorrect or misleading as to any matter of fact” and, where necessary, is corrected or deleted without undue delay.
Accuracy is especially important when organisations use AI to process personal data and profile individuals. If AI systems use or generate inaccurate personal data, this may lead to the incorrect or unjust treatment of a data subject.
Discussions about accuracy in an AI system often focus on the accuracy of input data, i.e. the personal data of a specific individual used by an AI system to make decisions or predictions about that individual.
However, it is important to understand that accuracy requirements also apply to AI outputs, both in terms of the accuracy of decisions or predictions about a specific person and across a wider population.
In this blog, we take a closer look at what accuracy of AI outputs means in practice and why selecting the appropriate accuracy performance measures is critical, in order to ensure compliance and to protect data subjects.
Accuracy of AI outputs
If the output of an AI system is personal data, any inaccuracies as to “any matter of fact” can be challenged by data subjects. For example, if a marketing AI application predicted a particular individual was a parent when they in fact have no children, its output would be inaccurate as to a matter of fact. The individual concerned would have the right to ask the controller to rectify the AI output, under article 16 of the General Data Protection Regulation (GDPR).
Often, however, AI outputs may generate personal data where there is currently no matter of fact. For example, an AI system could predict that someone is likely to become a parent in the next three years. This kind of prediction cannot be accurate or inaccurate in relation to ‘any matter of fact’. However, the AI system may be more or less accurate as a matter of statistics, measured in terms of how many of its predictions turn out to be correct for the population they are applied to over time. The European Data Protection Board’s (EDPB) guidance says that in these cases individuals still have the right to challenge the accuracy of predictions made about them, on the basis of the input data and/or the model(s) used. The GDPR also provides a right for the data subject to complement such personal data with additional information.
In addition, accuracy requirements are more stringent in the case of solely automated AI systems, if the AI outputs have a legal or similar effect on data subjects (article 22 of the GDPR). In such cases, recital 71 of the GDPR states that organisations should put in place “appropriate mathematical and statistical procedures” for the profiling of data subjects as part of their technical measures. They should ensure any factor that may result in inaccuracies in personal data is corrected and the risk of errors is minimised.
While it is not the role of the ICO to determine the way AI systems should be built, it is our role to understand how accurate they are and the impact on data subjects. Organisations should therefore understand and adopt appropriate accuracy measures when building and deploying AI systems, as these measures will have important data protection implications.
Accuracy as a performance measure: the impact on data protection compliance
Statistical accuracy is about how closely an AI system’s predictions match the truth. For example, if an AI system is used to classify emails as spam, a simple measure of accuracy would be the number of emails that were correctly classified (as spam or as genuine) as a proportion of all the emails that were analysed.
However, such a measure could be misleading. For instance, if 90% of all emails are spam, then you could create a 90% accurate classifier by simply labelling everything as spam. For this reason, alternative measures are usually used to assess how good a system is, which reflect the balance between two different kinds of errors:
- A false positive or ‘type I’ error: these are cases that the AI system incorrectly labels as positive (e.g. emails classified as spam, when they are genuine)
- A false negative or ‘type II’ error: these are cases that the AI system incorrectly labels as negative when they are actually positive (e.g. emails classified as genuine, when they are actually spam).
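The "label everything as spam" example can be checked with a few lines of Python. This is only an illustrative sketch: the 100 emails and the variable names are made up for the purpose, and no real classifier is involved.

```python
# Hypothetical toy data: 100 emails, 90 spam (True) and 10 genuine (False).
actual = [True] * 90 + [False] * 10

# A "classifier" that simply labels every email as spam.
predicted = [True] * len(actual)

# Simple accuracy: correct labels (spam or genuine) over all emails.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# False positives (type I): genuine emails labelled as spam.
false_positives = sum(p and not a for a, p in zip(actual, predicted))
# False negatives (type II): spam emails labelled as genuine.
false_negatives = sum(a and not p for a, p in zip(actual, predicted))

print(f"accuracy: {accuracy:.0%}")
print(f"false positives: {false_positives}")
print(f"false negatives: {false_negatives}")
```

The headline accuracy looks respectable, yet every genuine email in the toy data set is misclassified, which is why the two error types need to be looked at separately.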
The balance between these two types of errors can be captured through various measures, including:
- Precision: the percentage of cases identified as positive that are in fact positive (also called ‘positive predictive value’). For instance, if 9 out of 10 emails that are classified as spam are actually spam, the precision of the AI system is 90%.
- Recall (or sensitivity): the percentage of all cases that are in fact positive that are identified as such. For instance, if 10 out of 100 emails are actually spam, but the AI system only identifies seven of them, then its recall is 70%.
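As a rough illustration, the two measures can be written as small Python functions and applied to the figures used in the examples above. The function names are our own and are not taken from any particular library.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of cases labelled positive that really are positive."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of truly positive cases that were labelled positive."""
    return true_positives / (true_positives + false_negatives)


# 10 emails were classified as spam; 9 of them really are spam.
print(f"precision: {precision(9, 1):.0%}")

# 10 emails really are spam; the system identified 7 and missed 3.
print(f"recall: {recall(7, 3):.0%}")
```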
There are trade-offs between precision and recall. If you place more importance on finding as many of the positive cases as possible (maximising recall), this may come at the cost of some false positives (lowering precision).
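One way to see this trade-off is to vary the score threshold above which an email is labelled as spam. The sketch below assumes a model that outputs a spam score between 0 and 1; the scores, labels and thresholds are invented for illustration.

```python
# Hypothetical model scores (higher = more likely spam) paired with the
# true label (True = spam). These values are invented for illustration.
scored_emails = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True), (0.10, False),
]


def precision_recall(threshold: float) -> tuple[float, float]:
    """Precision and recall when scores >= threshold are labelled spam.

    Assumes at least one email is labelled spam at the given threshold.
    """
    tp = sum(1 for s, spam in scored_emails if s >= threshold and spam)
    fp = sum(1 for s, spam in scored_emails if s >= threshold and not spam)
    fn = sum(1 for s, spam in scored_emails if s < threshold and spam)
    return tp / (tp + fp), tp / (tp + fn)


# Lowering the threshold labels more emails as spam.
for threshold in (0.75, 0.5, 0.2):
    p, r = precision_recall(threshold)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data, each step down in the threshold catches more of the real spam (higher recall) while also flagging more genuine emails (lower precision).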
In addition, there may be important differences between the consequences of false positives and false negatives for data subjects. For example, if a CV filtering system selecting qualified candidates for an interview produces a false positive, then an unqualified candidate will be invited to interview, unnecessarily taking up both the employer’s and the applicant’s time. If it produces a false negative, a qualified candidate will miss an employment opportunity and the organisation will miss a good candidate. Organisations may therefore wish to prioritise avoiding certain kinds of error based on the severity and nature of the risks.
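One simple way to reflect this in an evaluation is to weight each kind of error by how harmful it is judged to be. The sketch below is purely illustrative: the error counts and the weights are hypothetical, and choosing real weights is a judgement for the organisation, not something the code can settle.

```python
def weighted_error_cost(false_positives: int, false_negatives: int,
                        cost_fp: float, cost_fn: float) -> float:
    """Total cost of errors, with each error type given its own weight."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Hypothetical weights: wrongly rejecting a qualified candidate is judged
# five times as harmful as an unnecessary interview.
COST_FP = 1.0
COST_FN = 5.0

# Hypothetical error counts for two candidate CV filters on the same test set.
cost_a = weighted_error_cost(false_positives=20, false_negatives=5,
                             cost_fp=COST_FP, cost_fn=COST_FN)
cost_b = weighted_error_cost(false_positives=8, false_negatives=12,
                             cost_fp=COST_FP, cost_fn=COST_FN)

print(f"filter A: 25 errors, weighted cost {cost_a}")
print(f"filter B: 20 errors, weighted cost {cost_b}")
```

With these assumed weights, the filter that makes fewer errors overall comes out worse, because more of its errors are of the kind judged more harmful.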
In general, accuracy as a measure depends on it being possible to compare the performance of a system’s outputs to some “ground truth”, i.e. checking the results of the AI system against the real world. For instance, a medical diagnostic tool designed to detect malignant tumours could be evaluated against high-quality test data containing known patient outcomes. In some other areas, a ground truth may be unattainable. This could be because no high-quality test data exists, or because what you are trying to predict or classify is subjective (e.g. offence) or socially constructed (e.g. gender).
Similarly, in many cases AI outputs will be more like an opinion than a matter of fact, so accuracy may not be the right way to assess the acceptability of an AI decision. In addition, since accuracy is only relative to test data, if the latter isn’t representative of the population you will be using your system on, then not only may the outputs be inaccurate, but they may also lead to bias and discrimination. These will be the subject of future blogs, where we will explore how, in such cases, organisations may need to consider other principles like fairness and the impact on fundamental rights, instead of (or as well as) accuracy.
Finally, accuracy is not a static measure. While it is usually measured on static test data, in real-life situations systems will be applied to new and changing populations. A system that is accurate with respect to an existing population (e.g. customers in the last year) may not continue to perform well if the characteristics of the future population change. People’s behaviours may change, either of their own accord or because they are adapting in response to the system, and the AI system may therefore become less accurate over time. This phenomenon is referred to in machine learning as ‘concept drift’, and various methods exist for detecting it. For instance, you can measure the estimated distance between classification errors over time; errors that start to occur closer together may indicate that performance is degrading.
Further explanation of concept drift can be found on the Cornell University website.
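The idea of watching the distance between classification errors can be sketched in Python as follows. This is a deliberately simplified illustration rather than an established drift-detection algorithm: the function names, the `ratio` threshold and the two outcome streams are all assumptions made for the example.

```python
from statistics import mean


def error_distances(outcomes: list[bool]) -> list[int]:
    """Number of predictions between consecutive errors.

    `outcomes` holds True for a correct prediction and False for an error,
    in the order the predictions were made.
    """
    error_positions = [i for i, correct in enumerate(outcomes) if not correct]
    return [b - a for a, b in zip(error_positions, error_positions[1:])]


def drift_suspected(reference: list[bool], recent: list[bool],
                    ratio: float = 0.5) -> bool:
    """Flag possible drift when recent errors are much closer together.

    `ratio` is an assumed, tunable threshold: drift is flagged when the mean
    distance between recent errors falls below `ratio` times the reference.
    """
    ref = error_distances(reference)
    rec = error_distances(recent)
    if not ref or not rec:
        return False  # too few errors in one stream to compare
    return mean(rec) < ratio * mean(ref)


# Hypothetical streams: one error in every 10 predictions at validation time,
# then one error in every 3 predictions after deployment.
reference = [i % 10 != 9 for i in range(100)]
recent = [i % 3 != 2 for i in range(99)]

print(drift_suspected(reference, recent))
```

In practice the comparison would need enough errors in each window to be meaningful, and a flag of this kind is a prompt for investigation rather than proof that the population has changed.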
What should organisations do?
Organisations should always think carefully from the start whether it is appropriate to automate any prediction or decision making process. This should include assessing if acceptable levels of accuracy can be achieved.
If an AI system is intended to complement, or replace, human decision-making then any assessment should compare human and algorithmic accuracy to understand the relative advantages, if any, various AI systems might bring. Any potential accuracy risk should be considered and addressed as part of any Data Protection Impact Assessment.
While accuracy is just one of multiple considerations when determining whether and how to adopt AI, it should be a key element of the decision-making process. This is particularly true if the subject matter is, for example, subjective or socially contestable. Organisations also need to consider whether high-quality test data can be obtained on an ongoing basis to establish a “ground truth”. Senior leaders should be aware that, left to their own devices, data scientists may not distinguish between data labels that are objective and those that are subjective, even though this may be an important distinction in relation to the data protection accuracy principle.
If organisations decide to adopt an AI system, then they should:
- ensure that all functions and individuals responsible for its development, testing, validation, deployment, and monitoring are adequately trained to understand the associated accuracy requirements and measures; and
- adopt an official common terminology that staff can use to discuss accuracy performance measures, including their limitations and any adverse impact on data subjects.
Accuracy and the appropriate measures to evaluate it should be considered from the design phase, and should also be tested throughout the AI lifecycle. After deployment, monitoring should take place at a frequency proportional to the impact an incorrect output may have on data subjects: the higher the impact, the more frequent the monitoring. Accuracy measures should also be regularly reviewed to mitigate the risk of concept drift, and change policy procedures should take this into account from the outset.
Accuracy is also an important consideration if organisations outsource the development of an AI system to a third party (either fully or partially) or purchase an AI solution from an external vendor. In these cases, any accuracy claim made by third parties needs to be examined and tested as part of the procurement process. Similarly, it may be necessary to agree regular updates and reviews of accuracy to guard against changing population data and concept drift.
Finally, the vast quantity of personal data organisations will need to hold and process as part of their AI systems is likely to put pressure on any pre-AI processes to identify and, if necessary, rectify or delete inaccurate personal data, whether it is used as input or as training/test data. Organisations will therefore need to review their data governance practices and systems to ensure they remain fit for purpose.
We are keen to hear your thoughts on this topic and welcome any feedback on our current thinking. In particular, we would appreciate your views on the following two questions:
1) Are there any additional compliance challenges in relation to accuracy of AI systems outputs and performance measures we have not considered?
2) What other technical and organisational controls or best practice do you think organisations should adopt to comply with accuracy requirements?
Please share your views by leaving a comment below or by emailing us at AIAuditingFramework@ico.org.uk