Machine learning algorithms are permeating our world. With applications in banking, investing, social media, advertising, and crime prevention, to name a few, these 'little black boxes' are increasingly being used to inform and drive decisions about our lives and businesses.

Yet, how do we know that in high-stakes situations—such as prison sentencing, airport security, combat, or how self-driving cars might choose between two catastrophic outcomes—the algorithms really are intellectually sound and in our best interest?

Machine learning models are evaluated and tested by measuring their accuracy on a portion of the data withheld from training, known as the "test set." Machine learning as a discipline attempts to predict an outcome and reduce prediction error over time by analyzing more and more data and recalibrating the algorithm accordingly. A model may, however, perform well mathematically and on its test set, yet still fail to accomplish what it was built to do.
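To make that concrete, the sketch below shows a standard holdout evaluation in Python with scikit-learn. The dataset and model are arbitrary stand-ins chosen purely for illustration, not anything drawn from the studies discussed here.

# A minimal sketch of holdout ("test set") evaluation with scikit-learn.
# The dataset and model are illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Withhold 25 percent of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Accuracy on the holdout is the usual report card, but it says nothing
# about whether the model is right for the right reasons.
print("Test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))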

Husky or Wolf?

Carlos Guestrin, a professor at the University of Washington, wrote about just such an example, along with graduate students Sameer Singh and Marco Tulio Ribeiro, in their paper, "Why Should I Trust You?": Explaining the Predictions of Any Classifier. The academics examined a model that looked at pictures of wolves and huskies and attempted to distinguish between the two. The model did a good job of predicting whether the picture was of a wolf or a husky, with one exception: All of the wolf pictures on which the model was trained had snow in the background, so a husky photographed against snow would occasionally be classified as a wolf, and vice versa. Now, what if this algorithm was acting as a gatekeeper to letting dogs into a children's park? Well, you get the picture.

Currently, machine learning algorithms in the wild are curated by a few well-trained "data geeks" who are experts at acquiring data, cleaning it, transforming it, and making prediction models. These individuals, however, are not trained in the intricacies of risk analysis and general enterprise governance. Are these newly employed algorithms that are increasingly being used to guide decisions producing the desired effect for the organization?

A Job for Internal Audit
An objective, third-party audit of these algorithms is needed to ensure the brainchild of the few is serving the needs of the organizational many. Additionally, it likely won't be long before legal precedent or regulatory scrutiny comes to bear on these complex mathematical models that increasingly affect enterprise decision making. Who better to provide assurance of their accuracy than internal and external auditors?

Auditing machine learning algorithms is tricky, however, and requires extensive knowledge of the data science and machine learning fields, in addition to an understanding of the organization's industry and risk appetite. Even if those skills were readily available, how would one go about evaluating the efficacy of an algorithm? Guestrin and his students produced the Local Interpretable Model-Agnostic Explanations (LIME) framework for evaluating classifiers and regressors, which are two common types of machine learning algorithms.

LIME Aid
LIME provides a framework for humans to evaluate and understand how an algorithm operates at a local level, meaning a number of individual predictions are sampled from the model and mini-explanation models are built around those samples. In the wolves vs. huskies example, the mini-explanation model would show why the classifier labeled a particular image of a husky as a wolf, or a wolf as a husky. This is brand-new research: the paper was presented in August 2016 at the KDD conference in San Francisco, and the work is still evolving. It nevertheless provides a good starting point for building an ML audit framework.
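As a rough illustration of what a local explanation looks like in practice, the sketch below uses the authors' open-source lime Python package (installed with pip install lime) on a tabular dataset. The dataset and model are hypothetical stand-ins; reproducing the image example would require considerably more setup.

# A minimal sketch of a LIME local explanation using the open-source
# lime package; the tabular dataset and model are illustrative stand-ins.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Build the explainer around the training data, then ask why the model
# classified one held-out record the way it did.
explainer = LimeTabularExplainer(
    X_train,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(
    X_test[0], model.predict_proba, num_features=5
)

# Each line pairs a human-readable feature condition with its local weight,
# showing which features pushed this single prediction one way or the other.
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")

An auditor reviewing output like this would look for explanations that lean on features with no plausible business meaning, the tabular equivalent of "snow in the background."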

With machine learning algorithms becoming ever more ingrained within leading organizations, independent, objective verification of their assumptions and their effect on the enterprise's risk tolerance is needed. Internal and external auditors are in the perfect position to provide this assurance. The technical competence hurdles, however, are hard to overcome for many firms and departments. With the advent of model evaluation frameworks, such as Guestrin and team's LIME project, ML assurance can become a reality.


Andrew T. Clark is IT auditor of Astec Industries. The views expressed here are his own and are not intended to reflect those of any particular organization.