Media and pop culture love stories about the future rise of artificial intelligence and the potential of rogue machines. Dystopian Hollywood movies like the Terminator series picture a grim future of desperate struggle between a nearly extinct human race and a world-spanning synthetic intelligence. Even in mainstream science the current narrative focuses on a competition between human and artificial intelligence. Some researchers believe that AI will make human intelligence dispensable. Geoffrey Hinton, a distinguished figure in the field of AI, suggested in 2016 that we should stop training radiologists now. What he meant, of course, is that AI will do the radiologists’ good work in the future. While this scenario is not nearly as threatening as the extinction of the human race (radiologists may disagree), it still challenges our professional pride. Other prominent researchers, however, do not share this point of view, and it is still a matter of debate how exactly AI will influence medicine in the future (1). We took an interest in this area of conflict and sought to shift the narrative towards human/AI collaboration, at least in the field of diagnostic medicine. To this end we studied the use case of skin cancer diagnosis. The initial idea was to explore the effects of varied representations of AI support across different levels of clinical expertise. With the help of valuable suggestions and constructive critique from the reviewers and the editor of Nature Medicine, we expanded this idea and ended up exploring multiple clinical workflows and different scenarios, including “rogue” AI.
Recent studies in dermatology demonstrated that, for selected lesions, AI is equivalent or even superior to human experts in image-based diagnosis under experimental conditions. In 2017, Esteva et al. pointed out the potential of state-of-the-art machine learning in the field of skin cancer detection (2), and in 2019 we demonstrated that the accuracy of even average machine learning algorithms is comparable to that of human experts (3). While these studies were all preliminary and may not translate to improved diagnostic accuracy in clinical practice, they were a good starting point to address the question of how humans and machines can work together as a team. For that reason we created an online platform to study human-machine interaction in more detail. To attract as many experts as possible we gamified the task of skin cancer diagnosis. The task of the human raters was to diagnose images of pigmented skin lesions, first without and then with AI support. Because various representations of AI support may affect human raters differently, we offered three popular types of representation (multiclass probabilities, malignancy probability, and content-based image retrieval) that differed in key characteristics such as simplicity, granularity, and concreteness. After the first hundred games played we already saw a trend towards higher accuracy of AI in comparison to humans, but also a superior accuracy of human/AI collaboration in comparison to AI alone. Interestingly, only the AI-based multiclass probabilities improved the accuracy of human raters; the other representations of AI support, including content-based image retrieval (CBIR), did not. This suggests that the form of decision support matters. The studied form of AI-based CBIR needed more extensive cognitive engagement in terms of time and decision-making. Human minds are lazy, and over time human raters tended to ignore the AI-based CBIR decision support.
If you are interested in how this experiment was performed, you can still visit our platform and try it out yourself (DermaChallenge.com).
Testing clinically relevant blueprints
So this was all good, but again still quite preliminary. This is also what the reviewers and the editor thought when we first submitted the manuscript. They were right, of course. To make this study more relevant we needed a more general approach and a tighter connection to clinical practice. Stimulated by the reviewers’ comments we took a closer look at clinically significant scenarios. Our Australian co-authors are engaged in telemedicine, and with their help we were able to reuse prospectively collected images from a telemedicine study of skin self-examination in high-risk patients (4). Of note, the images of this study were collected in the real world and captured by patients, not by medical professionals. It would be encouraging if AI support worked in this difficult setting. And it did, although not perfectly. AI could successfully filter benign lesions or low-risk patients with acceptable accuracy. This means we could potentially use AI for triage in telemedicine to relieve our small and constantly overworked population of experts. Imagine the consequences: More patients in need could have access to expert knowledge! AI could facilitate access to health care! There is, however, still a long way to go. In our experiments AI missed a few malignant cases. The optimal operating points to balance the benefits of AI-based triage with the risk of filtering out patients with skin cancer remain to be determined. Hopefully, in the meantime, AI will get even better at performing this task. We made similar observations in another scenario, in which we asked dermatologists to rethink their face-to-face decisions with AI support. AI support could have spared some patients surgical procedures without increasing the number of false negatives. Could have? This experiment was conducted retrospectively. Regulations did not allow us to jump-start a prospective trial in such a short period of time. Your turn!
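For readers who want to get a feeling for the operating-point trade-off described above, here is a minimal sketch. It is not the code used in the study. The function name `triage_summary`, the idea of thresholding a single malignancy probability, and the toy numbers are all assumptions made for illustration.

```python
def triage_summary(cases, threshold):
    """Summarise a simple triage rule (illustrative assumption, not the study's method).

    cases: list of (malignancy_probability, is_malignant) tuples.
    Cases with a probability below `threshold` are filtered out as low risk.
    """
    filtered = [(p, m) for p, m in cases if p < threshold]
    missed = sum(1 for _, m in filtered if m)          # malignant cases filtered out
    total_malignant = sum(1 for _, m in cases if m)
    sensitivity = None
    if total_malignant:
        sensitivity = (total_malignant - missed) / total_malignant
    return {
        "filtered_fraction": len(filtered) / len(cases),  # workload spared for experts
        "missed_malignant": missed,
        "sensitivity": sensitivity,
    }


# Toy data, invented purely for illustration.
toy_cases = [(0.05, False), (0.10, False), (0.20, True),
             (0.30, False), (0.60, True), (0.90, True)]

for t in (0.15, 0.25):
    print(t, triage_summary(toy_cases, t))
```

In this toy example, raising the threshold filters out more cases but starts to miss a malignant one, which is exactly the balance that remains to be determined on real data.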
What about "rogue" AI?
Is it possible that AI support does more harm than good? To find out, we simulated faulty AI support by intentionally generating misleading AI-based multiclass probabilities. If the top class probability favoured the correct diagnosis, we switched the probabilities in such a way that the AI output favoured a random incorrect diagnosis. Nasty, isn’t it? We saw that all raters, novices and experts alike, were prone to underperforming with faulty AI support. Underperforming means that their accuracy was worse with faulty AI than without AI, which is a bit scary. If raters had built up the trust that is necessary to benefit from AI support, they were also vulnerable to performing below their expected ability.
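The manipulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `make_faulty`, the class names, and the choice to swap the two probability values (rather than redistribute them in some other way) are ours and are not taken from the study.

```python
import random


def make_faulty(probs, true_label, rng=random):
    """Return misleading multiclass probabilities (illustrative sketch).

    probs: dict mapping class name -> probability.
    If the top class is the correct diagnosis, swap its probability with
    that of a randomly chosen incorrect class (swapping is an assumption).
    Otherwise the output is returned unchanged.
    """
    top = max(probs, key=probs.get)
    if top != true_label:
        return dict(probs)
    wrong = rng.choice([c for c in probs if c != true_label])
    faulty = dict(probs)
    faulty[true_label], faulty[wrong] = probs[wrong], probs[true_label]
    return faulty


# Hypothetical AI output for a lesion whose correct diagnosis is "nevus".
ai_output = {"nevus": 0.7, "melanoma": 0.2, "bcc": 0.1}
print(make_faulty(ai_output, "nevus", random.Random(0)))
```

Swapping keeps the set of probability values intact, so the faulty output looks just as confident as the original one while pointing at the wrong diagnosis.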
Can we learn something from machines?
We humans are very proud of our intelligence, but we obviously have some deficits, known as cognitive biases. Could AI help us to overcome those biases? One deficit is related to visual entrenchment. We see only what is necessary in order to make a quick decision. We simplify complex scenes and disregard the seemingly irrelevant. But sometimes the irrelevant can be relevant. We got this idea by studying one of the references suggested by a reviewer. It was also helpful that one of the authors’ favourite books is “Thinking, Fast and Slow” by Daniel Kahneman.
We knew from previous experiments that AI is much better than human experts at diagnosing pigmented actinic keratoses. But why? We hypothesized that, due to visual entrenchment, humans focus on the lesion and not on the background and frequently miss the important clue of chronic UV damage in the surrounding skin. AI, on the other hand, pays attention to the background. By analysing Gradient-weighted Class Activation Mapping (Grad-CAM) we showed that AI attention outside the object is higher for the prediction of actinic keratoses than for other categories. Could we teach humans to pay attention to the background too, and would humans get better if they did? Luckily, one of our authors teaches dermatology to medical students at the Medical University of Vienna. We could demonstrate that teaching medical students to pay attention to chronic sun damage in the background improves their overall diagnostic accuracy for pigmented lesions, not only for pigmented actinic keratoses in particular.
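One simple way to quantify "attention outside the object" is the share of a Grad-CAM heatmap that falls outside a lesion segmentation mask. The sketch below illustrates that idea only. The function name `attention_outside`, the use of a binary mask, and this particular measure are assumptions; the text above does not specify how the measure was computed in the study.

```python
def attention_outside(cam, mask):
    """Fraction of heatmap activation outside the lesion (illustrative sketch).

    cam:  2D list of non-negative Grad-CAM activations.
    mask: 2D list of the same shape, truthy where a pixel belongs to the lesion.
    """
    total = sum(value for row in cam for value in row)
    if total == 0:
        return 0.0  # no activation anywhere, so nothing lies outside
    outside = sum(
        value
        for cam_row, mask_row in zip(cam, mask)
        for value, inside in zip(cam_row, mask_row)
        if not inside
    )
    return outside / total


# Tiny made-up heatmap and mask, for illustration only.
toy_cam = [[1, 1],
           [2, 4]]
toy_mask = [[0, 0],
            [0, 1]]
print(attention_outside(toy_cam, toy_mask))
```

Averaging such a fraction per predicted category would allow a comparison between actinic keratoses and other diagnoses.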
Summary
It was an exciting journey for us. We examined human-computer collaboration from multiple angles and created a blueprint for similar studies in image-based diagnostic medicine. Our findings suggest that the primary focus of AI should shift from human-computer competition to human-computer collaboration. Only then can we accelerate the evolution of AI support in diagnostic medicine.
Philipp Tschandl, Christoph Rinner, Harald Kittler
References
1. Mukherjee, S. A.I. Versus M.D. The New Yorker (2017).
2. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
3. Tschandl, P. et al. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. Lancet Oncol. 20, 938–947 (2019).
4. Janda, M. et al. Accuracy of mobile digital teledermoscopy for skin self-examinations in adults at high risk of skin cancer: an open-label, randomised controlled trial. The Lancet Digital Health 2, e129–e137 (2020).