Blog: Artificial intelligence accurately grades IBD disease severity – Healio
SAN DIEGO — Using artificial intelligence methods in ulcerative colitis, researchers showed that neural networks produce as sensitive and specific readings as expert physician readers when used in ideal conditions, according to data presented at Digestive Disease Week.
“Image recognition using deep learning and artificial intelligence techniques in neural networks can grade endoscopic images and morphology … with similar agreement and accuracy that human experts would have,” Ryan W. Stidham, MD, of the University of Michigan, said during his presentation. “In the immediate term, we are going to see these deep learning methods throughout gastroenterology allowing us to have assessments with lower bias and greater access to these types of objective interpretation methods – not just in IBD but in all of GI.”
Stidham said that mucosal healing is a key metric of therapeutic response and clinical outcomes, so broader clinical access to unbiased endoscopic review would help in assessing and predicting outcomes.
“This isn’t a future concept. We are already seeing these methods being used throughout medicine,” Stidham said. “We are seeing AI being used in medical image interpretation to help with objectivity and reproducibility. … You can also use these image analysis methods to help improve sensitivity analysis. … But another aspect of this is these methods can broaden access to expert level assessments.”
Still image recognition
Stidham presented data from a retrospective single-center review covering 2007 to 2017, in which he and his colleagues searched Michigan’s endoscopic imaging database and the university’s electronic health record. They selected patients diagnosed with UC who had colonoscopy or sigmoidoscopy images available (n = 3,082; 16,514 images).
The researchers used 2,778 patients and their 14,862 images to train the deep learning system, while 304 patients and their 1,652 images formed the test set, which was read by both the trained system and the human reviewers. The deep learning model was a convolutional neural network (CNN), and the CNN returned a Mayo score probability for each image in the test set.
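The split is reported in terms of patients and their images, which suggests that all images from a given patient went to either the training set or the test set. A minimal sketch of such a patient-level split is shown here, assuming a roughly 10% test fraction (304 test patients is close to a tenth of the cohort). The function name `split_by_patient`, the patient IDs, and the seed are hypothetical and not taken from the study.

```python
import random


def split_by_patient(image_patient_ids, test_fraction=0.1, seed=0):
    """Split image indices so that no patient appears in both sets.

    image_patient_ids[i] is the (hypothetical) patient ID for image i.
    Returns (train_indices, test_indices).
    """
    patients = sorted(set(image_patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)

    # Hold out a fraction of patients, not a fraction of images.
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train_indices = [i for i, p in enumerate(image_patient_ids)
                     if p not in test_patients]
    test_indices = [i for i, p in enumerate(image_patient_ids)
                    if p in test_patients]
    return train_indices, test_indices


# Illustrative data only: 20 made-up patients with 5 images each.
example_ids = ["p%02d" % (i // 5) for i in range(100)]
train_idx, test_idx = split_by_patient(example_ids)
```

Splitting by patient rather than by image keeps near-identical images from the same procedure out of both sets at once, which would otherwise tend to inflate test performance.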
The images were scored by two human reviewers and the deep learning system on the basis of Mayo endoscopic subscore.
With the still images, the highest score given by the CNN system was assigned and compared with the human scores.
The two reviewers agreed on 72.1% of images overall, with greater agreement at the extremes of the scale and a Cohen’s kappa of 0.86 (95% CI, 0.85-0.87). Agreement between the CNN model and the human scores was similar, with a Cohen’s kappa of 0.84 (95% CI, 0.83-0.86).
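Cohen’s kappa measures agreement between two raters beyond what chance alone would produce. The sketch below is illustrative only: the scores are made up, `cohen_kappa` is a hypothetical helper, and the presentation as reported here does not say whether the kappa was weighted. An unweighted kappa can never exceed the raw agreement rate, so a kappa of 0.86 alongside 72.1% agreement would be consistent with a weighted variant, which gives partial credit for near misses on an ordinal scale such as the Mayo subscore.

```python
def cohen_kappa(rater_a, rater_b, n_categories, weights=None):
    """Cohen's kappa for two raters scoring items 0..n_categories-1.

    weights=None gives the unweighted statistic; "linear" or "quadratic"
    penalise disagreements by how far apart the two scores are.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    k = n_categories

    # Joint proportions: observed[i][j] = share of items scored i by A, j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance ** 2

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed_disagreement = sum(penalty(i, j) * observed[i][j] for i, j in cells)
    expected_disagreement = sum(penalty(i, j) * row[i] * col[j] for i, j in cells)
    return 1.0 - observed_disagreement / expected_disagreement


# Made-up Mayo subscores (0-3) from two hypothetical reviewers.
reviewer_1 = [0, 0, 1, 1, 2, 2, 3, 3]
reviewer_2 = [0, 1, 1, 1, 2, 3, 3, 3]

raw_agreement = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / len(reviewer_1)
unweighted = cohen_kappa(reviewer_1, reviewer_2, n_categories=4)
quadratic = cohen_kappa(reviewer_1, reviewer_2, n_categories=4, weights="quadratic")
```

In this toy example both disagreements are off by a single grade, so the quadratic-weighted kappa comes out well above the unweighted one even though the raw agreement rate is unchanged.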
“There’s no significant difference between the agreement you see between two human reviewers and the agreement between the automated model and the adjudicated scores,” Stidham said.
When using the CNN model to distinguish between remission (Mayo 0-1) and moderate to severe activity (Mayo 2-3), “we get an excellent performance of the automated model,” Stidham said.
Sensitivity came in at 0.83 (95% CI, 0.808-0.854) while specificity measured at 0.96 (95% CI, 0.951-0.961). The automated model then had a positive predictive value of 0.86 (95% CI, 0.847-0.882) and a negative predictive value of 0.96 (95% CI, 0.951-0.961).
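These four measures all come from the same two-by-two table once the Mayo subscore is collapsed into remission (0-1) versus active disease (2-3). A minimal sketch of that calculation follows. The scores are invented for illustration, `binary_metrics` is a hypothetical helper, and the study's underlying counts are not reported in this article.

```python
def binary_metrics(reference, predicted, threshold=2):
    """Dichotomise Mayo subscores at `threshold` and compute four measures.

    Scores >= threshold count as active disease (the positive class).
    Assumes each class occurs at least once in both lists; otherwise a
    denominator below would be zero.
    """
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        ref_active = ref >= threshold
        pred_active = pred >= threshold
        if ref_active and pred_active:
            tp += 1
        elif ref_active and not pred_active:
            fn += 1
        elif not ref_active and pred_active:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),  # active cases the model caught
        "specificity": tn / (tn + fp),  # remission cases the model cleared
        "ppv": tp / (tp + fp),          # model says active: how often right
        "npv": tn / (tn + fn),          # model says remission: how often right
    }


# Made-up subscores: human reference versus a hypothetical model.
human_scores = [0, 1, 1, 2, 3, 3, 2, 0, 1, 2]
model_scores = [0, 1, 2, 2, 3, 2, 1, 0, 0, 3]
metrics = binary_metrics(human_scores, model_scores)
```

Note that disagreements within a class, such as a 3 scored as a 2, do not affect any of these measures, which is one reason the dichotomised results can look stronger than the four-level agreement.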
“This tells us that machines are able to appreciate the morphology of endoscopic disease the way humans can, but this is of course not what we do in practice. We don’t grade still images,” Stidham said. The researchers then applied the model trained on still images to full-motion video.
The researchers took 50 consecutive UC patients who had good bowel prep (Boston bowel prep score 7) but were not undergoing CRC surveillance. They split the videos into individual frames at 30 frames per second and identified low-quality frames. The CNN model graded every frame, and the proportions of frames assigned each score were used to produce an overall score for the video, which was compared with the human reviews.
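The exact rule for turning per-frame grades into one video score was not described in this report. One possible approach, shown purely as an assumption, is to drop low-quality frames, compute the share of remaining frames at each subscore, and report the highest subscore that accounts for at least some minimum share. The function `video_score` and the 10% cutoff are hypothetical.

```python
from collections import Counter


def video_score(frame_scores, usable, min_fraction=0.10):
    """Summarise per-frame Mayo subscores (0-3) into one video-level score.

    frame_scores[i] is the model's grade for frame i; usable[i] is False
    for frames flagged as low quality. The rule used here (highest score
    seen in at least min_fraction of usable frames) is an assumption.
    """
    kept = [score for score, ok in zip(frame_scores, usable) if ok]
    if not kept:
        raise ValueError("no usable frames in this video")

    counts = Counter(kept)
    proportions = {score: counts.get(score, 0) / len(kept) for score in range(4)}

    qualifying = [score for score, share in proportions.items()
                  if share >= min_fraction]
    if not qualifying:
        # Cutoff too strict for this video: fall back to the most common score.
        return counts.most_common(1)[0][0], proportions
    return max(qualifying), proportions


# Made-up example: 100 graded frames, all usable.
frames = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
overall, shares = video_score(frames, usable=[True] * len(frames))
```

Any rule of this kind involves a trade-off: a low cutoff lets a brief run of high grades drive the result, while a high cutoff can miss a short segment of severe disease.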
“Looking at this relatively crude method of doing this, it actually doesn’t do that bad,” Stidham said.
Overall, 82% of videos (41 of 50) received the same subscore from the model as from the human endoscopist. Stidham showed that misclassifications were often due to increased bleeding after biopsy or polypectomy, or to a short segment of high-severity disease.
He further explained that the inherent bias in any group of human reviewers must be understood, and that translating qualitative human review into exact quantitative assessments is a “continual challenge.”
Stidham added that differences in technology will be challenging to account for, and that there is not yet good proof that these AI models can handle poor bowel prep.
“While in principle, this looks like this works, we definitely have a bit more work to do to refine this to handle all the different scenarios these machines might face,” Stidham said. – by Katrina Altersitz
Stidham R, et al. Abstract 336. Presented at: Digestive Disease Week; May 18-21; San Diego, California.
Disclosure: Stidham reports acting as a consultant for AbbVie, Janssen, and Merck.