Abstract
Objectives Our aim is to evaluate the performance of multimodal deep learning to classify skin lesions using both images and textual descriptions compared to learning only on images.
Materials and Methods We used the HAM10000 dataset in our study containing 10,000 skin lesion images. We combined the images with patients’ data (sex, age, and lesion location) for training and evaluating a multimodal deep learning classification model. The dataset was split into 70% for training the model, 20% for the validation set, and 10% for the testing set. We compared the multimodal model’s performance to well-known deep learning models that only use images for classification.
Results We used accuracy and area under the curve (AUC) receiver operating characteristic (ROC) as the metrics to compare the models’ performance. Our multimodal model achieved the best accuracy (94.11%) and AUCROC (0.9426) compared to its competitors.
Conclusion Our study showed that a multimodal deep learning model can outperform traditional deep learning models for skin lesion classification on the HAM10000 dataset. We believe our approach can enable primary care clinicians to screen for skin cancer in patients (residing in areas lacking access to expert dermatologists) with higher accuracy and reliability.
Lay Summary Skin cancer, which includes basal cell carcinoma, squamous cell carcinoma, melanoma, and less frequent lesions, is the most frequent type of cancer. Around 9,500 people in the United States are diagnosed with skin cancer every day. Recently, multimodal learning has gained a lot of traction for classification tasks. Many of the previous works used only images for skin lesion classification. In this work, we used the images and patient metadata (sex, age, and lesion location) in HAM10000, a publicly available dataset, for multimodal deep learning to classify skin lesions. We used the model ALBEF (Align before Fuse) for multimodal deep learning. We compared the performance of ALBEF to well-known deep learning models that only use images (e.g., Inception-v3, DenseNet121, ResNet50). The ALBEF model outperformed all other models achieving an accuracy of 94.11% and an AUROC score of 0.9426 on HAM10000. We believe our model can enable primary care clinicians to accurately screen for skin cancer in patients.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This project was funded by the Translational Research Informing Useful and Meaningful Precision Health (TRIUMPH) grant from the University of Missouri-Columbia.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study used only publicly available dataset. The HAM10000 dataset.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
This is the updated version
Data Availability
All data produced in the present study are available upon reasonable request to the authors