<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with MathML3 v1.1d2 20140930//EN" "JATS-journalpublishing1-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.1d2" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">BISH</journal-id>
      <journal-id journal-id-type="publisher-id">IECE</journal-id>
      <journal-title-group>
        <journal-title>Biomedical Informatics and Smart Healthcare</journal-title>
      </journal-title-group>
      <issn pub-type="ppub" publication-format="print">pending</issn>
      <issn pub-type="epub" publication-format="electronic">pending</issn>
      <publisher>
        <publisher-name>Institute of Emerging and Computer Engineering Inc</publisher-name>
        <publisher-loc>522 W RIVERSIDE AVE STE N, SPOKANE, WA, 99201-0508, UNITED STATES</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.62762/BISH.2025.724307</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Diabetic Retinopathy Detection and Analysis with Convolutional Neural Networks and Vision Transformer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0006-8419-9003</contrib-id>
          <name>
            <surname>Tewari</surname>
            <given-names>Yogesh</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0008-3420-1018</contrib-id>
          <name>
            <surname>Parihar</surname>
            <given-names>Nitin Singh</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0007-5065-6074</contrib-id>
          <name>
            <surname>Rautela</surname>
            <given-names>Karan</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0006-2990-9738</contrib-id>
          <name>
            <surname>Kaundal</surname>
            <given-names>Nishant</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-4435-675X</contrib-id>
          <name>
            <surname>Diwakar</surname>
            <given-names>Manoj</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4948-8513</contrib-id>
          <name>
            <surname>Kumar</surname>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1"><label>1</label>Department of Computer Science and Engineering, Graphic Era Deemed to be University, Dehradun 248002, India</aff>
      </contrib-group>
      <author-notes>
        <corresp id="cor5">Corresponding Author: Manoj Diwakar. Email: <email>manoj.diwakar@gmail.com</email></corresp>
      </author-notes>
      <pub-date date-type="pub" pub-type="epub" publication-format="online">
        <day>03</day>
        <month>6</month>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <fpage>18</fpage>
      <lpage>26</lpage>
      <history>
        <date date-type="received">
          <day>30</day>
          <month>3</month>
          <year>2025</year>
        </date>
        <date date-type="accepted">
          <day>07</day>
          <month>5</month>
          <year>2025</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2025 by the Authors. Published by Institute of Emerging and Computer Engineers. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).</copyright-statement>
        <copyright-year>2025</copyright-year>
        <copyright-holder>Institute of Emerging and Computer Engineering Inc</copyright-holder>
        <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
        <license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
        </license>
      </permissions>
      <self-uri xlink:href="https://www.iece.org/article/abs/bish.2025.724307">This article is available from https://www.iece.org/article/abs/bish.2025.724307</self-uri>
      <abstract>
        <p>Diabetic retinopathy (DR) occurs when elevated blood sugar levels damage retinal blood vessels, potentially leading to vision impairment. In this paper, we evaluate the performance of CNN, ViT, and hybrid CNN-ViT models. The dataset is publicly available on Kaggle and contains around 35,000 retinal images divided into five classes: No DR, Mild DR, Moderate DR, Severe DR, and Proliferative DR. Among the four CNN architectures we tested, ResNet-50 achieved the best accuracy of 75.4%. The ViT model achieved an accuracy of 83.9%, and the ResNet-50 + ViT hybrid model achieved 88.4%. These results are promising, but the study has limitations. The dataset is skewed towards the No DR class, so future work could use more balanced datasets together with data augmentation techniques. Additionally, the models were trained for only 50 epochs; longer training may allow them to reach their full potential.</p>
      </abstract>
      <kwd-group kwd-group-type="author" xml:lang="en">
        <kwd>diabetic retinopathy</kwd>
        <kwd>CNN</kwd>
        <kwd>ViT</kwd>
        <kwd>deep learning</kwd>
        <kwd>image classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="S1">
      <label>1.</label>
      <title>Introduction</title>
      <p id="S1.p1">Diabetic retinopathy (DR) is a condition in which elevated blood sugar levels damage the blood vessels of the retina. It significantly affects individuals with diabetes and can cause complete vision loss. Detection during the initial stage (Mild DR) can prevent progression to severe vision loss. As the disease progresses from Mild DR to Moderate DR with rising blood sugar levels, abnormal tissue growth also increases. The final stage is Proliferative DR, in which the growth of abnormal tissue can lead to complete vision loss. Early detection is therefore an effective way to reduce the impact of diabetic retinopathy. However, in many areas ophthalmologists and proper equipment are not available. The number of people affected by diabetic retinopathy is rising, which leads to a shortage of healthcare capacity. Ophthalmologists' time is better devoted to treating the disease than to grading it, which creates a role for AI tools in solving the classification problem. Several neural network architectures provide efficient results on classification problems, and computer vision tools can now automate the grading process, reducing the burden on doctors. Classifying the different stages of DR supports early diagnosis and further recommendations. Recent years have seen progress in deep learning for medical image analysis, especially for retinal disease diagnosis. In this study we compare and analyze different deep learning models: a Convolutional Neural Network (CNN), a Vision Transformer (ViT), a hybrid of ResNet-50 and ViT, and a hybrid of EfficientNet-B0 and ViT. 
These models are evaluated on the Kaggle dataset Diabetic Retinopathy (resized), which consists of almost 35000 retinal images labelled according to five diabetic retinopathy stages. Convolutional Neural Networks (CNNs) are widely used for image classification tasks because of their strong feature extraction capabilities. They consist of several types of layers. Convolutional layers detect patterns such as edges, features, and structures in retinal images. Pooling layers reduce the spatial dimensions while keeping essential features. A fully connected layer classifies the extracted features into one of the five diabetic retinopathy stages. CNNs are good at capturing local spatial features but may struggle with long-range dependencies in a complex image. Vision Transformers (ViTs) have gained popularity as an alternative to CNNs because they use a self-attention mechanism to model long-range dependencies across a complex retinal image. Unlike a CNN, which processes an image using spatial hierarchies, a ViT divides the retinal image into fixed-size patches that are used as tokens in a sequence. These patches are passed through a series of multi-head self-attention layers, which help the model learn relationships between different parts of the retinal image. ViTs are helpful where capturing contextual relationships is crucial. However, they require a large amount of training data and computational power. To overcome the limitations of standalone CNN and ViT models, we implement hybrid models that integrate both architectures. One such model is a hybrid of ResNet-50 and ViT. ResNet-50 is a deep CNN architecture with residual connections and is used here for feature extraction. The hybrid model uses the CNN's local feature extraction together with the ViT's global self-attention mechanism for better image classification. 
Combining the two helps capture diabetic retinopathy related abnormalities that a CNN alone could miss. Additionally, we implemented another hybrid model that combines EfficientNet-B0 with ViT. EfficientNet-B0 is an optimized CNN architecture that balances accuracy and computational efficiency by scaling depth, width, and resolution in a structured manner. This combination reduces computational cost, making it promising for diabetic retinopathy analysis. This study compares four model architectures (CNN, ViT, ResNet-50+ViT, and EfficientNet-B0+ViT) for detecting and classifying DR on a publicly available Kaggle dataset of retinal images labelled with five classes. Each model is evaluated to determine the most efficient approach for diabetic retinopathy detection. By developing a reliable deep learning system for diabetic retinopathy detection, this research aims to help ophthalmologists with early detection and management of diabetic retinopathy and to contribute to AI-driven advances in retinal image analysis.</p>
    </sec>
    <sec id="S2">
      <label>2.</label>
      <title>Related Work</title>
      <p id="S2.p1">In healthcare, Abramoff et al. [<xref rid="ref001" ref-type="bibr">1</xref>] conducted a pivotal trial of an autonomous AI-based diagnostic system for diabetic retinopathy. The system allows people to be screened in primary care without specialist consultation. The research highlighted the system's high accuracy and efficiency in detecting DR and showed the importance of integrating such AI systems into the healthcare sector for early detection and management of DR. Ting et al. [<xref rid="ref002" ref-type="bibr">2</xref>] reviewed the global prevalence of DR and its associated risk factors. Their review shows the lack of screening facilities across the country and the need for a robust public health program to address obstacles in DR detection and treatment. For the APTOS-2019 challenge hosted on Kaggle, Kartik [<xref rid="ref003" ref-type="bibr">3</xref>] developed an AI model for DR detection. The challenge provided a dataset with DR severity levels ranging from 0 (No DR) to 4 (Proliferative DR). Using performance metrics such as accuracy, sensitivity and integrity, he established a robust dataset and evaluation framework that helps researchers advance DR detection models.</p>
      <p id="S2.p2">In Enhanced U-Net for Diabetic Retinopathy Segmentation, Agarwal [<xref rid="ref004" ref-type="bibr">4</xref>] approached DR detection through image segmentation of retinal images using an enhanced U-Net model. He used the IDRiD dataset for lesion segmentation and achieved higher accuracy than traditional U-Net models, because isolating the affected regions in retinal images improves preprocessing. Dosovitskiy et al. [<xref rid="ref005" ref-type="bibr">5</xref>] proposed the Vision Transformer (ViT) in their work An Image is Worth 16×16 Words: Transformers for Image Recognition. The model applies a self-attention mechanism to image recognition tasks, enabling it to capture both global and local features. Their study demonstrated the scalability of Transformer-based architectures and their strong performance relative to conventional convolutional neural networks (CNNs) when pre-trained on large datasets. Gulshan et al. [<xref rid="ref006" ref-type="bibr">6</xref>] used the Harris Hawk Optimization (HHO) algorithm, which is inspired by hawks' cooperative hunting strategies. They applied HHO to hyperparameter tuning to improve neural network performance. Their study demonstrates the potential of such optimization to improve the performance of complex models like ViT in DR detection tasks. Zhai et al. [<xref rid="ref007" ref-type="bibr">7</xref>] discussed various methods for scaling Vision Transformers to larger datasets. They improved stability and computational efficiency for large-scale image recognition tasks, which makes ViTs well suited to DR detection, where larger datasets like APTOS-2019 are commonly used.</p>
      <p id="S2.p3">Kobat et al. [<xref rid="ref008" ref-type="bibr">8</xref>] detected DR using a pre-trained DenseNet on digital fundus images. They divide each image into horizontal and vertical patches, and the model extracts deep features for both three-class and five-class classification. The patching and the hybrid model improve localization and robustness. Tanlikesmath et al. [<xref rid="ref009" ref-type="bibr">9</xref>] created a dataset of eye images for diabetic retinopathy analysis, using both resized and cropped images. The resulting dataset consists of 35 thousand images. Vaswani et al. [<xref rid="ref010" ref-type="bibr">10</xref>] introduced the transformer model, which marked a departure from RNNs and CNNs and set benchmarks for NLP tasks. Subsequent efforts continue to refine the transformer and to explore its use in areas such as computer vision, extending its impact beyond language processing.</p>
      <p id="S2.p4">Li et al. [<xref rid="ref011" ref-type="bibr">11</xref>] used a dataset of 13673 images from 9598 patients, of which 757 images were manually annotated for lesion detection. On this dataset they achieved an accuracy of 0.8284. Despite the high classification accuracy, the model failed at precise lesion localization, which underscores the complexity of the task. Staal et al. [<xref rid="ref012" ref-type="bibr">12</xref>] performed retinal vessel segmentation for early detection of diabetic retinopathy. They used a ridge-based segmentation approach with a KNN classifier combined with sequential forward feature selection. Performance was evaluated on a dataset of 40 manually labelled retinal images, achieving an area under the ROC curve of 0.952 and an accuracy of 0.944. Quellec et al. [<xref rid="ref013" ref-type="bibr">13</xref>] used deep image mining for diabetic retinopathy screening. They trained a ConvNet to detect images with DR, using supervision at the image level only, that is, labels indicating whether or not the image shows referable DR. They used a public dataset of 90000 images, and the resulting model outperformed lesion detectors on the DiaretDB1 dataset. Szegedy et al. [<xref rid="ref014" ref-type="bibr">14</xref>] stated that CNNs have revolutionized image recognition, enabling advances in accuracy and efficiency. They concluded that the Inception network optimizes computational efficiency while maintaining high accuracy, whereas ResNet allows for deeper models with improved gradient flow. This study motivates future work on hybrid models to improve results and accuracy. Decencière et al. [<xref rid="ref015" ref-type="bibr">15</xref>] created a hybrid model, BrownViTNet, to solve the problem of brownfield site detection. 
The architecture consists of four initial convolutional layers with intermediate layers using a Vision Transformer (ViT). As a result, the model achieves faster convergence and better generalization than simple CNN models, and it also improves feature representations. Sugeno et al. [<xref rid="ref016" ref-type="bibr">16</xref>] trained an EfficientNet-B3 model on the publicly available Kaggle Asia Pacific Tele-Ophthalmology Society (APTOS) 2019 training dataset. They achieved a classification accuracy of 0.84 for the top predicted label, 0.95 for the second prediction, and 0.98 for the third prediction. To enhance DR analysis, they also performed lesion detection. Key observations were the simultaneous detection of blood vessels and red lesions, accurate extraction of white lesions, and validation on the DIARETDB1 dataset. Usman et al. [<xref rid="ref017" ref-type="bibr">17</xref>] applied three different state-of-the-art CNN architectures, namely ResNet50, ResNet152 and SqueezeNet1, to classify the lesions. ResNet50 achieved an accuracy of 93.67%, SqueezeNet1 achieved 91.94%, and ResNet152 achieved the highest accuracy of the three, 94.40%.</p>
      <p id="S2.p5">The study by Willis et al. [<xref rid="ref018" ref-type="bibr">18</xref>] reinforces the critical role of early detection and severity assessment in helping patients. They analyzed 1004 adults aged 40 and above with diabetes, and DR severity was graded using the Early Treatment Diabetic Retinopathy Study (ETDRS) severity scale. Such findings help in refining deep learning models for DR classification and in developing treatment strategies. The studies reviewed in this survey show improvements in DR detection using various models, such as those in [<xref rid="ref019" ref-type="bibr">19</xref>, <xref rid="ref020" ref-type="bibr">20</xref>, <xref rid="ref021" ref-type="bibr">21</xref>, <xref rid="ref022" ref-type="bibr">22</xref>, <xref rid="ref023" ref-type="bibr">23</xref>, <xref rid="ref024" ref-type="bibr">24</xref>]. They advance the technology by using diverse datasets and help to establish applications that detect DR without requiring a specialist.</p>
    </sec>
    <sec id="S3">
      <label>3.</label>
      <title>Methodology</title>
      <p id="S3.p1">Chronic diabetes elevates blood sugar levels, damaging retinal blood vessels. This can affect vision, causing blurriness or, in some cases, complete vision loss. This research examines the use of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting the impact of diabetes on the retina. The essential steps are data collection, data preprocessing, model training, and a comparison of all models to choose the one best suited to our study.</p>
      <sec id="S3.SS1">
        <label>3.1</label>
        <title>Data Collection</title>
        <p id="S3.SS1.p1">The dataset used here was downloaded from Kaggle and is named Diabetic Retinopathy (Resized). It consists of 35126 unique retinal images labelled with five classes: No DR (0), Mild DR (1), Moderate DR (2), Severe DR (3), and Proliferative DR (4). The size of the dataset is 7.95 GB. Sample retinal images from the dataset are shown below.</p>
        <p>
          <fig id="F1">
            <label>Figure 1.</label>
            <caption>
              <p>Retinal Images from dataset.</p>
            </caption>
            <graphic xlink:href="fig1.png"/>
          </fig>
        </p>
      </sec>
      <sec id="S3.SS2">
        <label>3.2</label>
        <title>Data Pre-Processing</title>
        <p id="S3.SS2.p1">In this step we applied data augmentation, namely flipping, rescaling, and rotating the images, and finally resized each image to 224 × 224 pixels. The data were split in a 60:20:20 ratio: approximately 21000 images for training, 7000 for validation, and 7000 for testing. The CNN and ViT models cannot process image files directly, so the images must be converted first. For the CNN we convert each image into a NumPy array of shape (224,224,3) and encode its label over the 5 classes using one-hot encoding. The ViT, on the other hand, accepts input as a tensor of shape (3,224,224) with integer labels: 0 for No DR, 1 for Mild DR, 2 for Moderate DR, 3 for Severe DR, and 4 for Proliferative DR. Figure <xref ref-type="fig" rid="F1">1</xref> shows sample retinal images.</p>
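        <p>The split and the two input layouts described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the pipeline used in this study: the helper names, the random seed, and the shuffled-index split are hypothetical, and a small zero array stands in for real retinal images.</p>
        <code language="python"><![CDATA[
```python
import numpy as np

NUM_CLASSES = 5  # No DR, Mild DR, Moderate DR, Severe DR, Proliferative DR


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 60:20:20 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(0.6 * n_samples)
    n_val = int(0.2 * n_samples)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


def one_hot(labels, num_classes=NUM_CLASSES):
    """Integer labels (N,) to one-hot rows (N, num_classes), the CNN label format."""
    return np.eye(num_classes, dtype=np.float32)[labels]


def to_channels_first(batch_hwc):
    """(N, 224, 224, 3) channels-last batch to (N, 3, 224, 224), the ViT layout."""
    return np.transpose(batch_hwc, (0, 3, 1, 2))


train_idx, val_idx, test_idx = split_indices(35126)
print(len(train_idx), len(val_idx), len(test_idx))

images = np.zeros((2, 224, 224, 3), dtype=np.float32)  # placeholder images
labels = np.array([0, 4])                               # integer labels
print(one_hot(labels))
print(to_channels_first(images).shape)
```
]]></code>
        <p>With 35126 images, this split yields roughly 21000 training images and 7000 images each for validation and testing, in line with the counts reported above.</p>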
      </sec>
      <sec id="S3.SS3">
        <label>3.3</label>
        <title>Model Architecture</title>
        <sec id="S3.SS3.SSS1">
          <label>3.3.1</label>
          <title>ResNet-50</title>
          <p>
            <fig id="F2">
              <label>Figure 2.</label>
              <caption>
                <p>ResNet-50 Architecture.</p>
              </caption>
              <graphic xlink:href="fig2.png"/>
            </fig>
          </p>
          <p id="S3.SS3.SSS1.p1">ResNet-50, shown in Figure <xref ref-type="fig" rid="F2">2</xref>, is a deep convolutional neural network (CNN) that addresses the vanishing gradient issue in deep networks through residual learning. ResNet-50 is 50 layers deep and is built from convolutional layers, batch normalization layers, activation functions, and skip connections (identity shortcuts). The skip connections facilitate efficient gradient flow, so deeper networks can be trained without compromising performance. The model receives input images resized to <inline-formula><mml:math alttext="224\times 224\times 3" display="inline"><mml:mrow><mml:mn>224</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>224</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:math></inline-formula>, which first pass through an initial <inline-formula><mml:math alttext="7\times 7" display="inline"><mml:mrow><mml:mn>7</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:math></inline-formula> convolutional layer, batch normalization, and a ReLU activation function. A max-pooling layer then reduces the spatial dimensions before the data are fed into the main residual blocks. ResNet-50 consists of four stages, each containing multiple residual blocks. 
A block contains three convolutional layers which are a <inline-formula><mml:math alttext="1\times 1" display="inline"><mml:mrow><mml:mn>1</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula> convolution for dimension reduction, a <inline-formula><mml:math alttext="3\times 3" display="inline"><mml:mrow><mml:mn>3</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:math></inline-formula> convolution for feature extraction and a <inline-formula><mml:math alttext="1\times 1" display="inline"><mml:mrow><mml:mn>1</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula> convolution to restore dimensions. At the end of the network, a global average pooling layer, followed by a fully connected layer and a softmax activation function, is employed for diabetic retinopathy classification. Due to its capability of extracting hierarchical image features, ResNet-50 can successfully detect both global retinal structures and local lesions and is hence appropriate for retinal image analysis.</p>
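          <p>As a rough check on the structure described above, the following sketch counts the weighted layers and traces the feature-map size through the network. The block counts per stage (3, 4, 6, 3) and the stride and padding values are taken from the standard ResNet-50 design and are assumptions here, since they are not stated in this paper; the function names are hypothetical.</p>
          <code language="python"><![CDATA[
```python
# Standard ResNet-50 configuration (assumed from the original ResNet design).
BLOCKS_PER_STAGE = [3, 4, 6, 3]   # bottleneck blocks in the four stages
CONVS_PER_BLOCK = 3               # 1x1 reduce, 3x3 extract, 1x1 restore


def count_weighted_layers():
    """Stem convolution + all bottleneck convolutions + final fully connected layer."""
    stem = 1
    bottleneck_convs = sum(BLOCKS_PER_STAGE) * CONVS_PER_BLOCK
    classifier = 1
    return stem + bottleneck_convs + classifier


def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_spatial_sizes(input_size=224):
    """Feature-map side length after the stem, the max-pool, and each of the four stages."""
    sizes = []
    s = conv_out(input_size, kernel=7, stride=2, padding=3)   # 7x7 stem convolution
    sizes.append(s)
    s = conv_out(s, kernel=3, stride=2, padding=1)            # 3x3 max-pooling
    sizes.append(s)
    for stage in range(4):
        if stage > 0:                                         # stages 2 to 4 halve the resolution
            s = conv_out(s, kernel=3, stride=2, padding=1)
        sizes.append(s)
    return sizes


print(count_weighted_layers())   # 50
print(trace_spatial_sizes())     # [112, 56, 56, 28, 14, 7]
```
]]></code>
          <p>Under these assumptions, a 224 × 224 input reaches the global average pooling layer as a 7 × 7 feature map.</p>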
        </sec>
        <sec id="S3.SS3.SSS2">
          <label>3.3.2</label>
          <title>Vision Transformer (ViT)</title>
          <p>
            <fig id="F3">
              <label>Figure 3.</label>
              <caption>
                <p>ViT Architecture.</p>
              </caption>
              <graphic xlink:href="fig3.png"/>
            </fig>
          </p>
          <p id="S3.SS3.SSS2.p1">The Vision Transformer (ViT) model used in this study, shown in Figure <xref ref-type="fig" rid="F3">3</xref>, is a transformer-based model for image classification that captures global and local contextual information. In contrast to standard Convolutional Neural Networks (CNNs), which use convolutional layers, ViT processes an image as a sequence of non-overlapping patches and uses self-attention mechanisms to learn feature representations. The input image, resized to <inline-formula><mml:math alttext="224\times 224\times 3" display="inline"><mml:mrow><mml:mn>224</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>224</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:math></inline-formula>, is divided into <inline-formula><mml:math alttext="16\times 16" display="inline"><mml:mrow><mml:mn>16</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>16</mml:mn></mml:mrow></mml:math></inline-formula> patches, giving 196 patches arranged in a <inline-formula><mml:math alttext="14\times 14" display="inline"><mml:mrow><mml:mn>14</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>14</mml:mn></mml:mrow></mml:math></inline-formula> grid. Each patch is flattened and projected into a 768-dimensional embedding using a linear projection layer. To preserve spatial relations, learnable positional embeddings are added to the patch embeddings. In addition, a classification token (CLS token) is prepended to the input sequence; it passes through self-attention and summarizes information from all the patches. The core of the model is a stack of transformer encoder layers, each containing Multi-Head Self-Attention (MHSA) and a Feed-Forward Neural Network (FFN). 
The MHSA mechanism enables the model to capture long-range dependencies between image regions, which makes it well suited to detecting diabetic retinopathy features. Layer Normalization (LN) is applied in each transformer block to normalize its inputs, and Dropout is applied to reduce overfitting. Following the transformer layers, the CLS token output is passed to a fully connected layer and a softmax activation function to classify the image into one of five grades of diabetic retinopathy severity. The ViT model is pre-trained on large datasets before fine-tuning on the diabetic retinopathy dataset to enhance classification accuracy. Its ability to model global feature dependencies helps it detect subtle retinal pathologies and can give it an advantage over CNNs, especially when a large dataset is available.</p>
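          <p>The patch arithmetic and the attention step can be illustrated with a short NumPy sketch. It is an assumption-laden illustration rather than the model used in this study: the weights are random stand-ins for learned parameters, only a single attention head is shown, the class token is placed at the front of the sequence as in the original ViT design, and all function names are hypothetical.</p>
          <code language="python"><![CDATA[
```python
import numpy as np

IMAGE_SIZE, PATCH_SIZE, CHANNELS, EMBED_DIM = 224, 16, 3, 768


def patchify(image_hwc, patch=PATCH_SIZE):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image_hwc.shape
    grid_h, grid_w = h // patch, w // patch
    x = image_hwc.reshape(grid_h, patch, grid_w, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                  # (grid_h, grid_w, patch, patch, C)
    return x.reshape(grid_h * grid_w, patch * patch * c)


def embed(image_hwc, rng):
    """Project patches, put a class token at the front, add positional embeddings."""
    patches = patchify(image_hwc)                                  # (196, 768)
    w_proj = rng.normal(0, 0.02, (patches.shape[1], EMBED_DIM))    # random stand-in weights
    tokens = patches @ w_proj                                      # (196, 768)
    cls = np.zeros((1, EMBED_DIM))                                 # stand-in for the learnable CLS token
    tokens = np.concatenate([cls, tokens], axis=0)                 # (197, 768)
    pos = rng.normal(0, 0.02, tokens.shape)                        # stand-in positional embeddings
    return tokens + pos


def self_attention(tokens, rng, head_dim=64):
    """Single-head scaled dot-product self-attention over the token sequence."""
    d = tokens.shape[1]
    w_q, w_k, w_v = (rng.normal(0, 0.02, (d, head_dim)) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(head_dim)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
image = rng.random((IMAGE_SIZE, IMAGE_SIZE, CHANNELS))
tokens = embed(image, rng)
attended, weights = self_attention(tokens, rng)
print(patchify(image).shape)   # (196, 768)
print(tokens.shape)            # (197, 768)
print(attended.shape)          # (197, 64)
```
]]></code>
          <p>Each row of the attention weight matrix sums to one, so every token, including the CLS token, is updated with a weighted mixture of information from all 197 positions. This is the mechanism behind the long-range dependencies discussed above.</p>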
        </sec>
        <sec id="S3.SS3.SSS3">
          <label>3.3.3</label>
          <title>ResNet-50 + Vision Transformer (ViT) Hybrid Model</title>
          <p>
            <fig id="F4">
              <label>Figure 4.</label>
              <caption>
                <p>ResNet-50 + ViT Architecture.</p>
              </caption>
              <graphic xlink:href="fig4.png"/>
            </fig>
          </p>
          <p id="S3.SS3.SSS3.p1">The ResNet-50 + ViT model, shown in Figure <xref ref-type="fig" rid="F4">4</xref>, leverages the strengths of Convolutional Neural Networks (CNNs) in local feature extraction and of Vision Transformers (ViTs) in capturing global contextual information. The model has two main components:</p>
          <p id="S3.SS3.SSS3.p2">CNN Feature Extractor: The convolutional backbone, ResNet-50, processes the input image to capture local features such as microaneurysms, hemorrhages, and exudates. The convolutional layers use batch normalization and ReLU activations, with max-pooling to down-sample the spatial dimensions. The feature maps produced by the CNN are flattened and reshaped into sequential patches that serve as input to the transformer module.</p>
          <p id="S3.SS3.SSS3.p3">ViT Transformer Encoder: The ViT block consumes the feature patches obtained from the CNN as a sequence and uses Multi-Head Self-Attention (MHSA) to learn long-range dependencies. A classification token (CLS token) and position embeddings help the model learn relationships across different retinal areas. The final classification layer consists of a fully connected layer and a softmax activation function for predicting diabetic retinopathy severity.</p>
          <p id="S3.SS3.SSS3.p4">This combination strategy effectively integrates CNNs to learn fine-grained spatial features and the self-attention mechanism of ViT to learn global dependencies, resulting in improved classification accuracy.</p>
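          <p>The hand-off between the two components can be sketched as follows. The 7 × 7 × 2048 feature-map shape is the usual ResNet-50 output for a 224 × 224 input and is an assumption here, as is the 768-dimensional token size; the projection weights are random stand-ins and the function name is hypothetical.</p>
          <code language="python"><![CDATA[
```python
import numpy as np

EMBED_DIM = 768  # assumed transformer token size


def feature_map_to_tokens(feature_map, w_proj, cls_token):
    """Flatten an (H, W, C) CNN feature map into H*W tokens, project them, add a CLS token."""
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c) @ w_proj       # (H*W, EMBED_DIM)
    return np.concatenate([cls_token, tokens], axis=0)    # (H*W + 1, EMBED_DIM)


rng = np.random.default_rng(0)
feature_map = rng.random((7, 7, 2048))             # assumed backbone output
w_proj = rng.normal(0, 0.02, (2048, EMBED_DIM))    # random stand-in for a learned projection
cls_token = np.zeros((1, EMBED_DIM))               # stand-in for a learnable CLS token
sequence = feature_map_to_tokens(feature_map, w_proj, cls_token)
print(sequence.shape)  # (50, 768)
```
]]></code>
          <p>Under these assumptions the transformer attends over 50 tokens instead of the 197 produced by raw 16 × 16 image patches, since each token already summarizes a local region extracted by the CNN.</p>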
          <p>
            <table-wrap id="T1">
              <label>Table 1</label>
              <caption>
                <p>Comparison of all four models.</p>
              </caption>
              <table>
                <thead>
                  <tr>
                    <th style="border-top: 1px solid black;" align="left">Model</th>
                    <th style="border-top: 1px solid black;" align="center">Training Accuracy</th>
                    <th style="border-top: 1px solid black;" align="center">Training Loss</th>
                    <th style="border-top: 1px solid black;" align="center">Validation Accuracy</th>
                    <th style="border-top: 1px solid black;" align="center">Validation Loss</th>
                  </tr>
                </thead>
                <tbody>
                  <tr>
                    <th style="border-top: 1px solid black;" align="left">ResNet-50</th>
                    <td style="border-top: 1px solid black;" align="center">98.72%</td>
                    <td style="border-top: 1px solid black;" align="center">0.0363</td>
                    <td style="border-top: 1px solid black;" align="center">75.40%</td>
                    <td style="border-top: 1px solid black;" align="center">1.6320</td>
                  </tr>
                  <tr>
                    <th align="left">ViT</th>
                    <td align="center">99.93%</td>
                    <td align="center">0.0030</td>
                    <td align="center">83.90%</td>
                    <td align="center">1.2849</td>
                  </tr>
                  <tr>
                    <th align="left">ResNet-50 + ViT</th>
                    <td align="center">99.26%</td>
                    <td align="center">0.0235</td>
                    <td align="center">88.40%</td>
                    <td align="center">0.8906</td>
                  </tr>
                  <tr>
                    <th style="border-bottom: 1px solid black;" align="left">EfficientNet-B0 + ViT</th>
                    <td style="border-bottom: 1px solid black;" align="center">98.91%</td>
                    <td style="border-bottom: 1px solid black;" align="center">0.0331</td>
                    <td style="border-bottom: 1px solid black;" align="center">87.00%</td>
                    <td style="border-bottom: 1px solid black;" align="center">0.8766</td>
                  </tr>
                </tbody>
              </table>
            </table-wrap>
          </p>
        </sec>
        <sec id="S3.SS3.SSS4">
          <label>3.3.4</label>
          <title>EfficientNet-B0 + ViT Hybrid Model</title>
          <p>
            <fig id="F5">
              <label>Figure 5.</label>
              <caption>
                <p>EfficientNet-B0 + ViT Architecture.</p>
              </caption>
              <graphic xlink:href="fig5.png"/>
            </fig>
          </p>
          <p id="S3.SS3.SSS4.p1">The EfficientNet-B0 + ViT model, shown in Figure <xref ref-type="fig" rid="F5">5</xref>, combines EfficientNet-B0 with a Vision Transformer (ViT) to form a lightweight yet effective framework for detecting diabetic retinopathy. EfficientNet-B0 uses compound scaling to balance depth, width, and resolution, aiming for high accuracy at low computational cost. It begins with a <inline-formula><mml:math alttext="3\times 3" display="inline"><mml:mrow><mml:mn>3</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:math></inline-formula> convolutional stem followed by batch normalization and Swish activation, and employs MBConv blocks with squeeze-and-excitation (SE) mechanisms to extract features of retinal abnormalities. The extracted feature maps are flattened into a sequence of patches and fed into the ViT transformer encoder, which applies Multi-Head Self-Attention (MHSA), feed-forward layers, and layer normalization to capture global dependencies. A classification token (CLS token) aggregates information from the sequence and is passed to a fully connected layer with softmax activation, which assigns the image to one of five grades of diabetic retinopathy severity. By combining local feature extraction with global contextual learning, this hybrid model improves classification accuracy at low computational cost.</p>
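          <p>The compound scaling rule mentioned above can be sketched as follows. The base coefficients (1.2, 1.1, 1.15) are those reported in the original EfficientNet paper (Tan and Le, 2019), not values stated in this study, and the function name <monospace>compound_scale</monospace> is hypothetical. EfficientNet-B0 corresponds to the baseline case, a compound coefficient of 0, in which no scaling is applied.</p>
          <code language="python">
```python
# Base coefficients from the EfficientNet paper (assumed here, not from this study).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi, base_resolution=224):
    """Return depth/width multipliers, input resolution, and relative FLOPs for phi."""
    return {
        "depth_multiplier": ALPHA ** phi,
        "width_multiplier": BETA ** phi,
        "resolution": round(base_resolution * GAMMA ** phi),
        # FLOPs grow roughly with depth * width^2 * resolution^2.
        "relative_flops": (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi,
    }


b0 = compound_scale(0)      # baseline: all multipliers are 1.0, resolution 224
scaled = compound_scale(1)  # one scaling step costs roughly twice the FLOPs
```
          </code>
          <p>The rule is a guideline: published larger EfficientNet variants use rounded resolutions and layer counts, so the values returned for a nonzero coefficient should be read as approximate.</p>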
        </sec>
      </sec>
      <sec id="S3.SS4">
        <label>3.4</label>
        <title>Training Process</title>
        <p id="S3.SS4.p1">Training the ResNet-50, ViT, ResNet-50 + ViT, and EfficientNet-B0 + ViT models involves data preprocessing, augmentation, optimization, and evaluation. First, images are resized to <inline-formula><mml:math alttext="224\times 224" display="inline"><mml:mrow><mml:mn>224</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>224</mml:mn></mml:mrow></mml:math></inline-formula> and normalized to <inline-formula><mml:math alttext="[0,1]" display="inline"><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:math></inline-formula>; they are then augmented with random flips, rotations (<inline-formula><mml:math alttext="\pm 20^{\circ}" display="inline"><mml:mrow><mml:mo>±</mml:mo><mml:msup><mml:mn>20</mml:mn><mml:mo>∘</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula>), zoom (0.8× to 1.2×), contrast adjustment, and additive Gaussian noise. The dataset is split into 70% training, 15% validation, and 15% testing. The models are optimized using the Adam optimizer (learning rate = <inline-formula><mml:math alttext="0.0001" display="inline"><mml:mn>0.0001</mml:mn></mml:math></inline-formula>, decayed by cosine annealing) with categorical cross-entropy loss and class weighting.
Training is performed with a batch size of 64 for 50 epochs with dropout (<inline-formula><mml:math alttext="0.5" display="inline"><mml:mn>0.5</mml:mn></mml:math></inline-formula>), L2 weight decay (<inline-formula><mml:math alttext="10^{-4}" display="inline"><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>−</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula>), batch normalization (for CNNs), and layer normalization (for ViTs). Gradient clipping (max norm = 1) is applied to the ViTs to prevent exploding gradients. During the forward pass, the CNNs extract local hierarchical features and the ViTs use self-attention to learn global dependencies. The output logits are passed through softmax, and the loss is backpropagated using automatic differentiation. We track training and validation loss and accuracy, and apply early stopping if there is no improvement for 10 epochs. A learning rate scheduler reduces the learning rate by a factor of 0.1 whenever it detects a plateau, to achieve stable convergence.</p>
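        <p>Two elements of this procedure, cosine annealing of the learning rate and early stopping with a patience of 10 epochs, can be sketched in plain Python. This is a minimal sketch under stated assumptions rather than the training code used in this study: the minimum learning rate of 0, the choice of validation loss as the monitored quantity, and the names <monospace>cosine_annealed_lr</monospace> and <monospace>EarlyStopping</monospace> are all hypothetical, and the loss values are synthetic.</p>
        <code language="python">
```python
import math


def cosine_annealed_lr(epoch, total_epochs=50, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (no restarts)."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


class EarlyStopping:
    """Signal a stop once validation loss has not improved for 'patience' epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = math.inf
        self.stale_epochs = 0

    def should_stop(self, val_loss):
        if self.best_loss > val_loss:
            # New best loss: remember it and reset the counter.
            self.best_loss = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Synthetic validation losses for illustration only (not results from this study):
# two improving epochs followed by a plateau.
toy_val_losses = [1.0, 0.9] + [0.95] * 48
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, val_loss in enumerate(toy_val_losses):
    lr = cosine_annealed_lr(epoch)  # rate the optimizer would use this epoch
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
```
        </code>
        <p>With these synthetic losses, the best value occurs at epoch 1 and the loop stops at epoch 11, after ten consecutive epochs without improvement.</p>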
      </sec>
    </sec>
    <sec id="S4">
      <label>4.</label>
      <title>Results</title>
      <p id="S4.p1">This section presents the results of the four models, namely ResNet-50, Vision Transformer (ViT), ResNet-50 + ViT, and EfficientNet-B0 + ViT, on diabetic retinopathy classification. The models were trained on the dataset, and their performance, summarized in Table <xref rid="T1" ref-type="table">1</xref>, was evaluated in terms of training and validation accuracy and loss.</p>
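      <p>The comparison in Table 1 can be summarized with a short script. The sketch below copies the accuracy and loss values from Table 1 and computes, for each model, the gap between training and validation accuracy as a simple indicator of overfitting; the variable names are illustrative.</p>
      <code language="python">
```python
# (training accuracy %, training loss, validation accuracy %, validation loss) from Table 1
results = {
    "ResNet-50": (98.72, 0.0363, 75.40, 1.6320),
    "ViT": (99.93, 0.0030, 83.90, 1.2849),
    "ResNet-50 + ViT": (99.26, 0.0235, 88.40, 0.8906),
    "EfficientNet-B0 + ViT": (98.91, 0.0331, 87.00, 0.8766),
}

# Train-validation accuracy gap in percentage points (a larger gap suggests more overfitting).
accuracy_gap = {
    name: round(train_acc - val_acc, 2)
    for name, (train_acc, _, val_acc, _) in results.items()
}

best_val_accuracy = max(results, key=lambda name: results[name][2])
lowest_val_loss = min(results, key=lambda name: results[name][3])

for name, gap in accuracy_gap.items():
    print(f"{name}: gap = {gap:.2f} points")
print("Highest validation accuracy:", best_val_accuracy)
print("Lowest validation loss:", lowest_val_loss)
```
      </code>
      <p>On these numbers, ResNet-50 + ViT has the highest validation accuracy and the smallest train-validation gap (10.86 percentage points), while EfficientNet-B0 + ViT has the lowest validation loss (0.8766).</p>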
      <sec id="S4.SS1">
        <label>4.1</label>
        <title>ResNet-50</title>
        <p id="S4.SS1.p1">The ResNet-50 model was trained on the dataset and its performance was evaluated. It achieved a training accuracy of 98.72%, a training loss of 0.0363, a validation accuracy of 75.40%, and a validation loss of 1.6320. These results are plotted in Figure <xref ref-type="fig" rid="F6">6</xref>.</p>
        <p>
          <fig id="F6">
            <label>Figure 6.</label>
            <caption>
              <p>ResNet-50.</p>
            </caption>
            <graphic xlink:href="fig6.png"/>
          </fig>
        </p>
      </sec>
      <sec id="S4.SS2">
        <label>4.2</label>
        <title>Vision Transformer (ViT)</title>
        <p id="S4.SS2.p1">The ViT model was trained on the dataset and its performance was evaluated. It achieved a training accuracy of 99.93%, a training loss of 0.0030, a validation accuracy of 83.90%, and a validation loss of 1.2849. These results are plotted in Figure <xref ref-type="fig" rid="F7">7</xref>.</p>
        <p>
          <fig id="F7">
            <label>Figure 7.</label>
            <caption>
              <p>ViT.</p>
            </caption>
            <graphic xlink:href="fig7.png"/>
          </fig>
        </p>
      </sec>
      <sec id="S4.SS3">
        <label>4.3</label>
        <title>ResNet-50 + Vision Transformer (ViT) Hybrid Model</title>
        <p id="S4.SS3.p1">The hybrid ResNet-50 + ViT model was trained on the dataset and its performance was evaluated. It achieved a training accuracy of 99.26%, a training loss of 0.0235, a validation accuracy of 88.40%, and a validation loss of 0.8906. These results are plotted in Figure <xref ref-type="fig" rid="F8">8</xref>.</p>
        <p>
          <fig id="F8">
            <label>Figure 8.</label>
            <caption>
              <p>ResNet-50 + ViT.</p>
            </caption>
            <graphic xlink:href="fig8.png"/>
          </fig>
        </p>
      </sec>
      <sec id="S4.SS4">
        <label>4.4</label>
        <title>EfficientNet-B0 + ViT Hybrid Model</title>
        <p id="S4.SS4.p1">The hybrid EfficientNet-B0 + ViT model was trained on the dataset and its performance was evaluated. It achieved a training accuracy of 98.91%, a training loss of 0.0331, a validation accuracy of 87.00%, and a validation loss of 0.8766. These results are plotted in Figure <xref ref-type="fig" rid="F9">9</xref>.</p>
        <p>
          <fig id="F9">
            <label>Figure 9.</label>
            <caption>
              <p>EfficientNet-B0 + ViT.</p>
            </caption>
            <graphic xlink:href="fig9.png"/>
          </fig>
        </p>
      </sec>
    </sec>
    <sec id="S5">
      <label>5.</label>
      <title>Conclusion</title>
      <p id="S5.p1">The results of this study demonstrate that hybrid models combining CNNs and Vision Transformers outperform standalone models in diabetic retinopathy classification, which is a complex medical image classification task. Among the four models tested, the ResNet-50 + ViT hybrid emerged as the best-performing model, achieving the highest validation accuracy (88.40%) with a validation loss (0.8906) close to the lowest observed. This model's ability to leverage both local feature extraction and global context understanding enables it to capture subtle and complex patterns in retinal images, making it the most effective of the models tested for detecting diabetic retinopathy.</p>
      <p id="S5.p2">The EfficientNet-B0 + ViT hybrid model, while slightly less accurate, still performed well, with a validation accuracy of 87.00% and the lowest validation loss of the four models (0.8766). Its advantage lies in efficiency: its reduced computational cost makes it well suited to resource-constrained environments, such as mobile or edge computing devices, without sacrificing much classification accuracy.</p>
      <p id="S5.p3">In comparison, the standalone CNN model (ResNet-50) struggled with generalization, achieving lower validation accuracy (75.40%) and higher validation loss (1.6320), while the standalone ViT model showed better performance (83.90% validation accuracy), demonstrating the importance of global context learning in this task.</p>
      <p id="S5.p4">Overall, this study confirms that combining CNNs with Vision Transformers provides a robust and efficient solution for diabetic retinopathy classification, leveraging the strengths of both architectures. These hybrid models represent a promising direction for future research and practical deployment in medical image analysis, particularly in diagnosing diabetic retinopathy, where both local abnormalities and global contextual information are critical for accurate detection. Future work could explore additional optimizations, such as hybridizing other state-of-the-art models or incorporating attention mechanisms that are specifically designed for medical image tasks.</p>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p id="ack.p1">This work received no funding.</p>
    </ack>
    <sec id="sec0100" sec-type="COI-statement">
      <title>Conflict of interest</title>
      <p>The authors declare no conflicts of interest.</p>
    </sec>
    <ref-list>
      <title>References</title>
      <ref id="ref001">
        <label>[1]</label>
        <mixed-citation> Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N., &amp; Folk, J. C. (2018). Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. <italic>NPJ digital medicine, 1</italic>(1), 39. [<uri>https://doi.org/10.1038/s41746-018-0040-6</uri>] </mixed-citation>
      </ref>
      <ref id="ref002">
        <label>[2]</label>
        <mixed-citation> Ting, D. S. W., Cheung, G. C. M., &amp; Wong, T. Y. (2016). Diabetic retinopathy: global prevalence, major risk factors, screening practices and public health challenges: a review. <italic>Clinical &amp; experimental ophthalmology, 44</italic>(4), 260-277. [<uri>https://doi.org/10.1111/ceo.12696</uri>] </mixed-citation>
      </ref>
      <ref id="ref003">
        <label>[3]</label>
        <mixed-citation> S.D. Karthik Maggie. APTOS 2019 Blindness Detection. <italic>Kaggle</italic> (2019). </mixed-citation>
      </ref>
      <ref id="ref004">
        <label>[4]</label>
        <mixed-citation> Agarwal, R. (2023, November). Diabetic retinopathy segmentation in IDRiD using enhanced U-Net. In <italic>2023 International Conference on Ambient Intelligence, Knowledge Informatics and Industrial Electronics (AIKIIE)</italic> (pp. 1-6). IEEE. [<uri>https://doi.org/10.1109/AIKIIE60097.2023.10390434</uri>] </mixed-citation>
      </ref>
      <ref id="ref005">
        <label>[5]</label>
        <mixed-citation> Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … &amp; Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. <italic>arXiv preprint arXiv:2010.11929</italic>. [<uri>https://arxiv.org/abs/2010.11929</uri>] </mixed-citation>
      </ref>
      <ref id="ref006">
        <label>[6]</label>
        <mixed-citation> Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., … &amp; Webster, D. R. (2016). Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. <italic>JAMA, 316</italic>(22), 2402-2410. [<uri>https://doi.org/10.1001/jama.2016.17216</uri>] </mixed-citation>
      </ref>
      <ref id="ref007">
        <label>[7]</label>
        <mixed-citation> Zhai, X., Kolesnikov, A., Houlsby, N., &amp; Beyer, L. (2022). Scaling vision transformers. In <italic>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</italic> (pp. 12104-12113). [<uri>https://doi.org/10.48550/arXiv.2106.04560</uri>] </mixed-citation>
      </ref>
      <ref id="ref008">
        <label>[8]</label>
        <mixed-citation> Kobat, S. G., Baygin, N., Yusufoglu, E., Baygin, M., Barua, P. D., Dogan, S., … &amp; Acharya, U. R. (2022). Automated diabetic retinopathy detection using horizontal and vertical patch division-based pre-trained DenseNET with digital fundus images. <italic>Diagnostics, 12</italic>(8), 1975. [<uri>https://doi.org/10.3390/diagnostics12081975</uri>] </mixed-citation>
      </ref>
      <ref id="ref009">
        <label>[9]</label>
        <mixed-citation> Tanlikesmath. Diabetic Retinopathy Detection Competition Dataset Resized/Cropped (2019). </mixed-citation>
      </ref>
      <ref id="ref010">
        <label>[10]</label>
        <mixed-citation> Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … &amp; Polosukhin, I. (2017). Attention is all you need. <italic>Advances in neural information processing systems, 30</italic>. [<uri>https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref011">
        <label>[11]</label>
        <mixed-citation> Li, T., Gao, Y., Wang, K., Guo, S., Liu, H., &amp; Kang, H. (2019). Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. <italic>Information Sciences, 501</italic>, 511-522. [<uri>https://doi.org/10.1016/j.ins.2019.06.011</uri>] </mixed-citation>
      </ref>
      <ref id="ref012">
        <label>[12]</label>
        <mixed-citation> Staal, J., Abràmoff, M. D., Niemeijer, M., Viergever, M. A., &amp; Van Ginneken, B. (2004). Ridge-based vessel segmentation in color images of the retina. <italic>IEEE transactions on medical imaging, 23</italic>(4), 501-509. [<uri>https://doi.org/10.1109/TMI.2004.825627</uri>] </mixed-citation>
      </ref>
      <ref id="ref013">
        <label>[13]</label>
        <mixed-citation> Quellec, G., Charriere, K., Boudi, Y., Cochener, B., &amp; Lamard, M. (2017). Deep image mining for diabetic retinopathy screening. <italic>Medical image analysis, 39</italic>, 178-193. [<uri>https://doi.org/10.1016/j.media.2017.04.012</uri>] </mixed-citation>
      </ref>
      <ref id="ref014">
        <label>[14]</label>
        <mixed-citation> Szegedy, C., Ioffe, S., Vanhoucke, V., &amp; Alemi, A. (2017, February). Inception-v4, inception-resnet and the impact of residual connections on learning. In <italic>Proceedings of the AAAI conference on artificial intelligence</italic> (Vol. 31, No. 1). [<uri>https://doi.org/10.1609/aaai.v31i1.11231</uri>] </mixed-citation>
      </ref>
      <ref id="ref015">
        <label>[15]</label>
        <mixed-citation> Decencière, E., Zhang, X., Cazuguel, G., Lay, B., Cochener, B., Trone, C., … &amp; Klein, J. C. (2014). Feedback on a publicly distributed image database: the Messidor database. <italic>Image Analysis &amp; Stereology</italic>, 231-234. [<uri>https://dx.doi.org/10.5566/ias.1155</uri>] </mixed-citation>
      </ref>
      <ref id="ref016">
        <label>[16]</label>
        <mixed-citation> Sugeno, A., Ishikawa, Y., Ohshima, T., &amp; Muramatsu, R. (2021). Simple methods for the lesion detection and severity grading of diabetic retinopathy by image processing and transfer learning. <italic>Computers in biology and medicine, 137</italic>, 104795. [<uri>https://doi.org/10.1016/j.compbiomed.2021.104795</uri>] </mixed-citation>
      </ref>
      <ref id="ref017">
        <label>[17]</label>
        <mixed-citation> Usman, T. M., Saheed, Y. K., Ignace, D., &amp; Nsang, A. (2023). Diabetic retinopathy detection using principal component analysis multi-label feature extraction and classification. <italic>International Journal of Cognitive Computing in Engineering, 4</italic>, 78-88. [<uri>https://doi.org/10.1016/j.ijcce.2023.02.002</uri>] </mixed-citation>
      </ref>
      <ref id="ref018">
        <label>[18]</label>
        <mixed-citation> Willis, J. R., Doan, Q. V., Gleeson, M., Haskova, Z., Ramulu, P., Morse, L., &amp; Cantrell, R. A. (2017). Vision-related functional burden of diabetic retinopathy across severity levels in the United States. <italic>JAMA ophthalmology, 135</italic>(9), 926-932. [<uri>https://doi.org/10.1001/jamaophthalmol.2017.2553</uri>] </mixed-citation>
      </ref>
      <ref id="ref019">
        <label>[19]</label>
        <mixed-citation> Porwal, P., Pachade, S., Kamble, R., Kokare, M., Deshmukh, G., Sahasrabuddhe, V., &amp; Meriaudeau, F. (2018). Indian diabetic retinopathy image dataset (IDRiD): a database for diabetic retinopathy screening research. <italic>Data, 3</italic>(3), 25. [<uri>https://doi.org/10.3390/data3030025</uri>] </mixed-citation>
      </ref>
      <ref id="ref020">
        <label>[20]</label>
        <mixed-citation> Wu, Y., Xia, Y., Song, Y., Zhang, Y., &amp; Cai, W. (2020). NFN+: A novel network followed network for retinal vessel segmentation. <italic>Neural Networks, 126</italic>, 153-162. [<uri>https://doi.org/10.1016/j.neunet.2020.02.018</uri>] </mixed-citation>
      </ref>
      <ref id="ref021">
        <label>[21]</label>
        <mixed-citation> Hu, J., Shen, L., &amp; Sun, G. (2018). Squeeze-and-excitation networks. In <italic>Proceedings of the IEEE conference on computer vision and pattern recognition </italic>(pp. 7132-7141). [<uri>https://doi.org/10.48550/arXiv.1709.01507</uri>] </mixed-citation>
      </ref>
      <ref id="ref022">
        <label>[22]</label>
        <mixed-citation> Song, J., Zheng, Y., Wang, J., Zakir Ullah, M., &amp; Jiao, W. (2021). Multicolor image classification using the multimodal information bottleneck network (MMIB-Net) for detecting diabetic retinopathy. <italic>Optics Express, 29</italic>(14), 22732-22748. [<uri>https://doi.org/10.1364/OE.430508</uri>] </mixed-citation>
      </ref>
      <ref id="ref023">
        <label>[23]</label>
        <mixed-citation> Li, X., Hu, X., Yu, L., Zhu, L., Fu, C. W., &amp; Heng, P. A. (2019). CANet: cross-disease attention network for joint diabetic retinopathy and diabetic macular edema grading. <italic>IEEE transactions on medical imaging, 39</italic>(5), 1483-1493. [<uri>https://doi.org/10.1109/TMI.2019.2951844</uri>] </mixed-citation>
      </ref>
      <ref id="ref024">
        <label>[24]</label>
        <mixed-citation> Mo, J., Zhang, L., &amp; Feng, Y. (2018). Exudate-based diabetic macular edema recognition in retinal images using cascaded deep residual networks. <italic>Neurocomputing, 290</italic>, 161-171. [<uri>https://doi.org/10.1016/j.neucom.2018.02.035</uri>] </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>
