To apply machine learning algorithms to generate models capable of predicting mortality in COVID-19 patients, using a large platform of plasma inflammatory mediators.
Design: Prospective, descriptive cohort study.
Setting: Six intensive care units in two hospitals in Southern Brazil.
Patients: Patients aged > 18 years who were diagnosed with COVID-19 through reverse transcriptase reaction or rapid antigen test.
Interventions: None.
Main variables of interest: Demographic and clinical variables, 65 inflammatory biomarkers, mortality.
Results: Combinations of two or three proteins yield higher predictive value than individual proteins or the full set of 65 proteins. A proliferation-inducing ligand (APRIL) and cluster of differentiation 40 ligand (CD40L) consistently emerge among the highest-ranking combinations, suggesting a potential synergistic effect in predicting clinical outcomes. The network structure suggested a dysregulated immune response in non-survivors, characterized by the failure of regulatory cytokines to temper an overwhelming inflammatory reaction.
Conclusion: Our results highlight the value of feature selection and careful consideration of biomarker combinations to improve prediction accuracy in COVID-19 patients.
At the end of 2019, a new coronavirus, named SARS-CoV-2, emerged in China, triggering a health crisis of global proportions.1 COVID-19 spread around the world, affecting millions of people and causing unprecedented damage to the global economy.2 The development of vaccines has helped mitigate the social and economic effects of COVID-19.3 However, COVID-19 can still lead to hospitalisation and death. In this context, it is extremely important to identify biomarkers capable of predicting clinical outcomes related to the disease.
A hallmark of COVID-19 is a flawed host immune response, characterized by an excessive pro-inflammatory reaction known as the "cytokine storm." This unbalanced response leads to various complications, including acute respiratory distress syndrome,4 but the underlying mechanisms are yet to be defined. Several inflammatory biomarkers have been studied to address the relation between inflammation and COVID-19 severity. These include cytokines and coagulation and complement proteins,5–8 which have given insights into COVID-19 pathogenesis, improved patient prognostication, and supported the development of specific anti-cytokine therapies in severe COVID-19.9 Despite these advances, the identification of cytokines as biomarkers has not been consistent across studies,10 and prospective studies measuring large panels of inflammatory biomarkers covering different immune responses could improve our understanding of the pathogenesis of severe COVID-19. To this end, we applied machine learning algorithms to generate models capable of predicting mortality in COVID-19 patients, using a large platform of plasma inflammatory mediators.
Patients and methods
Study design
A prospective cohort study was conducted with patients admitted to 6 intensive care units (ICUs) in 2 hospitals in Southern Brazil between June 2020 and November 2020. The variants that predominated in Brazil during this period were Alpha, Zeta and Gamma.11 The study was performed according to the Declaration of Helsinki and the Brazilian National Health Council Resolution 466.
Participants
Patients aged > 18 years who were diagnosed with COVID-19 through reverse transcriptase reaction or rapid antigen test were consecutively included. We excluded patients with severe chronic diseases or conditions that alter the inflammatory response (such as long-term use of immunosuppressants, uncontrolled cancer, or uncontrolled human immunodeficiency virus infection), patients receiving palliative care, and patients with a life expectancy of less than 24 h as judged by the attending physician.
Procedures
Venous blood samples were collected within 24 h after ICU admission. Sociodemographic and clinical information was collected directly from the patient, their surrogate, or the electronic medical records. The levels of 65 proteins (see Supplementary material for full list) were measured using the Immune Monitoring 65-Plex Human ProcartaPlex™ in a Luminex MAGPIX® system. Final protein concentrations were calculated using the online Procarta Plex Analysis Application.
Statistical analysis
The raw data comprised the plasma concentrations of 65 proteins in 213 individuals. These data were divided into a training set containing data from 168 individuals (80.10%) and a validation set containing data from 42 individuals (19.90%). For more details on the generation of the training and validation sets, please refer to the Supplementary material.
For each marker combination, an Elastic Net regression model was trained on the training set. Predictions were then made on the test set, and a ROC curve was constructed to determine the threshold that optimizes the balance between sensitivity and specificity for outcome prediction. For more details on the regression, please refer to the Supplementary material.
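As an illustration of the threshold selection step, the sketch below picks the cut-off that maximizes Youden's J (sensitivity + specificity − 1). This is one common way to balance sensitivity and specificity on a ROC curve; whether the study used this exact criterion is an assumption, and the labels, scores and function name are hypothetical.

```python
from typing import List, Tuple


def youden_threshold(y_true: List[int], scores: List[float]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    Assumption: Youden's J is used as the balance criterion; the source only
    says the threshold balances sensitivity and specificity.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_j = float("-inf")
    best = (0.0, 0.0, 0.0)
    for threshold in sorted(set(scores)):
        # A case is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j = j
            best = (threshold, sensitivity, specificity)
    return best


# Hypothetical labels (1 = outcome of interest) and model scores, for illustration only.
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

threshold, sens, spec = youden_threshold(labels, scores)
print(threshold, sens, spec)
```

In practice the scores would come from the fitted regression model, and the chosen threshold would then be used to compute accuracy, PPV, NPV and F1-score.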
Network and centrality analysis
Spearman correlation coefficients and p-values were calculated between all pairs of proteins. The resulting correlation data were analysed in Cytoscape. For more details on the network analysis, please refer to the Supplementary material.
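The pairwise correlation step can be sketched as follows. This is a minimal standard-library illustration of Spearman's coefficient (Pearson correlation of average ranks) used to build an edge list. The protein names, concentrations and the 0.9 cut-off are hypothetical, and p-values and the Cytoscape import are omitted.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 to n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def correlation_edges(data: Dict[str, List[float]], min_abs_r: float) -> List[Tuple[str, str, float]]:
    """Return (protein, protein, r) for every pair with |r| >= min_abs_r."""
    edges = []
    for a, b in combinations(sorted(data), 2):
        r = spearman(data[a], data[b])
        if abs(r) >= min_abs_r:
            edges.append((a, b, r))
    return edges


# Hypothetical concentrations for three placeholder proteins in five patients.
data = {
    "protein_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "protein_b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "protein_c": [5.0, 3.0, 4.0, 1.0, 2.0],
}
edges = correlation_edges(data, min_abs_r=0.9)
print(edges)
```

An edge list of this shape (two node names plus a weight) is the kind of table that network tools typically accept as input.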
Results
The flowchart of patients is shown in Fig. 1. The demographic data, comorbidities, disease characteristics at hospital admission and clinical outcomes of the 213 patients in the analytical sample are listed in Table 1. The in-hospital mortality was 25% (53 of 213 patients), as we previously reported.8,12
Table 1. Clinical characteristics of included patients.
Characteristic | Survivor (n = 160) | Non-Survivor (n = 53) | p-value |
---|---|---|---|
Gender, male, n (%) | 61 (38) | 16 (30) | 0.32 |
Age, mean (SD) | 53 (15) | 62 (13) | <0.001 |
Need for mechanical ventilation, n (%) | 46 (29) | 37 (70) | <0.001 |
Thorax CT scan extension of lesions > 50%, n (%) | 51 (32) | 28 (53) | 0.023 |
Charlson comorbidity index, mean (SD) | 1.62 (1.5) | 2.34 (1.65) | 0.004 |
BMI, mean (SD) | 26 (10) | 29 (5) | 0.026 |
Corticosteroid use, n (%) | 122 (76) | 44 (83) | 0.30 |
SOFA at admission, mean (SD) | 2.78 (2.2) | 5.28 (3.6) | <0.001 |
SAPS III score, mean (SD) | 39 (25) | 58 (20) | <0.001 |
Respiratory SOFA, mean (SD) | 1.98 (1) | 2.43 (1) | 0.009 |
A full model incorporating all 65 proteins achieved a predictive accuracy of 72.1% for determining mortality, with a sensitivity of 69.5%, a positive predictive value (PPV) of 96.2%, a negative predictive value (NPV) of 35.3%, and an F1-score of 80.6%. To improve applicability in the clinical setting, we aimed to simplify the model by using a smaller number of proteins. We therefore trained a logistic regression model to identify markers with non-zero coefficients. This retained 62 of the initial 65 proteins; growth-regulated protein alpha (Gro-alpha), hepatocyte growth factor (HGF) and interferon-inducible T cell alpha chemoattractant (I-TAC) were removed. The new model showed a slight performance improvement, with an accuracy of 76.7%, a sensitivity of 75.0%, a PPV of 96.4%, an NPV of 40.0%, and an F1-score of 84.4%.
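For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, PPV, NPV and F1-score follow from the four cells of a confusion matrix. The counts are hypothetical: the study does not report its confusion matrix, and these values were chosen only so that the outputs fall near the percentages quoted above.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # true positive rate (recall)
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    npv = tn / (tn + fn)                   # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts, for illustration only (not taken from the study).
metrics = classification_metrics(tp=27, fp=1, tn=6, fn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The example also shows how a high PPV can coexist with a low NPV: a single false positive keeps the PPV high, while the false negatives lower both sensitivity and NPV.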
We further evaluated the predictive ability of each individual protein and observed that some individual proteins can generate models with predictive capacity similar to or even better than that of the full model. Fig. 2A shows the prediction parameters of the top 10 models trained with individual proteins. Interleukin (IL)-10, lipocalin blc, and nerve growth factor (NGF)-beta have higher average parameters than the more complex models. To investigate synergistic and possible interaction effects of these proteins, we determined the predictive ability of all possible combinations of 2–3 proteins (generating 39,773 combinations). Combinations of two proteins yield higher accuracy, sensitivity, and PPV than individual proteins (Fig. 2B). For example, the combination of cluster of differentiation 40 ligand (CD40L) and macrophage migration inhibitory factor (MIF) achieves an accuracy of 86.0% and an F1-score of 92.3%. Three-protein models further improve predictive performance, with the top combinations achieving an accuracy of 90.7% and an F1-score of 94.7% (Fig. 2C). Notably, a proliferation-inducing ligand (APRIL) and CD40L consistently emerge among the highest-ranking combinations, suggesting a potential synergistic effect in predicting clinical outcomes. In all three-protein combinations in which APRIL and CD40L appeared, both had positive regression coefficients (positive association with the outcome), with CD40L always having the higher one, and the third protein always had a negative regression coefficient (negative association with the outcome) (e-Figure 1). Of the 26 proteins that appeared at least once in these combinations, 14 were increased (e-Figure 2), 10 were unaltered (e-Figure 3) and only 2 were decreased (e-Figure 4) in non-survivors.
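The size of the search space can be checked with a short calculation. The reported total of 39,773 matches the number of one-, two- and three-protein subsets of the 62 retained proteins; that reading is an inference from the arithmetic, not something stated in the text. The helper name and the four-protein panel below are hypothetical.

```python
from itertools import combinations
from math import comb


def count_models(n_features: int, max_size: int) -> int:
    """Number of feature subsets of size 1..max_size drawn from n_features."""
    return sum(comb(n_features, k) for k in range(1, max_size + 1))


print(comb(62, 2))         # pairs of 62 proteins -> 1891
print(comb(62, 3))         # trios of 62 proteins -> 37820
print(count_models(62, 3)) # singles + pairs + trios -> 39773

# Enumerating the subsets themselves, shown on a small hypothetical panel.
panel = ["protein_a", "protein_b", "protein_c", "protein_d"]
subsets = [c for k in (1, 2, 3) for c in combinations(panel, k)]
print(len(subsets))        # 4 + 6 + 4 -> 14
```

Each enumerated subset would correspond to one model to train and evaluate, which is why an exhaustive search stays tractable at this panel size but grows quickly with larger subsets.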
Interestingly, CD40L, epithelial neutrophil-activating protein 78 (ENA-78), IL-1β, IL-31, and leukemia inhibitory factor (LIF) are included among the top predictive combinations despite not differing significantly between survivors and non-survivors. This finding highlights the importance of synergistic interactions between features in the development of predictive models. One plausible explanation for why several proteins appeared in the prediction models without differing significantly between survivors and non-survivors is that they correlate with markers of disease severity. We therefore constructed a correlation heatmap between these proteins and age, body mass index (BMI) and the severity of organ dysfunction (evaluated by the SOFA score), but found only a weak correlation between CD40L and BMI (e-Figure 5).
To better explore the interactions between these proteins, a correlation network was constructed for both survivors and non-survivors (Fig. 3A and B). The network in non-survivors (Fig. 3A) shows strong interconnections in a small core of proteins that included IL-2, IL-4, IL-6, IL-17, IL-18, interferon γ-induced protein 10 (IP-10), macrophage-derived chemokine (MDC) and MIF. All correlations in this core had a coefficient higher than 0.8 and highly significant p-values (lower than 0.00001). This core is connected to more distant proteins by IL-23, macrophage colony-stimulating factor (MCSF) and macrophage inflammatory protein (MIP)-1beta, which interact with each other with higher correlation coefficients. Interestingly, APRIL and CD40L, which appear among the highest-ranking combinations in the regression analysis, did not correlate directly. CD40L directly interacts only with TNF-related apoptosis-inducing ligand (TRAIL) (r = 0.79, p < 0.000001), and TRAIL interacts weakly with APRIL (r = 0.51, p = 0.00009). In the network from survivors (Fig. 3B), the central core has weaker correlations between proteins and included only one coefficient higher than 0.8 (IL-23 connecting with MCSF, r = 0.80). MIF has several interactions with coefficients around 0.75, including MIF-MDC, MIF-IL-4, MIF-IL-17A and MIF-IL-18. Interestingly, CD40L again interacts only with TRAIL, but with a weaker correlation coefficient (r = 0.61), and the CD40L-TRAIL pair did not connect with the rest of the network in survivors. Additionally, thymic stromal lymphopoietin (TSLP), IL-10, Eotaxin and ENA-78 emerged exclusively in non-survivors, indicating increased diversity in the co-activation of inflammatory signalling responses. These correlations illustrate the complex interplay among inflammatory markers, reflecting immune response pathways that affect survivors and non-survivors differently. A more detailed analysis of the network structure is provided in the Supplementary material.
Correlation network.
A correlation network was constructed using the 26 most relevant proteins for survival prediction in (A) survivors and (B) non-survivors among COVID-19 patients.
The edges between individual proteins represent their correlation index. The strength of the correlation is indicated by a colour gradient, with shades of red representing positive correlations and shades of blue representing negative correlations. Stronger correlations are depicted with more intense colours, while weaker correlations appear in lighter shades.
Our results reveal a differentiated interaction between individual biomarkers, pairs, and trios, and show that specific combinations have the potential to outperform individual proteins and even highly complex models. Of the 65 proteins measured, only a few consistently emerge among the highest-ranking combinations. Furthermore, out of 39,773 possible protein combinations, approximately 20 were selected for their high accuracy. Additionally, comparing the inflammatory protein networks of survivors and non-survivors gives us insights into the pathophysiology of COVID-19.
The association between plasma levels of different inflammatory mediators and COVID-19 severity and mortality is well described in the literature. A recent meta-analysis determined the accuracy of routine laboratory tests to predict mortality, but none of the evaluated tests had enough accuracy to support its use in clinical practice.13 Therefore, various approaches should be explored to improve prognostication and to elucidate the pathophysiological pathways that are relevant to disease progression. Filbin et al.14 analysed 1472 plasma proteins to predict disease severity (death or need for mechanical ventilation), with an area under the ROC curve of 0.85. Among the most strongly weighted proteins in the predictor were IL-6, IL-1RL1, pentraxin (PTX)-3, and IL-1RN. Su et al.15 tested 4701 circulating proteins and found that a proteomic model was able to predict severe COVID-19 (defined as individuals who died or required any form of oxygen supplementation) and critical COVID-19 (defined as individuals who died or experienced severe respiratory failure requiring non-invasive ventilation, high flow oxygen therapy, intubation, or extracorporeal membrane oxygenation) with areas under the ROC curve of 86% and 80%, respectively. To predict severe COVID-19, the best-performing protein model selected 92 proteins, and to predict critical COVID-19, 67 proteins were selected. Although these results are significant, the feasibility of conducting large-scale proteomic analyses in the current clinical setting remains limited. Therefore, we opted for a more cost-effective and widely accessible technique, employing a commercially available Luminex kit to quantify well-established inflammatory mediators.
The most relevant finding of our study is that models including 2 or 3 proteins perform better than more complex models with up to 65 proteins, and that some of these proteins did not even differ individually between survivors and non-survivors. Most of the protein combinations demonstrating higher performance included APRIL and CD40L, both of which exhibited positive regression coefficients. APRIL is a member of the TNF ligand family and is important for B cell development, being implicated in the development of immune-mediated diseases including IgA nephropathy16 and asthma.17 Mucosal immunity is important during the response to SARS-CoV-2,18 and APRIL is crucial in activating local B-cell responses and antibody production and in promoting the formation of inducible bronchus-associated lymphoid tissues.19 During respiratory viral infections APRIL is upregulated,20,21 and it was increased in the plasma of patients with COVID-19.22 In contrast, scRNA-seq data of cells from BALF demonstrated that macrophages expressed higher levels of B cell activating factor (BAFF) but not of APRIL in severe forms of disease.23 Thus, the APRIL-BAFF system seems to be relevant in SARS-CoV-2 infection, and our results reinforce this idea. APRIL is increased in non-survivors and has a positive regression coefficient in several different 2- or 3-protein models. In addition, in the network from non-survivors APRIL has a central role with widespread influence across the network.
CD40L is a protein that is primarily expressed on activated T cells and, like APRIL, is a member of the TNF superfamily of molecules. CD40L/CD40 interactions exert profound effects on both innate and acquired immunity, and although B-cell activation can occur in the absence of CD40 signals, many immune functions are defective in the absence of this interaction.24 During viral infections the CD40 pathway restricts virus replication.25,26 In COVID-19 patients CD40L is significantly elevated in the early stages of the disease.27,28 Furthermore, consistent with what we observed here, CD40L levels were not elevated in more severe forms of COVID-19 compared to milder cases.29 This underscores the central conclusion of our findings: prognostic and pathophysiological studies of biomarkers should prioritize protein-protein interactions rather than focusing solely on individual protein levels. Our network analysis suggested a relevant interaction between CD40L and TRAIL when comparing survivors and non-survivors. TRAIL may have a dual function in the immune system, both killing virally infected cells and regulating cytokine production.30 TRAIL expression increases after infection with influenza, and in a mouse model treatment with an anti-TRAIL antibody decreased viral clearance,31 suggesting that TRAIL has an in vivo antiviral effect in influenza infection. In COVID-19 patients, lower TRAIL levels were associated with worse outcomes, emphasizing its role in controlling viral clearance.32 We could find reports of an interaction between CD40L and TRAIL only in the cancer literature, where co-expression of the two proteins improved antitumour responses.33
Another relevant protein in our network analysis, which appeared in several of the three-protein models with higher accuracy, is LIF. LIF is a key stem cell growth factor required to maintain the function of the blood–air barrier during infections. In animal models of pneumonia LIF was found to protect against severe disease,34,35 and recombinant therapeutic LIF prevented acute respiratory distress syndrome.36 Recently, LIF was shown to be central in the regulation of tissue-localized versus systemic immunity and in the balance between allergen and viral responsiveness in the lungs, resulting in enhanced antiviral immunity.37 In a Mendelian randomization analysis of 91 circulating inflammation-related proteins, LIF was one of four proteins found to confer risk of critical COVID-19.38
The primary strength of our study lies in the application of a robust machine learning tool to identify the most significant protein-protein interactions that offer the highest prognostic value, utilizing a commercially available Luminex kit. Furthermore, our network analysis provides deeper insights into the complexity of cytokine interactions when comparing survivors and non-survivors among COVID-19 patients. However, it is difficult to determine whether these results would be reproducible in contemporary COVID-19 patients. Our sample was collected at the beginning of the pandemic, when no patient had been vaccinated. Thus, differences in SARS-CoV-2 variants and vaccination status could attenuate or even modify the inflammatory response. In a large cohort of individuals with symptomatic COVID-19, vaccinated individuals had lower concentrations of cytokines and chemokines than unvaccinated individuals from 1 to 8 days after symptom onset up to 90 days post-enrollment.39 Since disease severity and mortality in COVID-19 are generally attributed to an excessive inflammatory response, as suggested here, the reduced levels of inflammatory mediators in vaccinated individuals provide a plausible mechanistic explanation for their lower symptom burden and mortality upon SARS-CoV-2 infection. However, this does not necessarily imply that vaccinated patients who develop severe disease do not exhibit exaggerated inflammatory responses. In vitro studies suggest that interferon responses to the Alpha, Beta, Delta, and Omicron SARS-CoV-2 variants are similar,40 although this is not a universal finding.41 Animal models of infection have indicated that the Alpha variant induces a more robust inflammatory response than the Beta variant,42 a finding further corroborated when comparing the Delta and Alpha variants.43 Thus, hyperinflammation appears to play a critical role in the development of severe COVID-19, regardless of vaccination status or viral variant. However, the magnitude of these responses may vary depending on these factors.
Several limitations should be considered when interpreting our results. First, both the training and validation cohorts were derived from the same population, and inflammatory responses may vary due to genetic and environmental factors, which may affect our results. Second, we did not validate protein levels using an alternative technique, such as ELISA, which would enhance the clinical relevance of our findings. Third, we did not include patients with bacterial or other pathogen-specific sepsis, which limits our ability to determine whether these responses are unique to COVID-19 or reflect a general response to severe infections. Fourth, our model did not incorporate interactions between clinical characteristics and the biomarkers. Including such interactions could provide valuable insights into the relationship between specific proteins and patients’ clinical characteristics; however, it would also increase the complexity of the analysis. A previously published study8 on this patient cohort demonstrated that the predictive performance of core clinical characteristics was inferior to that of the biomarkers presented here in predicting mortality.
In summary, our model offers a practical and efficient approach for clinical applications. It highlights the value of careful consideration of biomarker combinations to improve prediction accuracy. In particular, a smaller subset of proteins used as biomarkers shows remarkable predictive power, which could enable the development of cost-effective multiplex kits for widespread prediction of clinical outcomes, supporting healthcare decision-making and personalized treatment strategies.
CRediT authorship contribution statement
LFM made substantial contributions to the conception and design of the work and to the analysis and interpretation of data; drafted the work; and gave final approval of the version to be published.
HRD-P, GP, CSG, LS, DPG, GAW, CR made substantial contributions to the design of the work and to the acquisition and analysis of data, and gave final approval of the version to be published.
FD-P and JCFM made substantial contributions to the conception and design of the work and to the analysis and interpretation of data; revised the work for important intellectual content; and gave final approval of the version to be published.
Funding
MCTIC/CNPq/FNDCT/MS/SCTIE/DECIT, 07/2020, grant number 401263/2020-7.
BRF S.A. Hub unrestricted donation.
FAPERGS-PPSUS#21/2551-0000073-2.
Declaration of Generative AI and AI-assisted technologies in the writing process
During the preparation of this work the authors did not use generative AI or AI-assisted technologies.
The following are Supplementary data to this article:
Comparison of plasma levels of individual proteins that were increased in non-survivors compared with survivors among COVID-19 patients. Results are presented as median and interquartile range, along with minimum and maximum values. Mann-Whitney test statistics and p-values for the comparisons are displayed at the top of each individual illustration.
Comparison of plasma levels of individual proteins that did not differ between survivors and non-survivors among COVID-19 patients. Results are presented as median and interquartile range, along with minimum and maximum values. Mann-Whitney test statistics and p-values for the comparisons are displayed at the top of each individual illustration.
Comparison of plasma levels of individual proteins that were decreased in non-survivors compared with survivors among COVID-19 patients. Results are presented as median and interquartile range, along with minimum and maximum values. Mann-Whitney test statistics and p-values for the comparisons are displayed at the top of each individual illustration.
Heat map of the correlations between proteins that did not differ between survivors and non-survivors and clinical characteristics. The magnitude of each correlation is denoted with a colour, whereby the red colour indicates a positive correlation, and the blue colour indicates a negative correlation.