Survival analysis is a widely used branch of statistics that covers tasks in both statistical inference and prediction. Comparing survival curves, a typical inferential task, is usually done with the log-rank test or related approaches, while the time to an event of interest for a given individual is commonly predicted with the Cox proportional hazards model or similar models. However, all methods in the survival toolbox rest on relatively strict statistical assumptions, and violations of these assumptions may bias the results when the techniques are applied to real data. In this work, we address this limitation of the commonly used methods, in both inference and prediction, and improve on them using robust approaches, particularly machine-learning algorithms and the delta method. In general, machine-learning approaches do not require such strict assumptions to be met, which is why they may provide more robust alternatives to the traditional techniques. While the log-rank test or the Cox proportional hazards model (among others) may be used in survival analysis to compare two or more groups represented by their survival curves, we instead investigate tree-based methods for the same task and derive some new statistical properties of this approach.
Intuitively, a random forest containing a large proportion of trees with sufficient complexity, adjusted by tree pruning, can classify individuals from various groups into two or more classes depicted by their survival curves, which tends to reject the null hypothesis of no statistical difference between the curves. Thus, the proportion of trees with sufficient complexity that classify into two or more groups, depicted by their survival curves, is very close to a p-value estimate, by analogy with the output of the classical Wald's t-test in Cox regression. We call this analogue of the p-value the phi-value. Furthermore, the level of pruning applied to the decision trees in the random forest can reduce tree complexity and, therefore, change how often the random forest alternative falsely rejects the null hypothesis. Survival curves can also be compared approximately using confidence intervals around the Kaplan-Meier estimator of the survival probability in different groups. Using the delta method, we therefore adjust the formula for the variance of the Kaplan-Meier estimator for particular cases in which information about the event of interest is uncertain, e.g., not appropriately updated in time. Regarding prediction in survival analysis, the Cox model is limited by relatively strict statistical assumptions. We therefore propose decomposing the time-to-event variable into "time" and "event" components and using the latter as the target variable for various machine-learning classification algorithms, which, unlike the Cox model, are almost assumption-free. The time component is continuous and is used as one of the covariates, i.e., input variables, for classification algorithms such as logistic regression, naïve Bayes classifiers, decision trees, random forests, and artificial neural networks; the event component is binary and can thus be modeled by these classification algorithms.
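The time-to-event decomposition described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the ten records are invented, the second covariate `flag` is hypothetical, and a small hand-written logistic regression stands in for whichever classifier is actually used. It only shows the reshaping idea, with the observed time entering as an input feature and the binary event indicator serving as the classification target.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression.

    Returns the weights [bias, w_1, ..., w_p].
    """
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_event_probability(w, x):
    # Estimated probability that the event has occurred for feature vector x.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Hypothetical records: (time, flag, event). These values are made up.
records = [
    (1, 0, 0), (2, 1, 0), (3, 0, 1), (4, 1, 0), (5, 0, 1),
    (6, 1, 1), (7, 0, 0), (8, 1, 1), (9, 0, 1), (10, 1, 1),
]

# Decomposition: "time" becomes a covariate, "event" becomes the target.
max_time = max(t for t, _, _ in records)
X = [[t / max_time, flag] for t, flag, _ in records]  # time rescaled to (0, 1]
y = [event for _, _, event in records]

w = fit_logistic(X, y)

# Event probability for an individual with flag = 0 at an early and a late time.
p_early = predict_event_probability(w, [1 / max_time, 0])
p_late = predict_event_probability(w, [10 / max_time, 0])
print(f"P(event | t=1)  = {p_early:.3f}")
print(f"P(event | t=10) = {p_late:.3f}")
```

Because the invented events cluster at later times, the fitted probability is higher at t = 10 than at t = 1. Any of the other classifiers listed above could replace `fit_logistic` without changing the data layout.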
We further present simulations demonstrating how the rate at which the random-forest-based method falsely rejects the null hypothesis decreases with increasing tree pruning level. Finally, the adjusted Kaplan-Meier estimation and the time-to-event decomposition are applied to predict a decrease or non-decrease of IgG and IgM blood antibodies, respectively, against COVID-19 (SARS-CoV-2) below a laboratory cut-off, for a given individual at a given time point. Based on the analytical derivations, simulations, and real-world data applications, the introduced methods appear to enrich the family of alternatives for comparing survival curves and predicting time to event; moreover, some of them require only minimal statistical assumptions.
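For orientation, the standard (unadjusted) Kaplan-Meier estimator and its Greenwood variance can be sketched as below. This is the textbook baseline, not the adjusted formula derived in this work. Greenwood's formula itself follows from the delta method: the variance of log S(t) is approximated by the sum of d_i / (n_i (n_i - d_i)) over event times up to t, and Var[S(t)] is then approximately S(t)^2 times that sum. The function name and the five example observations are illustrative assumptions.

```python
import math

def kaplan_meier(times, events):
    """Kaplan-Meier estimate with Greenwood variance.

    times  : observed follow-up times
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (t, S(t), Var[S(t)]) for each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    greenwood_sum = 0.0
    result = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0        # events observed at time t
        leaving = 0  # subjects leaving the risk set at t (events + censored)
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            if n_at_risk > d:
                greenwood_sum += d / (n_at_risk * (n_at_risk - d))
                var = surv ** 2 * greenwood_sum
            else:
                # S(t) drops to 0 and Greenwood's term is undefined here.
                var = float("nan")
            result.append((t, surv, var))
        n_at_risk -= leaving
    return result

# Made-up example: five individuals, two of them censored (event = 0).
example_times = [1, 2, 3, 4, 5]
example_events = [1, 0, 1, 1, 0]
km = kaplan_meier(example_times, example_events)

for t, s, v in km:
    half_width = 1.96 * math.sqrt(v)  # pointwise 95% normal-approximation interval
    lower, upper = max(0.0, s - half_width), min(1.0, s + half_width)
    print(f"t={t}: S={s:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

Overlapping pointwise intervals for two groups give only an approximate comparison of their curves, which is the sense in which confidence intervals are used for that purpose above.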