
Survival Analysis Primer: Kaplan-Meier Curves, Log-rank Tests, and Cox Regression
A practical guide to survival analysis for clinical researchers — when to use it, how to prepare your data, and how to interpret KM curves and Cox regression results.
Your study compares outcomes between two groups of patients, and the endpoint is "time from surgery to recurrence." Some patients relapsed, some were still recurrence-free at the last follow-up, and some were lost to follow-up. You cannot simply use a t-test to compare average times between groups — because for the patients who did not relapse, you do not know what their true recurrence time would have been.
This is why you need survival analysis.
What is survival analysis?
Survival analysis is a family of statistical methods designed to handle time-to-event data. The "event" does not have to be death — it can be any outcome of interest:
- Tumor recurrence
- Disease progression
- Postoperative complication
- Graft failure
- Patient death
The core advantage of survival analysis is that it correctly handles censored data: individuals who have not yet experienced the event by the end of the observation period. If you exclude these patients, you discard real follow-up information and bias the estimates. If you treat their follow-up time as an event time, you underestimate survival. Survival analysis provides a mathematical framework to make proper use of this incomplete information.
What format does your data need?
For survival analysis, each patient needs at least two variables:
- Time variable: Duration from the starting point to event occurrence or censoring. The starting point is typically the date of diagnosis, surgery, or enrollment. Units can be days, months, or years, but must be consistent
- Status variable (event indicator): Marks whether the patient experienced the event. Typically coded as 1 = event occurred, 0 = censored
For example:
| Patient ID | Follow-up (months) | Event status | Group |
|---|---|---|---|
| 001 | 24 | 1 (recurrence) | Treatment |
| 002 | 36 | 0 (censored) | Control |
| 003 | 12 | 1 (recurrence) | Treatment |
| 004 | 30 | 0 (censored) | Treatment |
What counts as censoring?
- The patient has not experienced the event by the end of the study
- The patient was lost to follow-up
- The patient withdrew for reasons unrelated to the study event (e.g., relocation, refusal to continue)
The most common data preparation errors are inconsistent time calculations (some measured from diagnosis, others from surgery) and inaccurate censoring status. Verify carefully before starting analysis.
The Kaplan-Meier method
The Kaplan-Meier (KM) method is the most fundamental and widely used survival analysis tool. It estimates the survival function — the probability that a patient has not yet experienced the event at any given time point t.
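To make the estimate concrete, here is a minimal pure-Python sketch of the product-limit calculation: at each event time, the running survival probability is multiplied by (1 - events / number at risk). The function name and return layout are illustrative choices; in practice you would use an established statistics package.

```python
def kaplan_meier(times, events):
    """Return a list of (time, n_at_risk, n_events, survival) tuples,
    one per distinct time at which at least one event occurred.

    times  : follow-up durations
    events : 1 = event occurred, 0 = censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_censored = 0
        # Gather everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1] == 1:
                n_events += 1
            else:
                n_censored += 1
            i += 1
        if n_events > 0:
            # The curve steps down only at event times.
            survival *= 1 - n_events / n_at_risk
            curve.append((t, n_at_risk, n_events, survival))
        # Events and censored patients both leave the risk set after t.
        n_at_risk -= n_events + n_censored
    return curve


# The four example patients from the data table (months, event status).
curve = kaplan_meier([24, 36, 12, 30], [1, 0, 1, 0])
```

With the four example patients, the curve drops to 0.75 at 12 months (1 event among 4 at risk) and to about 0.5 at 24 months (1 event among 3 at risk). The two censored patients never cause a drop.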
How to read a KM curve
The x-axis of a KM curve is time, and the y-axis is survival probability (0 to 1, or 0% to 100%):
- The curve starts at 1.0 (100%) in the upper left
- Each time a patient experiences an event, the curve drops by a step
- Censored observations are typically marked with small tick marks or plus signs on the curve — the curve does not drop at these points, but the number at risk decreases
- A flatter curve indicates a lower event rate and better prognosis
- Greater separation between two curves indicates a larger difference between groups
Median survival time
The median survival time is the time at which the KM curve crosses the 50% survival probability line. It is the estimated time by which half of the patients have experienced the event.
If the curve stays above 50% throughout (meaning fewer than half of patients are estimated to have experienced the event during observation), the median survival time cannot be estimated and is reported as "not reached". This is common in studies with favorable prognosis.
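A small sketch of how the median can be read off a KM curve, assuming the curve is available as (time, survival) steps. The convention used here, the first time at which survival is at or below 0.5, is one common choice; statistical packages can differ in how they handle a curve that sits exactly at 0.5.

```python
def median_survival(curve):
    """curve: (time, survival) pairs in increasing time order.

    Returns the first time at which survival is <= 0.5,
    or None when the curve never gets that low (median not reached).
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


reached = median_survival([(12, 0.75), (24, 0.5)])      # 24
not_reached = median_survival([(12, 0.9), (24, 0.8)])   # None
```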
Number at risk table
A properly formatted KM plot includes a risk table below the curve showing how many patients remain "at risk" at each time point. This table matters because estimates in the later portion of the curve are often based on very few patients and have low precision. If only 5 patients remain at a given time point, fluctuations in the curve are not reliable.
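The counts in a risk table can be sketched in a few lines: a patient is at risk at a given time if their follow-up (to event or censoring) has not ended before it. The function name and the chosen time points are illustrative.

```python
def number_at_risk(times, at):
    """For each time point in `at`, count patients whose follow-up
    lasts at least that long (still under observation just before it)."""
    return [sum(1 for t in times if t >= a) for a in at]


# Follow-up in months for the four example patients from the data table.
risk_row = number_at_risk([24, 36, 12, 30], at=[0, 12, 24, 36])
# risk_row is [4, 4, 3, 1]
```

By 36 months only one of the four patients is still at risk, which is the situation in which the tail of a KM curve should be read with caution.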
The Log-rank test
KM curves show visual differences, but you need a statistical test to determine whether the difference between groups is statistically significant.
The Log-rank test is the standard method. It works by comparing the observed versus expected number of events at each event time point across groups.
- Null hypothesis: The survival curves of the two groups are identical
- Output: A chi-square statistic and p-value
- Assumption: The test is most powerful when the hazard ratio between groups is approximately constant over the follow-up period. KM curves that cross are a common sign that it is not
If the KM curves cross (for example, one treatment works better short-term but worse long-term), the Log-rank test loses power because early and late differences offset each other. In such cases, consider alternatives such as a weighted test (e.g., the Gehan-Breslow generalized Wilcoxon test) or a piecewise analysis.
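The observed-versus-expected comparison can be sketched as follows for two groups. This is an illustrative pure-Python version under the usual hypergeometric variance formula, not a replacement for a validated package; the function and variable names are assumptions for this example.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value), 1 degree of freedom."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    obs_minus_exp = 0.0  # group A: observed minus expected events
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in A just before t
        n_b = sum(1 for x in times_b if x >= t)  # at risk in B just before t
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("no event time with both groups at risk")
    chi_square = obs_minus_exp ** 2 / variance
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Tiny made-up example: group A fails early, group B fails late.
chi_square, p_value = logrank_test([1, 2], [1, 1], [3, 4], [1, 1])
```

With only two patients per group, the statistic is 49/17 (about 2.88) and the p-value is a little under 0.09, so even a complete separation of event times is not significant at this sample size.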
Cox proportional hazards regression
The Log-rank test tells you whether two groups differ but not how large the difference is, and it cannot adjust for confounders. That is where Cox regression comes in.
Cox proportional hazards regression is the most important multivariable method in survival analysis. Its output is the hazard ratio (HR):
- HR = 1: Equal risk in both groups
- HR > 1: The factor increases event risk (worse prognosis)
- HR < 1: The factor decreases event risk (better prognosis)
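For example, "HR for the treatment group versus control = 0.62 (95% CI: 0.45–0.85, p = 0.003)" means that, after adjusting for the other variables in the model, the treatment group had a 38% lower hazard (instantaneous event rate) than controls at any given time during follow-up. Because the 95% confidence interval excludes 1, the difference is statistically significant at the 5% level.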
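The arithmetic behind this reading is short enough to sketch. The helper names below are made up for the example. The last check relies on the fact that a standard (Wald) confidence interval for an HR is symmetric on the log scale, so the point estimate should sit close to the geometric mean of the two limits.

```python
import math


def percent_change_in_hazard(hr):
    """Negative values mean a lower hazard than the reference group."""
    return (hr - 1.0) * 100.0


def ci_excludes_null(lower, upper):
    """True when the confidence interval does not contain HR = 1."""
    return lower > 1.0 or upper < 1.0


def geometric_midpoint(lower, upper):
    """Midpoint of the interval on the log scale, back-transformed."""
    return math.exp((math.log(lower) + math.log(upper)) / 2)


change = percent_change_in_hazard(0.62)   # about -38 (a 38% lower hazard)
significant = ci_excludes_null(0.45, 0.85)
midpoint = geometric_midpoint(0.45, 0.85)  # close to the reported 0.62
```

The midpoint check is a quick way to spot transcription errors when copying HRs and confidence limits into a manuscript table.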
The proportional hazards assumption
The key assumption of Cox regression is the proportional hazards assumption: the hazard ratio between groups remains constant throughout the follow-up period.
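In symbols, the model can be written as below. The notation is introduced here for illustration and does not appear elsewhere in this article: h_0(t) is an unspecified baseline hazard, x_1, ..., x_p are the covariates, and the betas are the regression coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p),
\qquad
\mathrm{HR}_1 = \frac{h(t \mid x_1 = 1,\, x_2, \dots, x_p)}{h(t \mid x_1 = 0,\, x_2, \dots, x_p)} = \exp(\beta_1)
```

The baseline hazard cancels in the ratio, so the HR does not depend on t. That cancellation is exactly what "proportional hazards" means, and it is why a single HR can summarize the whole follow-up period only when the assumption holds.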
Methods for checking this include:
- Schoenfeld residual test: A significant p-value indicates violation of the proportional hazards assumption
- Visual inspection of KM curves: If curves cross, the assumption is likely violated
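If the assumption does not hold, consider splitting follow-up into time intervals and estimating a separate HR in each, stratifying the model on the variable that violates the assumption, or using a Cox model with time-varying coefficients.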
Multivariable Cox regression
In practice, Cox regression is typically reported in two stages:
- Univariable analysis: Test each variable individually against the outcome and select significant ones (usually p < 0.1 or p < 0.2 as inclusion criteria)
- Multivariable analysis: Enter all selected variables simultaneously to obtain adjusted HRs
Multivariable Cox regression results are usually presented as forest plots with HR on the x-axis (log scale) and a reference line at HR = 1. This is one of the most common result presentations in clinical papers.
Common pitfalls
1. Inconsistent starting points
Some patients are measured from diagnosis date, others from surgery date. The starting point must be clearly defined in the study design and strictly consistent in the data.
2. Informative censoring
If a patient is lost to follow-up because their condition worsened and they transferred to another hospital, the censoring is related to the event itself. This violates the assumption of non-informative censoring on which KM estimates, the Log-rank test, and Cox regression all rely. The likely direction and size of this bias should be discussed in the paper.
3. Insufficient sample size
A common rule of thumb is that Cox regression needs at least 10–20 events (not patients) per predictor variable. If you have only 30 events total, that allows at most 3 variables under the 10-event rule and only 1 under the 20-event rule. Including too many variables leads to model overfitting.
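The events-per-variable rule of thumb reduces to integer division. The helper below is a hypothetical convenience function, and the default of 10 events per variable is the lenient end of the range mentioned above.

```python
def max_predictors(n_events, events_per_variable=10):
    """Largest number of predictors supported by the events-per-variable rule."""
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    return n_events // events_per_variable


lenient = max_predictors(30)        # 3 predictors at 10 events per variable
strict = max_predictors(30, 20)     # 1 predictor at 20 events per variable
```

Note that the input is the number of events, not the number of patients: a cohort of 500 patients with 30 recurrences still supports only a small model.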
4. Reporting p-values without HRs
Many novice researchers write "the difference was statistically significant (p < 0.05)" without reporting the HR and 95% CI. Reviewers will almost certainly request this information.
The manual workflow problem
Running survival analysis in SPSS requires manually setting time and event variables, iteratively building models, and manually formatting KM plots. R is more flexible but has a steeper learning curve — the parameters for the survival and survminer packages alone take time to master.
The presentation of survival analysis results also involves many details: KM plots need risk tables, Cox regression needs forest plots, and you need to report proportional hazards assumption tests. Each detail requires additional code and formatting work.
How Data2Paper fits into this workflow
Data2Paper includes a complete survival analysis module. Upload a clinical data file containing time and event status variables, and the system automatically detects the data structure, generates Kaplan-Meier survival curves with risk tables, runs Log-rank tests, builds Cox regression models, and outputs journal-ready figures and interpretation text.
No coding required, no switching between statistical software and your document — upload the data, describe the research question, and get complete results ready for your manuscript.