2026/03/28

Survival Analysis Primer: Kaplan-Meier Curves, Log-rank Tests, and Cox Regression

A practical guide to survival analysis for clinical researchers — when to use it, how to prepare your data, and how to interpret KM curves and Cox regression results.

Your study compares outcomes between two groups of patients, and the endpoint is "time from surgery to recurrence." Some patients relapsed, some were still recurrence-free at the last follow-up, and some were lost to follow-up. You cannot simply use a t-test to compare average times between groups — because for the patients who did not relapse, you do not know what their true recurrence time would have been.

This is why you need survival analysis.

What is survival analysis?

Survival analysis is a family of statistical methods designed to handle time-to-event data. The "event" does not have to be death — it can be any outcome of interest:

Tumor recurrence
Disease progression
Postoperative complication
Graft failure
Patient death

The core advantage of survival analysis is that it correctly handles censored data: individuals who had not experienced the event when their observation ended, whether at study close or at loss to follow-up. If you exclude these patients, you introduce serious bias. If you treat their follow-up time as an event time, you underestimate survival. Survival analysis provides a mathematical framework to make proper use of this incomplete information.

What format does your data need?

For survival analysis, each patient needs at least two variables:

Time variable: Duration from the starting point to event occurrence or censoring. The starting point is typically the date of diagnosis, surgery, or enrollment. Units can be days, months, or years, but must be consistent
Status variable (event indicator): Marks whether the patient experienced the event. Typically coded as 1 = event occurred, 0 = censored

For example:

Patient ID	Follow-up (months)	Event status	Group
001	24	1 (recurrence)	Treatment
002	36	0 (censored)	Control
003	12	1 (recurrence)	Treatment
004	30	0 (censored)	Treatment
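As a minimal sketch of how the two variables can be derived from raw dates, the Python snippet below converts a start date, an optional event date, and a last follow-up date into a (time, status) pair. The function names, the patient IDs, the dates, and the 30.4375-day average month are all illustrative assumptions, not part of any particular software.

```python
from datetime import date

def months_between(start: date, end: date) -> float:
    """Duration in months, assuming an average month of 30.4375 days."""
    return (end - start).days / 30.4375

def to_survival_record(start, event_date, last_followup):
    """Return (time_in_months, status) for one patient.

    status is 1 if the event was observed, 0 if the patient was
    censored at the last follow-up date.
    """
    if event_date is not None:
        return months_between(start, event_date), 1
    return months_between(start, last_followup), 0

# Hypothetical patients: (surgery date, recurrence date or None, last follow-up)
patients = {
    "P1": (date(2020, 1, 15), date(2022, 1, 15), date(2022, 1, 15)),
    "P2": (date(2020, 3, 1), None, date(2023, 3, 1)),
}
records = {pid: to_survival_record(*dates) for pid, dates in patients.items()}
for pid, (time, status) in records.items():
    print(pid, round(time, 1), status)
```

Measuring every patient from the same kind of start date (here, surgery) is what keeps the time variable consistent across the dataset.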

What counts as censoring?

The patient has not experienced the event by the end of the study
The patient was lost to follow-up
The patient withdrew for reasons unrelated to the study event (e.g., relocation, refusal to continue)

The most common data preparation errors are inconsistent time calculations (some measured from diagnosis, others from surgery) and inaccurate censoring status. Verify carefully before starting analysis.

The Kaplan-Meier method

The Kaplan-Meier (KM) method is the most fundamental and widely used survival analysis tool. It estimates the survival function — the probability that a patient has not yet experienced the event at any given time point t.
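To make the estimate concrete, here is a small Python sketch of the product-limit calculation: at each event time, the running survival probability is multiplied by (1 - events / number at risk). The function name and the seven-patient dataset are hypothetical, and a real analysis would normally use an established statistics package rather than this hand-rolled version.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up durations
    events : 1 = event observed, 0 = censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0   # events observed exactly at time t
        leaving = 0    # everyone (event or censored) leaving the risk set at t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up times in months and event indicators
times = [6, 12, 12, 18, 24, 30, 36]
events = [1, 1, 0, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print([(t, round(s, 3)) for t, s in curve])
# [(6, 0.857), (12, 0.714), (18, 0.536), (30, 0.268)]
```

Censored patients never trigger a drop, but they do shrink the risk set, which is why later drops are larger.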

How to read a KM curve

The x-axis of a KM curve is time, and the y-axis is survival probability (0 to 1, or 0% to 100%):

The curve starts at 1.0 (100%) in the upper left
Each time a patient experiences an event, the curve drops by a step
Censored observations are typically marked with small tick marks or plus signs on the curve — the curve does not drop at these points, but the number at risk decreases
A flatter curve indicates a lower event rate and better prognosis
Greater separation between two curves indicates a larger difference between groups

Median survival time

The median survival time is the time at which the KM curve first drops to or below the 50% survival probability line. It is the estimated time by which half of the patients have experienced the event.

If the curve stays above 50% throughout (meaning fewer than half of patients are estimated to have experienced the event during observation), the median survival time cannot be estimated and is usually reported as "not reached". This is common in studies with favorable prognosis.
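A short Python sketch of reading the median off a step curve follows. It assumes the common convention that the median is the first time at which estimated survival is at or below 0.5. The function name and both example curves are hypothetical.

```python
def median_survival(curve):
    """First time at which estimated survival is <= 0.5, or None if not reached.

    curve : list of (time, survival_probability) pairs, sorted by time
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical KM step values (time in months, survival probability)
crosses_half = [(6, 0.857), (12, 0.714), (18, 0.536), (30, 0.268)]
stays_above = [(6, 0.90), (20, 0.80)]

print(median_survival(crosses_half))  # 30
print(median_survival(stays_above))   # None
```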

Number at risk table

A properly formatted KM plot includes a risk table below the curve showing how many patients remain "at risk" at each time point. This table matters because estimates in the later portion of the curve are often based on very few patients and have low precision. If only 5 patients remain at a given time point, fluctuations in the curve are not reliable.
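The counts in a risk table can be sketched in a few lines of Python. This assumes the usual convention that a patient is at risk at a given time point if their follow-up time is at least that long. The function name and data are hypothetical.

```python
def number_at_risk(times, at_times):
    """Count patients still under observation at each requested time point."""
    return [sum(1 for t in times if t >= point) for point in at_times]

# Hypothetical follow-up times in months (events and censored combined)
times = [6, 12, 12, 18, 24, 30, 36]
table = number_at_risk(times, [0, 12, 24, 36])
print(table)  # [7, 6, 3, 1]
```

With only one patient left at 36 months, any step in the curve near that point rests on a single observation.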

The Log-rank test

KM curves show visual differences, but you need a statistical test to determine whether the difference between groups is statistically significant.

The Log-rank test is the standard method. It works by comparing the observed versus expected number of events at each event time point across groups.

Null hypothesis: The survival curves of the two groups are identical
Output: A chi-square statistic and p-value
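Assumption: Censoring is unrelated to prognosis. The test has the most power when the hazard ratio between groups is approximately constant over the follow-up period (proportional hazards); KM curves that clearly cross are a sign that this does not hold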

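If the KM curves cross (for example, one treatment works better short-term but worse long-term), the Log-rank test loses power. In such cases, consider alternatives such as a weighted test (e.g., the Gehan-Breslow-Wilcoxon test, which gives more weight to early events) or a piecewise analysis over separate time intervals.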
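The observed-versus-expected comparison behind the Log-rank test can be sketched in plain Python. This is an illustrative two-group version using the standard hypergeometric variance; the function name and data are hypothetical, and it does not guard against degenerate inputs such as an empty group.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value), 1 degree of freedom."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in pooled if u >= t)               # at risk overall
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk in group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)    # events overall
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Hypothetical data: every patient in group A has the event before anyone in group B
chi_square, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5,
                                   [6, 7, 8, 9, 10], [1] * 5)
print(round(chi_square, 2), p_value < 0.01)  # 9.7 True
```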

Cox proportional hazards regression

The Log-rank test tells you whether two groups differ but not how large the difference is, and it cannot adjust for confounders. That is where Cox regression comes in.

Cox proportional hazards regression is the most important multivariable method in survival analysis. Its output is the hazard ratio (HR):

HR = 1: Equal risk in both groups
HR > 1: The factor increases event risk (worse prognosis)
HR < 1: The factor decreases event risk (better prognosis)
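To show where an HR comes from, the Python sketch below fits a Cox model with a single covariate by Newton-Raphson on the partial likelihood, using the Breslow approximation for ties. It is a teaching sketch under simplifying assumptions (one covariate, fixed iteration count, no convergence checks); the function name and data are hypothetical, and real analyses should rely on an established package.

```python
import math

def cox_single_covariate(times, events, x, iterations=25):
    """Fit h(t|x) = h0(t) * exp(beta * x) for one covariate.

    Returns (hazard_ratio, standard_error_of_beta).
    """
    beta = 0.0
    info = 0.0
    for _ in range(iterations):
        score = 0.0
        info = 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if e_i != 1:
                continue  # censored patients contribute only through risk sets
            # Risk set: everyone still under observation at time t_i
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            s0 = sum(math.exp(beta * x_j) for x_j in risk)
            s1 = sum(x_j * math.exp(beta * x_j) for x_j in risk)
            s2 = sum(x_j ** 2 * math.exp(beta * x_j) for x_j in risk)
            score += x_i - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2
        beta += score / info  # Newton-Raphson update
    # info was evaluated just before the final update; at convergence the gap is negligible
    return math.exp(beta), 1 / math.sqrt(info)

# Hypothetical data: x = 1 marks an exposed group with generally shorter times
times = [2, 4, 5, 7, 9, 3, 6, 8, 10, 12]
events = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
x = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
hr, se = cox_single_covariate(times, events, x)
print(hr > 1)  # True: the exposed group has the higher hazard
```

Swapping the 0/1 coding of the covariate gives the reciprocal HR, which is a useful sanity check on which group is the reference.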

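For example, "HR for the treatment group versus control = 0.62 (95% CI: 0.45–0.85, p = 0.003)" means that, after adjusting for the other variables in the model, the hazard (the instantaneous event rate) in the treatment group was 38% lower than in controls throughout follow-up. Because the 95% CI excludes 1, the reduction is statistically significant.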
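As a rough consistency check on a reported result like the one above, the standard error, z statistic, and p-value can be approximately recovered from the HR and its confidence interval. The Python sketch below assumes the interval is a Wald interval computed on the log scale; the function name is hypothetical.

```python
import math

def summarize_hr(hr, ci_low, ci_high, z_crit=1.96):
    """Return (percent change in hazard, SE of log HR, z, two-sided p-value)."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    percent_change = (hr - 1) * 100
    return percent_change, se, z, p

percent_change, se, z, p = summarize_hr(0.62, 0.45, 0.85)
print(round(percent_change), round(p, 3))  # -38 0.003
```

The recovered p-value agrees with the reported p = 0.003, so the HR, interval, and p-value in the example are mutually consistent.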

The proportional hazards assumption

The key assumption of Cox regression is the proportional hazards assumption: the hazard ratio between groups remains constant throughout the follow-up period.

Methods for checking this include:

Schoenfeld residual test: A significant p-value indicates violation of the proportional hazards assumption
Visual inspection: If the KM curves cross, or the log-minus-log survival curves are clearly not parallel, the assumption is likely violated

If the assumption does not hold, consider stratifying the model on the variable that violates it, splitting follow-up into time intervals, or fitting a Cox model with time-varying coefficients.

Multivariable Cox regression

In practice, Cox regression is typically reported in two stages:

Univariable analysis: Test each variable individually against the outcome and carry forward those that pass a lenient screening threshold (usually p < 0.1 or p < 0.2)
Multivariable analysis: Enter all selected variables simultaneously to obtain adjusted HRs

Multivariable Cox regression results are usually presented as forest plots with HR on the x-axis (log scale) and a reference line at HR = 1. This is one of the most common result presentations in clinical papers.

Common pitfalls

1. Inconsistent starting points

Some patients are measured from diagnosis date, others from surgery date. The starting point must be clearly defined in the study design and strictly consistent in the data.

2. Informative censoring

If a patient is lost to follow-up because their condition worsened and they transferred to another hospital, this censoring is related to the event itself. That violates the assumption that censoring is non-informative, meaning unrelated to the patient's risk of the event. The impact of this bias should be discussed in the paper.

3. Insufficient sample size

As a rule of thumb, Cox regression requires at least 10–20 events (not patients) per predictor variable. If you have only 30 events total, you can include at most 3 variables under the 10-event rule, and only 1 under the stricter 20-event rule. Including too many variables leads to model overfitting.
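The events-per-variable arithmetic is simple enough to check in a couple of lines of Python. The function name is hypothetical, and the thresholds are rules of thumb rather than hard limits.

```python
def max_predictors(n_events, events_per_variable=10):
    """Largest number of predictors allowed by an events-per-variable rule of thumb."""
    return n_events // events_per_variable

print(max_predictors(30, 10))  # 3
print(max_predictors(30, 20))  # 1
```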

4. Reporting p-values without HRs

Many novice researchers write "the difference was statistically significant (p < 0.05)" without reporting the HR and 95% CI. Reviewers will almost certainly request this information.

The manual workflow problem

Running survival analysis in SPSS requires manually setting time and event variables, iteratively building models, and manually formatting KM plots. R is more flexible but has a steeper learning curve — the parameters for the survival and survminer packages alone take time to master.

The presentation of survival analysis results also involves many details: KM plots need risk tables, Cox regression needs forest plots, and you need to report proportional hazards assumption tests. Each detail requires additional code and formatting work.

How Data2Paper fits into this workflow

Data2Paper includes a complete survival analysis module. Upload a clinical data file containing time and event status variables, and the system automatically detects the data structure, generates Kaplan-Meier survival curves with risk tables, runs Log-rank tests, builds Cox regression models, and outputs journal-ready figures and interpretation text.

No coding required, no switching between statistical software and your document — upload the data, describe the research question, and get complete results ready for your manuscript.

Upload your clinical data and start generating your paper →

Author

Data2Paper Team
