Survival Analysis Tutorial

A Self-Paced Tutorial on Visual Analytics for Survival Analysis


Welcome to the Survival Analysis Tutorial! In this tutorial, you'll learn the intricacies of analyzing and interpreting survival data, an increasingly important skill in biomedicine. Whether you're investigating the efficacy of new treatments, understanding disease progression at the molecular level, or exploring the lifespan of cells under various conditions, survival analysis is an indispensable tool. This tutorial provides a unique opportunity to challenge yourself and brush up on your data science skills. We'll walk you through the key concepts of survival analysis, and you'll learn about censoring and key visualizations of survival data. By the end of the tutorial, you'll know how to identify important features that can separate your data into groups with different survival probabilities, identify potential gene markers, and find a set of genes that affect survival. Your journey to mastering survival analysis begins here!

We will be using Orange; you can install it by visiting its download page.

The tutorial has four chapters, each including video lectures and links to the lecture notes. There is a short quiz at the end of each chapter.

Please complete a demographic background questionnaire before you begin the tutorial and a tutorial evaluation questionnaire after you have completed all the chapters. Both questionnaires are optional. Your answers will help us improve the software and the tutorial.

Please note:

  • For each of the four chapters, once you have reviewed the material, click "Start Quiz Section" to take the quiz.
  • You must answer all questions in a chapter before moving on to the next chapter.
  • Your progress is automatically saved. However, once you begin a chapter, try to complete it in one sitting. You can return to the tutorial at any time using the link sent to you in the email, but always aim to complete a section without interruption.
  • Once you've completed all the chapters, you'll be asked to submit your answers. Before submitting your answers, please complete the pre- and post-test questionnaires.
  • Participation in the tutorial is completely voluntary and you can stop at any time. In the top right corner of the menu, there's an option to "clear user data". Selecting it will delete all of your answers.

If you have any questions or need further clarification, please feel free to contact us on our Discord server.

We offer this material under the Creative Commons CC BY-NC-ND licence.

Pre-test Questionnaire

Please fill out this demographic background questionnaire. This should only take a minute of your time. While this section is optional, your answers will help us understand the audience for this tutorial and improve the content. Thank you very much!

What is your approximate age?


What is your employment status?


What is the highest level of education you have completed? If you are still a student, please specify the highest degree you have received.


What is (or was) your area of study (e.g., computer science, molecular biology)?

If applicable, which sector do you currently work in (e.g., healthcare, technology, education, finance)?

What is your region of the world (e.g. United States, Slovenia, ...)?

Some groups are currently underrepresented in data science careers. If you identify as a member of any of these groups, please select the checkbox(es) next to the group name below.


Please indicate your prior access to data science education [select all that apply]:


Please indicate your current access to data science education [select all that apply]:


On a scale of 1 (completely unfamiliar) to 5 (very knowledgeable), how would you rate your familiarity with Orange?

On a scale of 1 (completely unfamiliar) to 5 (very knowledgeable), how would you rate your familiarity with survival analysis?

Have you ever used software libraries or other tools for survival analysis? If yes, please list them below.

Chapter 1: Survival Data and Survival Curve

What is special about survival data? What is censoring? How do we visually represent changes in survival probabilities over time? In this chapter, we will go through some basic concepts of survival analysis. We will work with a toy example of survival data and show how to plot the survival curve, one of the key visualizations in survival analysis.

Please start by watching the following video:

Warmup Questions

If the video was too fast for you, or if you want to dive deeper into the concepts covered, we have also prepared a supplemental notebook. The topic of the Data & Survival Curve video is covered in Chapter 1. Please note that this is a complementary resource. To answer the warmup questions, going through the introductory video should suffice.

Survival Data And Estimation Of The Survival Curve

Consider the following data:

In our graph, the subjects (A, N, F, ...) are in rows. The start of the blue line tells us when we started observing the subject; this could be the time of the intervention, for example. The end of the blue line tells us when we last checked on the subject or observed an event. Observed events are marked with an X; all other observations are censored. For example, we started observing subject K in week 5, and the event occurred in week 8. The total survival time for K is therefore 3 weeks. The survival time for N is 4 weeks, and at the last visit the event had not yet occurred, so N is censored.
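The conversion from an observation window to a survival record can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not part of the Orange workflow. The field names are made up for this example, and the start and end weeks for subject N are hypothetical (the text gives only N's 4-week duration); the values for K come from the text.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    subject: str
    start_week: int       # when we started observing the subject
    end_week: int         # last check or the week of the event
    event_observed: bool  # True if marked with an X, False if censored


def to_survival_row(obs):
    """Turn an observation window into a (time, event) record."""
    return {
        "subject": obs.subject,
        "time": obs.end_week - obs.start_week,
        "event": int(obs.event_observed),
    }


observations = [
    Observation("K", 5, 8, True),   # from the text: observed in weeks 5 to 8, event
    Observation("N", 2, 6, False),  # hypothetical weeks; only the 4-week span is given
]

rows = [to_survival_row(obs) for obs in observations]
for row in rows:
    print(row)
```

A table with one such row per subject, with a time column and an event column, is the shape of data that survival analysis tools typically expect.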

Use the graph above and construct a data table ready for survival analysis. Use Excel or a similar spreadsheet program. Feed the data into Orange and answer the following questions:

By definition, median survival is the time at which half of the participants in a study population are expected to experience the event of interest. In Orange, you can view its value in the Kaplan-Meier widget.
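To make the definition concrete, here is a minimal sketch of the Kaplan-Meier (product-limit) estimate and of reading the median from it. The data are a hypothetical toy example, not the data from the graph above, and the median convention used (the first time at which the estimated survival probability is at or below 0.5) is an assumption; the value Orange reports may follow a slightly different convention.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time with at least one event."""
    curve = []
    survival = 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the estimated survival is at or below 0.5."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never drops to 0.5


# Hypothetical toy data: survival times in weeks, 1 = event, 0 = censored.
times = [3, 4, 5, 6, 8, 10]
events = [1, 0, 1, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"week {t}: S(t) = {s:.3f}")
print("median survival:", median_survival(curve))
```

Censored subjects never lower the curve directly; they only shrink the number at risk, which makes later drops larger.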

Congratulations, this concludes the quiz for this section!

Before proceeding to the next section, please take a minute to give us some feedback on this introductory section of the tutorial. While answering these questions is optional, your answers will help us improve the software and the tutorial.

Chapter 2: Exploring Survival Features

In this chapter, you will learn how to create groups of samples (subjects, patients) using additional features from the data set and how to assess whether these groups showcase a difference in survival. You will also learn how to automatically rank and find the data features that best correlate with survival.

To get started, watch the video below to see how to group your data and compare the survival curves of different groups:

Please also watch the video about survival-based ranking of data features:

We also cover the topic from these two videos in Chapter 2 of the accompanying notebook.

These two videos were about the creation and comparison of groups using both categorical and continuous features. We delved into the process of generating new data cohorts, comparing and visualizing survival curves, and implementing Kaplan-Meier plots using visual programming in Orange.

Exploring Survival Differences Between Cohorts Of Patients

Here we take a look at the U.S. Veterans Administration Lung Cancer Trial (from Kalbfleisch JD and Prentice RL, 1980), in which male patients with advanced, inoperable lung cancer received either standard therapy or investigational chemotherapy. The data set includes 137 patients, 9 of whom left the study before death. The study was designed to assess the benefit of test chemotherapy and analyze the effects of other covariates.

Your first task is to load the data into Orange and view the survival curve for the entire cohort of patients.

You can easily access this study's data in Orange using the Datasets widget. Search for "Veterans" and the dataset should appear at the top. This data already includes the information on time and event features (time since diagnosis [months], survival event), and you can feed the data to the Kaplan-Meier plot directly from the Datasets widget; you don't need to use the As Survival widget here, although it doesn't hurt.

The next task is to split the data into two cohorts. Our lung cancer trial dataset contains a variable called "Treatment" that indicates whether patients received standard chemotherapy or an alternative, test chemotherapy. We can use this variable to split the patients into two cohorts, the control and the test group. Use the Kaplan-Meier widget to answer the following questions:

While you can calculate the desired summary statistics by combining some of the other widgets in Orange, you can answer these questions directly with the Kaplan-Meier widget by grouping the data in this widget accordingly. The graph legend reports the counts, where "N" in the legend refers to the size of the cohort, and "n" refers to the number of uncensored patients (those for whom the event was observed).

Using Different Features for Splitting Patients into Cohorts

Let us analyze the effect of other variables in this data set. How does age affect patient survival? How is survival affected by the Karnofsky performance score?

The Karnofsky Performance Score is a scale used to quantify a patient's general well-being and ability to carry out daily activities. Its typical values are 10 to 30 for fully hospitalized patients, 40 to 60 for partially hospitalized patients, and 70 to 90 for patients able to care for themselves. Observe the differences in survival between two cohorts of patients split by the median value of age, and similarly between two cohorts split by the median Karnofsky performance score. You can use the Discretize widget and "Equal frequency" discretization with two intervals for grouping by the median value.
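Outside Orange, a median split can be sketched as follows. The scores are hypothetical, and the rule for values equal to the median (assigned to the "low" cohort here) is an assumption; the Discretize widget may place its threshold differently, so cohort sizes can differ when many values are tied at the median.

```python
from statistics import median


def split_by_median(values):
    """Label each value 'low' (at or below the median) or 'high' (above it)."""
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]


# Hypothetical Karnofsky scores for eight patients.
karnofsky = [30, 40, 50, 60, 70, 80, 90, 20]

labels = split_by_median(karnofsky)
for score, label in zip(karnofsky, labels):
    print(score, label)
```

The resulting labels play the role of the grouping variable in the Kaplan-Meier plot: one survival curve per label.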

Congratulations, this concludes the quiz for this section!

Before proceeding to the next section, we again ask you to give us some feedback, this time on the current section on exploring survival features. Thank you!

Chapter 3: Ranking Genes

What if our data contains thousands of features, for example, as is the case in some molecular biology datasets? How do we approach survival analysis in such cases? Here, you will learn how to use gene expression data for survival analysis. We will show you how to identify genes that are most predictive of survival.

Please start by watching the video below:

The topics from the video are also covered in Chapter 3 of the companion notebook.

In the video, we explored Orange's bioinformatics tools for analyzing and interpreting gene expression datasets, focusing on the relationship between gene expression and survival in breast cancer. Using the METABRIC study as an example, we demonstrated how to identify survival marker genes and understand their role in cancer prognosis.

Expression Data In Survival Analysis

Let’s look at the cervical cancer data from The Cancer Genome Atlas (TCGA). The dataset is available in the Datasets widget under “TCGA-CESC”. It includes survival data and gene expression values for 306 patients with cervical cancer.

In the literature, genes associated with regulating the UV response have been researched as potential therapeutic targets for cervical cancer treatment (e.g., Gu et al., 2019).

Your task is to identify a handful of potential marker genes associated with survival. For simplicity, you will look only at genes associated with a down-regulated ultraviolet radiation (UV) response. Oncogenes from the human papillomavirus (HPV), the leading cause of cervical cancer, have a complicated relationship with cellular response to UV.

The gene set associated with a down-regulated UV response can be found in the Gene Sets widget under the Molecular Signatures Database (MSigDB) among the Hallmark Gene Sets under the name "HALLMARK_UV_RESPONSE_DN".

Gene Ranking

Do not forget to install the Bioinformatics add-on first, and use the Gene Sets widget from this add-on to answer the question.

Now load the TCGA-CESC data and consider only the genes that make up the down-regulated UV response hallmark gene set.

Gene Expression-Based Cohort Analysis

Create two patient groups based on the expression of the highest-ranked gene from the set of downregulated UV response genes. Then, chart the corresponding survival curves.

To answer the following quiz questions, you will have to assemble the longest data analysis pipeline yet, using widgets such as Genes, Gene Sets, Rank Survival Features, Discretize, and Kaplan-Meier. You will become a survival analysis expert!

We again assume that, given a gene, cohorts are created by splitting the patients according to the median gene expression.
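The idea of ranking genes by how well a median split separates the survival curves can be sketched in plain Python. This is a rough illustration under stated assumptions: it uses the two-group log-rank test as the comparison, which may not match what the Rank Survival Features widget computes, and the gene names, expression values, and survival times are all hypothetical.

```python
import math
from statistics import median


def logrank_p(times, events, groups):
    """Two-group log-rank test; groups holds 0 or 1. Returns the p-value."""
    diff = 0.0      # observed minus expected events in group 1
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)                      # at risk
        n1 = sum(g for ti, g in zip(times, groups) if ti >= t)     # at risk, group 1
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        diff += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = diff ** 2 / variance
    # Upper tail of the chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def rank_genes(expression, times, events):
    """Median-split patients by each gene and sort genes by log-rank p-value."""
    scores = []
    for gene, values in expression.items():
        cut = median(values)
        groups = [1 if v > cut else 0 for v in values]
        scores.append((gene, logrank_p(times, events, groups)))
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical data: eight patients, all with an observed event.
times = [2, 3, 4, 5, 9, 10, 11, 12]
events = [1, 1, 1, 1, 1, 1, 1, 1]
expression = {
    "GENE_A": [9, 8, 7, 6, 1, 2, 3, 4],  # high expression, short survival
    "GENE_B": [1, 9, 2, 8, 3, 7, 4, 6],  # unrelated to survival
}

ranking = rank_genes(expression, times, events)
for gene, p in ranking:
    print(f"{gene}: p = {p:.4f}")
```

With thousands of genes, the smallest p-values should be read with multiple testing in mind; this sketch makes no such correction.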

Congratulations, this concludes the quiz for this section!

Please give us your feedback on the Gene Ranking chapter of our tutorial. Thank you!

Chapter 4: Ranking gene sets

There may be many, possibly hundreds of genes that affect survival. For interpretation purposes, it may be helpful to consider only groups of genes instead, such as genes from the same pathway. Here, we will learn how to evaluate the effect of a collection of genes on survival and how to rank gene sets according to their association with survival.

Please start by watching the following video:

The topics from the video are also covered in Chapter 4 of the accompanying notebook.

In the video, we learned how to quantify the relationship between gene sets and survival. The tutorial guided us through the process of loading, preprocessing, and evaluating the data to understand how the expression of specific gene sets correlates with patient survival.

Gene Sets Ranking

In 2019, Xia et al. published a paper identifying a new set of proteins that cause endogenous DNA damage when overproduced. The researchers discovered these DNA damage-up proteins (DDPs) in E. coli and identified 284 human homologs that are overrepresented among known cancer drivers.

They constructed three sets of genes: a set of all 284 candidate DDPs, a set of DDPs excluding known cancer drivers, and a set of DDPs excluding validated DDPs. Known cancer drivers are those whose gain- or loss-of-function in driving cancer has been established in the literature. Validated DDPs, on the other hand, refer to a subset of DDPs identified as actual DNA damage initiators in human cells by Xia et al. They evaluated the association of the three DDP gene sets with overall survival by calculating a gene set enrichment score for each sample and comparing two cohorts, one above and one below the top tertiles. Using any of the three DDP gene sets resulted in significant survival differences between the formed cohorts. These results indicate that there are genes among the discovered human DDP candidates that were previously unknown to drive cancer.

Check whether the three DDP gene sets (all, known excluded, validated excluded) are associated with decreased overall survival in the BRCA dataset, even when you split the patients into cohorts by the median of the enrichment scores (Xia and co-authors used tertiles for splitting, see above).

The BRCA dataset is available in the Datasets widget under the name “TCGA-BRCA”. Loading the dataset might take a minute.

The gene sets are available here. Use the workflow with the File, Genes, and Gene Sets widgets (see the image below) to load the gene sets into Orange. Once you load custom gene sets into the Gene Sets widget, continue using the widget as you usually would. Alternatively, you can download a pre-constructed workflow and open it with Orange.

Remember to pass the TCGA-BRCA data through the As Survival Data widget. You will need to use the Single Sample Scoring widget to score the gene sets. After scoring, remove the gene expression information using Select Columns and move the gene set scores to the Features section. You can evaluate each gene set individually using a combination of Discretize and Kaplan-Meier, or just use the Rank Survival Features widget to calculate the p-values for the differences in the survival curves.
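To give a feel for what scoring a gene set per sample means, here is a minimal sketch that averages standardized expression over the genes in a set. This mean z-score is a simple stand-in chosen for illustration; it is an assumption and is not claimed to be the method used by the Single Sample Scoring widget or by Xia et al. The gene names and values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_genes(expression):
    """Standardize each gene's expression across samples."""
    standardized = {}
    for gene, values in expression.items():
        mu, sigma = mean(values), pstdev(values)
        standardized[gene] = [
            (v - mu) / sigma if sigma > 0 else 0.0 for v in values
        ]
    return standardized


def gene_set_scores(expression, gene_set):
    """One score per sample: the mean z-score over the set's genes."""
    z = zscore_genes(expression)
    members = [g for g in gene_set if g in z]  # ignore genes not measured
    if not members:
        raise ValueError("no gene from the set is present in the data")
    n_samples = len(z[members[0]])
    return [mean(z[g][i] for g in members) for i in range(n_samples)]


# Hypothetical expression matrix: gene -> values for four samples.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [5.0, 5.0, 5.0, 5.0],
}
gene_set = ["G1", "G2", "G_MISSING"]  # hypothetical set; one gene is absent

scores = gene_set_scores(expression, gene_set)
print([round(s, 3) for s in scores])
```

Each sample's score then behaves like any other continuous feature: it can be split at the median and compared with survival curves.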

We can use the same procedure to check the utility of other gene sets. Your next task is to rank all fifty gene sets from the Hallmark Gene Set collection with regard to overall survival in the BRCA dataset (order by p-value).

Your workflow from the previous question is ready to answer this question as well; just switch to the Hallmark Gene Sets in the Gene Sets widget. Remember to select all of the listed gene sets, as the Gene Sets widget outputs only the selected sets. You will also need to move the computed scores in the Select Columns widget from the Metas panel to the Features.

The 3rd- and 4th-ranked gene sets in the Hallmark Gene Set collection are associated with regulating DNA damage. Form groups based on these two gene sets and plot the corresponding survival curves.

Here, you could stay with the workflow from the previous two questions, select the 3rd and 4th gene sets in the Rank Survival Features widget, and then pass the selection to Discretize to create an indicator variable for the patient cohorts. Make sure you discretize only the two gene sets, and not the Time variable.

Congratulations, this completes the quiz for this chapter and all the topic quizzes in this tutorial!

Again, we ask you to answer the following questions to help us improve our tutorial in the future. Thank you!

Post-test Questionnaire

To help us improve the tools we have created for this tutorial and the Orange Data Mining software, please fill out the survey below.

Have you successfully learned the core concepts of survival analysis? Given some data, would you be able to recognize a situation in which survival analysis applies?

On a scale of 1 (completely unfamiliar) to 5 (very knowledgeable), how would you rate your familiarity with survival analysis after completing this tutorial?

Would it be enough to read about survival analysis in a book?

How much did using hands-on examples and visual programming help you follow the study material? Please rate on a scale from 1 (not helpful at all) to 5 (extremely helpful).

Please also help us by answering the following series of questions about the Orange Data Mining Toolbox used in this tutorial. Your answers will help us to evaluate and improve the usability of Orange.

I think that I would like to use this software frequently (1-strongly agree, 5-strongly disagree).

I found the software unnecessarily complex (1-strongly agree, 5-strongly disagree).

I thought the software was easy to use (1-strongly agree, 5-strongly disagree).

I think that I would need the support of a technical person to be able to use this software (1-strongly agree, 5-strongly disagree).

I found the various functions in this software were well integrated (1-strongly agree, 5-strongly disagree).

I thought there was too much inconsistency in this software (1-strongly agree, 5-strongly disagree).

I would imagine that most people would learn to use this software very quickly (1-strongly agree, 5-strongly disagree).

I found the software very cumbersome to use (1-strongly agree, 5-strongly disagree).

I felt very confident using the software (1-strongly agree, 5-strongly disagree).

I needed to learn a lot of things before I could get going with this software (1-strongly agree, 5-strongly disagree).

Thank you!

You made it through. Great, congratulations! If you have answered all the mandatory quiz questions, there should be a submit button at the bottom of your browser window. Press submit to finish the quiz and claim the reward.