This statistical analysis investigates the biological factors that influence the development of heart disease, and the symptoms that may indicate or predict it. Accurately and preemptively predicting heart disease in patients requires a clear understanding of the physiological attributes that contribute to its development. Further, understanding how these traits interact with, influence, or correlate with each other can contribute to our understanding of the biological mechanisms that underlie blood vessel damage and heart disease development. Attributes such as cholesterol level, blood pressure, blood sugar, age, heart rate, and chest pain have all been implicated in some form with heart disease, but investigation is needed to determine whether they are truly associated with the onset of heart disease, and whether the incidences of these traits are correlated.
The dataset used in this analysis contains data from 303 individuals from the Cleveland Clinic (Janosi, Andras, et al., 1989). Data were collected from patients referred to the Cleveland Clinic for coronary angiography between May 1981 and September 1984 by Robert Detrano, M.D., Ph.D., of the Cleveland Clinic Foundation. The dataset contains 14 columns, selected from a larger set of 75 variables: 13 physiological attributes obtained from patients, and a final “target” variable indicating the presence or absence of heart disease.
The variables included in this dataset are as follows: age, sex, chest pain, blood pressure at rest, serum cholesterol, fasting blood sugar, resting electrocardiogram, maximum heart rate, exercise induced angina, oldpeak, slope, number of major vessels coloured by fluoroscopy, thallium heart rate test result, and target (which indicates the presence or absence of heart disease).
Details of the variables are included in Table 1.
Variable name | Details | Variable Type |
---|---|---|
age | Patient age in years | Numeric, ratio |
sex | Patient biological sex (Female:0; Male:1) | Nominal, binary |
cp | Type of chest pain experienced by the patient (0: typical angina; 1: atypical angina; 2: non-anginal pain; 3: asymptomatic) | Nominal |
trestbps | Patient resting blood pressure (mmHg) | Numeric, ratio |
chol | Serum cholesterol (mg/dl) | Numeric, ratio |
fbs | Fasting blood sugar > 120 mg/dl (0: False; 1: True) | Nominal, binary |
restecg | Resting electrocardiographic results (0: normal; 1: ST-wave abnormality; 2: probable or definite left ventricular hypertrophy) | Nominal |
thalach | Maximum heart rate achieved (bpm) | Numeric, ratio |
exang | Exercise induced angina (0: no angina; 1: angina present) | Nominal, binary |
oldpeak | ST depression induced by exercise relative to rest | Numeric, ratio |
slope | The slope of the peak exercise ST segment (0: upsloping; 1: flat; 2: downsloping) | Nominal |
ca | Number of major vessels coloured by fluoroscopy (0-3) | Ordinal |
thal | Thallium heart rate test results (1: normal blood flow; 2: fixed defect (no blood flow in some part of the heart); 3: reversible defect (abnormal blood flow)) | Nominal |
target | Indicates the presence of heart disease (0: no disease; 1: heart disease present) | Nominal, binary |
Table 1: Heart disease variables
In order to investigate the physiological factors contributing to, and associated with, heart disease, a series of hypothesis tests were conducted.
The following hypotheses were investigated:
Pair | Null Hypothesis (H0) | Alternative Hypothesis (Ha) |
---|---|---|
1 | There is no relationship between a patient’s age and their resting blood pressure. | There is a relationship between a patient’s age and their resting blood pressure. |
2 | There is no relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve. | There is a relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve. |
3 | There is no difference in serum cholesterol level between patients who have heart disease and patients who do not. | There is a difference in serum cholesterol level between patients who have heart disease and patients who do not. |
4 | There are no differences in resting blood pressure between patients with different types of chest pain. | There are differences in resting blood pressure between patients with different types of chest pain. |
5 | There is no relationship between a patient’s thallium heart rate test results and the presence of heart disease. | There is a relationship between a patient’s thallium heart rate test results and the presence of heart disease. |
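Missing values were checked for the variables used in the hypothesis tests. Only thal contains missing values (0.66% of records), as shown by the proportion of missing values per variable: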
## Variables sorted by number of missings:
## Variable Count
## thal 0.00660066
## age 0.00000000
## cp 0.00000000
## trestbps 0.00000000
## chol 0.00000000
## thalach 0.00000000
## target 0.00000000
In order to determine the correct statistical test for each hypothesis, the data for each variable was inspected. Each variable was assessed for whether it could be considered to approximate a normal distribution, and the appropriate statistical test was chosen accordingly.
Section 2 summarises the results of the tests conducted to determine which statistical test was appropriate for the data, and then presents the results of the hypothesis tests. The conclusions drawn for each hypothesis test are presented in Section 3.
H0: There is no relationship between a patient’s age and their resting blood pressure (trestbps).
HA: There is a relationship between a patient’s age and their resting blood pressure.
The age variable was tested for normality. Visual inspection of the histogram shows that the data approximates a normal distribution when plotted, with no apparent skewness or kurtosis (Figure 2). The qqplot for age also does not indicate substantial skewness in the data (Figure 3). Finally, standardised scores for both kurtosis and skewness were calculated. The score for kurtosis (-1.86) and the score for skewness (-1.49) were both within the acceptable range of +/- 2 proposed by Curran, West, and Finch (1996). No further testing was conducted, as the histogram, qqplot, and standardised skewness and kurtosis scores indicate that the age variable approximates a normal distribution.
The resting blood pressure (trestbps) variable was tested for normality. Visual inspection of the histogram (Figure 4) and qqplot (Figure 5) indicates a right skew in the data. Subsequently, standardised scores for both kurtosis and skewness were calculated. The score for kurtosis (3.13) and the score for skewness (5.02) were both outside the acceptable range of +/- 2 proposed by Curran, West, and Finch (1996). However, over 95% of the standardised scores fall within the range of +/- 1.96 (4.95% fall outside +/- 1.96), and over 99% fall within the bounds of +/- 3.29 (0.66% fall outside +/- 3.29). Following the guidance of Field, Miles and Field (2013), the data is therefore considered to approximate a normal distribution (m=131.69, sd=17.6, n=303).
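The normality screening described above can be sketched in Python as follows. This is a minimal illustration rather than the code used for the analysis: the function names are hypothetical, and the standard errors sqrt(6/n) and sqrt(24/n) are a common large-sample approximation assumed here, so results may differ slightly from the standardised scores reported above.

```python
import math


def standardised_shape(values):
    """Return (standardised skewness, standardised excess kurtosis).

    Assumption: standard errors are approximated by sqrt(6/n) for
    skewness and sqrt(24/n) for excess kurtosis.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness / math.sqrt(6 / n), excess_kurtosis / math.sqrt(24 / n)


def percent_outside(values, bound):
    """Percentage of observations whose z-score exceeds +/- bound."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator)
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    outside = sum(1 for x in values if abs((x - mean) / sd) > bound)
    return 100 * outside / n


# Hypothetical readings, for illustration only (not taken from the dataset)
readings = [120, 130, 125, 140, 118, 135, 128, 160, 122, 131]
z_skew, z_kurt = standardised_shape(readings)
print(z_skew, z_kurt, percent_outside(readings, 1.96), percent_outside(readings, 3.29))
```

A variable would pass the first check if both standardised scores lie within +/- 2, and the fallback check if no more than 5% of z-scores lie outside +/- 1.96 and no more than 1% lie outside +/- 3.29.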
Given that the data for both age and resting blood pressure can be considered approximately normal, a Pearson’s correlation test was used to test the hypothesis that there is a relationship between age and resting blood pressure.
The relationship between patient age and patient resting blood pressure (trestbps) was investigated using Pearson’s correlation test. A small, or “weak”, but statistically significant positive correlation was found (r = 0.28, n = 301, p<.001). The relationship between the two variables is shown in Figure 6.
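For readers who want to see how the correlation statistic is built, the following is a minimal sketch of Pearson's r and its associated t statistic using only the Python standard library. The function names and the example values are hypothetical, and the p-value is not computed here because that would need the t distribution, which the standard library does not provide.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp_xy / math.sqrt(ss_x * ss_y)


def correlation_t(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical ages and resting blood pressures, for illustration only
ages = [41, 52, 63, 47, 58, 66, 39, 55]
pressures = [120, 128, 145, 130, 125, 150, 118, 138]
r = pearson_r(ages, pressures)
print(r, correlation_t(r, len(ages)))

# t statistic implied by the reported r and n, taken at face value
print(correlation_t(0.28, 301))
```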
H0: There is no relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.
HA: There is a relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.
The serum cholesterol (chol) variable was tested for normality. Visual inspection of the histogram (Figure 7) and qqplot (Figure 8) indicates that there is kurtosis present, and a right skew in the data. Subsequently, standardised scores for both kurtosis and skewness were calculated. The score for kurtosis (15.96) and the score for skewness (8.07) were both outside the acceptable range of +/- 2 proposed by Curran, West, and Finch (1996). However, over 95% of the standardised scores fall within the range of +/- 1.96 (3.63% fall outside +/- 1.96), and over 99% fall within the bounds of +/- 3.29 (0.33% fall outside +/- 3.29). Following the guidance of Field, Miles and Field (2013), the data is therefore considered to approximate a normal distribution (m=246.69, sd=51.78, n=303).
The skewness in the data appears to be attributable to a single outlying value, as seen in the histogram (Figure 7). This outlier in the cholesterol variable has a value of 564 mg/dl, over 100 mg/dl higher than the next highest value. In order to minimise its influence on subsequent analyses, this value was removed.
The maximum heart rate (thalach) variable was tested for normality. Visual inspection of the histogram (Figure 9) and qqplot (Figure 10) shows a slight left skew in the data. Subsequently, standardised scores for both kurtosis and skewness were calculated. The standardised score for kurtosis (-0.19) was within the acceptable range of +/- 2, but the standardised score for skewness (-3.82) was outside it. However, over 95% of the standardised scores fall within the range of +/- 1.96 (3.96% fall outside +/- 1.96), and over 99% fall within the bounds of +/- 3.29 (0.33% fall outside +/- 3.29). Following the guidance of Field, Miles and Field (2013), the data is therefore considered to approximate a normal distribution (m=149.61, sd=22.88, n=303).
As the data for both serum cholesterol level and maximum heart rate achieved can be considered approximately normal, a Pearson’s correlation test was used to test the hypothesis that there is a relationship between serum cholesterol level and maximum heart rate achieved.
The relationship between patient serum cholesterol level (chol) and maximum heart rate achieved (thalach) was investigated using Pearson’s correlation test. No significant correlation between the two variables was found (r = -0.01, n = 300, p = 0.82). The relationship between the two variables is shown in Figure 11.
H0: There is no difference in serum cholesterol level between patients who have heart disease and patients who do not.
HA: There is a difference in serum cholesterol level between patients who have heart disease and patients who do not.
The variable “chol” was inspected in Section 2.2.1, and the distribution was shown to approximate a normal distribution.
165 patients in this sample have heart disease, and 138 do not (Figure 12). There is no missing data in this variable.
As the data for serum cholesterol can be considered approximately normal, and heart disease status defines two unrelated groups with homogeneous variance, an independent-samples t-test was used to test this hypothesis.
The serum cholesterol levels for patients with and without heart disease are displayed as boxplots, with the mean value indicated by a triangle (Figure 13).
An independent-samples t-test was conducted to compare the serum cholesterol levels for patients with and without heart disease. No significant difference in serum cholesterol levels was found between the patients without heart disease (M = 251.09, SD = 49.45, n = 138) and those with heart disease (M = 241.06, SD = 47.39, n = 164), (t(300) = 1.795, p = 0.07). The effect size, as measured by Cohen’s d, was 0.207, indicating that the difference between the two groups of patients was small (Sullivan & Feinn, 2012), as well as not being statistically significant.
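The reported t statistic and effect size can be approximately reproduced from the group summaries alone. The sketch below assumes a pooled-variance (Student's) t-test, which is consistent with the reported 300 degrees of freedom. The function name is hypothetical, and small differences from the reported values are expected because the inputs are rounded.

```python
import math


def pooled_t_and_cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Student's t statistic, degrees of freedom and Cohen's d from summaries.

    Assumption: equal variances, so both statistics use the pooled SD.
    """
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    pooled_sd = math.sqrt(pooled_var)
    cohens_d = (mean_1 - mean_2) / pooled_sd
    t_stat = (mean_1 - mean_2) / (pooled_sd * math.sqrt(1 / n_1 + 1 / n_2))
    return t_stat, df, cohens_d


# Group summaries as reported above: without heart disease, then with
t_stat, df, cohens_d = pooled_t_and_cohens_d(251.09, 49.45, 138, 241.06, 47.39, 164)
print(t_stat, df, cohens_d)
```

Working from summary statistics keeps the example self-contained. With the raw data, the same quantities would normally come from a statistics package.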
H0: There are no differences in resting blood pressure (trestbps) between patients with different types of chest pain (cp).
HA: There are differences in resting blood pressure between patients with different types of chest pain.
143 patients experience typical angina type chest pain, 50 experience atypical angina type chest pain, 86 experience non-anginal pain, and 24 experience no chest pain (Figure 14).
Descriptive statistics for resting blood pressure (trestbps) by chest pain type (cp):
cp | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 143 | 132.0210 | 18.03614 | 130.0 | 130.5913 | 14.8260 | 100 | 200 | 100 | 0.776930106 | 0.9586154 | 1.508258 |
1 | 50 | 128.4000 | 15.83718 | 128.0 | 126.7250 | 11.8608 | 101 | 192 | 91 | 1.433689629 | 3.7196530 | 2.239716 |
2 | 86 | 130.2907 | 16.54859 | 130.0 | 129.9286 | 14.8260 | 94 | 180 | 86 | 0.287317104 | 0.1918571 | 1.784480 |
3 | 24 | 141.5833 | 19.45992 | 142.5 | 141.5000 | 24.4629 | 110 | 178 | 68 | -0.002877795 | -1.1057784 | 3.972239 |
As the data for resting blood pressure can be considered approximately normal, and chest pain type defines four unrelated groups with homogeneous variance, an independent-samples ANOVA was used to test this hypothesis, with Tukey’s Honest Significant Difference (HSD) test as a post-hoc test.
The resting blood pressure (trestbps) of patients with different types of chest pain is displayed as boxplots, with the mean value indicated by a triangle. Significant differences between pairs of groups are indicated with an asterisk, while the letters “ns” indicate no significant difference between pairs (Figure 15).
A one-way between-groups analysis of variance (ANOVA) was conducted to explore differences in resting blood pressure between patients experiencing different types of chest pain. Patients were divided into four groups: typical angina, atypical angina, non-anginal pain, and asymptomatic (no pain). The analysis revealed a significant effect of chest pain type on resting blood pressure (F(3, 299) = 3.39, p = 0.0185). The effect size, calculated using eta squared, was 0.03, which indicates a small effect size.
Post-hoc comparison using the Tukey HSD test indicated that the mean resting blood pressure for asymptomatic patients (M = 141.58, SD = 19.46) was significantly different from that of patients with both non-anginal pain (M = 130.29, SD = 16.55), and atypical angina (M = 128.4, SD = 15.84). No significant differences between other pairs of groups were detected.
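The F statistic and eta squared reported for the ANOVA can be approximately reconstructed from the per-group n, mean, and SD listed earlier in this section. The sketch below is illustrative only: the function name is hypothetical, and it covers the omnibus test but not the Tukey HSD comparisons, which need the studentised range distribution.

```python
def one_way_anova_from_summary(groups):
    """One-way ANOVA from group summaries.

    groups: list of (n, mean, sd) tuples, one per group.
    Returns (F, df_between, df_within, eta_squared).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-groups: weighted squared distance of group means from the grand mean
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups: recovered from each group's sample variance
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# (n, mean, sd) of trestbps for cp = 0, 1, 2, 3, as reported in this section
chest_pain_groups = [
    (143, 132.0210, 18.03614),
    (50, 128.4000, 15.83718),
    (86, 130.2907, 16.54859),
    (24, 141.5833, 19.45992),
]
f_stat, df_between, df_within, eta_squared = one_way_anova_from_summary(chest_pain_groups)
print(f_stat, df_between, df_within, eta_squared)
```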
H0: There is no relationship between a patient’s thallium heart rate test results and the presence of heart disease.
HA: There is a relationship between a patient’s thallium heart rate test results and the presence of heart disease.
Table 2 shows the numbers of patients with and without heart disease, in each category of thallium heart rate test result (1-3).
Thallium heart rate test results and presence of heart disease
Thal result | Heart Disease: YES | Heart Disease: NO |
---|---|---|
1 | 7 | 12 |
2 | 129 | 36 |
3 | 28 | 89 |
Table 2: Cross-tabulated counts of heart disease status for each thallium heart rate test result: 1 = normal blood flow, 2 = fixed defect, 3 = reversible defect.
The observed and expected values for each of the thallium test results for patients with and without heart disease are displayed.
ctest$expected  # expected frequencies
## target
## thal 0 1
## 1 8.647841 10.35216
## 2 75.099668 89.90033
## 3 53.252492 63.74751
ctest$observed  # observed frequencies
## target
## thal 0 1
## 1 12 7
## 2 36 129
## 3 89 28
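A Chi-Square test for independence indicated a strong, significant association between thallium heart rate test results and heart disease status in patients (χ²(2, n = 301) = 83.79, p < 0.001; Cramer’s V = 0.53). The sample size of 301 reflects the observed frequencies above, which exclude the records with missing thal values.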
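As a cross-check, the expected frequencies, the chi-square statistic, and Cramer's V can be recomputed from the observed counts shown above. This is a minimal standard-library sketch with a hypothetical function name. No continuity correction is applied, and the p-value is omitted because it would need the chi-square distribution.

```python
import math


def chi_square_independence(observed):
    """Pearson chi-square test of independence for a contingency table.

    observed: list of rows, each a list of counts.
    Returns (chi2, degrees_of_freedom, cramers_v, expected).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    smaller_dim = min(len(row_totals), len(col_totals))
    cramers_v = math.sqrt(chi2 / (n * (smaller_dim - 1)))
    return chi2, dof, cramers_v, expected


# Observed counts: rows are thal = 1, 2, 3; columns are target = 0, 1
observed_counts = [[12, 7], [36, 129], [89, 28]]
chi2, dof, cramers_v, expected = chi_square_independence(observed_counts)
print(chi2, dof, cramers_v)
```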
As a result of the investigations conducted in section 2, the following conclusions have been drawn:
Brown, J. C., Gerhardt, T. E., & Kwon, E. Risk Factors for Coronary Artery Disease. [Updated 2023 Jan 23]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK554410/
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16.
Field, A., Field, Z., & Miles, J. (2012). Discovering statistics using IBM SPSS statistics. Sage Publications Limited.
Janosi, Andras, et al. (1989). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X
Sullivan, G. M., & Feinn, R. (2012). Using Effect Size-or Why the P Value Is Not Enough. Journal of Graduate Medical Education, 4(3), 279–282. https://doi.org/10.4300/JGME-D-12-00156.1
Tabachnick and Fidell. Using Multivariate Statistics, 6th Edition. Pearson.