1. Introduction

This statistical analysis investigates the biological factors that influence the development of heart disease, and the symptoms that may indicate or predict it. Predicting heart disease accurately and early requires a clear understanding of the physiological attributes that contribute to its development. Understanding how these traits interact, influence, or correlate with each other can also improve our understanding of the biological mechanisms that underlie blood vessel damage and heart disease development. Attributes such as cholesterol level, blood pressure, blood sugar, age, heart rate, and chest pain have all been implicated in heart disease, but investigation is needed to determine whether they are truly associated with its onset, and whether these traits are correlated with each other.

The dataset used in this analysis contains data from 303 individuals from the Cleveland Clinic (Janosi, Andras, et al., 1989). Data was collected from patients referred to the Cleveland Clinic for coronary angiography between May 1981 and September 1984 by Robert Detrano, M.D., Ph.D., of the Cleveland Clinic Foundation. The dataset contains 14 columns, selected from a larger set of 75 variables: 13 physiological attributes obtained from patients, and a final variable, called the “target” variable, indicating the presence or absence of heart disease.

The variables included in this dataset are as follows: age, sex, chest pain, blood pressure at rest, serum cholesterol, fasting blood sugar, resting electrocardiogram, maximum heart rate, exercise induced angina, oldpeak, slope, number of major vessels coloured by fluoroscopy, thallium heart rate test result, and target (which indicates the presence or absence of heart disease).

Details of the variables are included in Table 1.

Variable name	Details	Variable Type
age	Patient age in years	Numeric, ratio
sex	Patient biological sex (Female:0; Male:1)	Nominal, binary
cp	Type of chest pain experienced by the patient (0: typical angina; 1: atypical angina; 2: non-anginal pain; 3: asymptomatic)	Nominal
trestbps	Patient resting blood pressure (mm Hg)	Numeric, ratio
chol	Serum cholesterol (mg/dl)	Numeric, ratio
fbs	Fasting blood sugar > 120 mg/dl (0: False; 1: True)	Nominal, binary
restecg	Resting electrocardiographic results (0: normal; 1: ST-T wave abnormality; 2: probable or definite left ventricular hypertrophy)	Nominal
thalach	Maximum heart rate achieved (bpm)	Numeric, ratio
exang	Exercise induced angina (0: no angina; 1: angina present)	Nominal, binary
oldpeak	ST depression induced by exercise relative to rest	Numeric, ratio
slope	The slope of the peak exercise ST segment (0: upsloping; 1: flat; 2: downsloping)	Nominal
ca	Number of major vessels coloured by fluoroscopy (0-3)	Ordinal
thal	Thallium heart rate test results (1: normal blood flow; 2: fixed defect (no blood flow in some part of the heart); 3: reversible defect (abnormal blood flow))	Nominal
target	Indicates the presence of heart disease (0: no disease; 1: heart disease present)	Nominal, binary

Table 1: Heart disease variables

To investigate the physiological factors contributing to, and associated with, heart disease, a series of hypothesis tests was conducted.

The following hypotheses were investigated:

Pair	Null Hypothesis (H0)	Alternative Hypothesis (Ha)
1	There is no relationship between a patient’s age and their resting blood pressure.	There is a relationship between a patient’s age and their resting blood pressure.
2	There is no relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.	There is a relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.
3	There is no difference in serum cholesterol level between patients who have heart disease and patients who do not.	There is a difference in serum cholesterol level between patients who have heart disease and patients who do not.
4	There are no differences in resting blood pressure between patients with different types of chest pain.	There are differences in resting blood pressure between patients with different types of chest pain.
5	There is no relationship between a patient’s thallium heart rate test results and the presence of heart disease.	There is a relationship between a patient’s thallium heart rate test results and the presence of heart disease.

Prior to statistical testing, the variables of interest were evaluated for missingness, to determine whether missing data represented a source of bias (Figure 1). Of the variables of interest, only the “thal” variable had missing values, with a proportion of 0.007 missing. Because this proportion is very low, the sample of over 300 individuals is large, and there is no observable pattern to the missing data, the missing data is not considered a source of bias in the analysis, following the recommendation of Tabachnick and Fidell (2016).

Figure 1: Inspection of Missing Data

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##      thal 0.00660066
##       age 0.00000000
##        cp 0.00000000
##  trestbps 0.00000000
##      chol 0.00000000
##   thalach 0.00000000
##    target 0.00000000
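The proportion reported for “thal” can be checked by hand. The short Python sketch below is an illustration only (the output in this report appears to come from R), and it assumes that 2 of the 303 records lack a “thal” value; that count is inferred from the proportion and is not stated in the output above.

```python
# Assumption: 2 of the 303 records have no "thal" value (inferred, not stated).
n_records = 303
n_missing_thal = 2

proportion_missing = n_missing_thal / n_records
print(round(proportion_missing, 8))
```

Rounded to three decimal places, this is the 0.007 quoted in the text.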

In order to determine the correct statistical test for each hypothesis, the data for each variable was inspected. Each variable was assessed to determine whether it could be considered to approximate a normal distribution, and the appropriate statistical test was chosen accordingly.

Section 2 summarises the results of the tests conducted to determine which statistical test was appropriate for the data, and then presents the results of the hypothesis tests. The conclusions drawn for each hypothesis test are presented in Section 3.

2. Hypotheses

2.1 Relationship between patient age and resting blood pressure.

H0: There is no relationship between a patient’s age and their resting blood pressure (trestbps).

HA: There is a relationship between a patient’s age and their resting blood pressure.

2.1.1 Inspecting age and trestbps

The age variable was tested for normality. Visual inspection of the histogram shows that the data approximates a normal distribution, with no apparent skewness or kurtosis (Figure 2). The qqplot for age also does not indicate substantial skewness in the data (Figure 3). Finally, standardised scores for both kurtosis and skewness were calculated. The score for kurtosis (-1.86) and the score for skewness (-1.49) were both within the acceptable range of +/- 2 proposed by Curran, West, and Finch (1996). No further testing was conducted, as the histogram, qqplot, and skewness and kurtosis standardised scores indicate that the age variable approximates a normal distribution.

Figure 2: Histogram for Patient Age

Figure 3: QQPlot for Patient Age

The resting blood pressure (trestbps) variable was tested for normality. Visual inspection of the histogram (Figure 4) and qqplot (Figure 5) indicates a right skew in the data. Subsequently, standardised scores for both kurtosis and skewness were calculated. The score for kurtosis (3.13) and the score for skewness (5.02) were both outside the acceptable range of +/- 2 proposed by Curran, West, and Finch (1996). However, over 95% of the standardised scores fall within the range of +/- 1.96 (4.95% fall outside +/- 1.96), and over 99% of standardised scores fall within the bounds of +/- 3.29 (0.66% fall outside +/- 3.29). Following the guidance of Field, Miles and Field (2013), the data is therefore considered to approximate a normal distribution (m=131.69, sd=17.6, n=303).

Figure 4: Histogram for Patient resting blood pressure

Figure 5: QQPlot for Patient resting blood pressure
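The two normality screens described above (standardised skewness and kurtosis scores, and the share of standardised scores outside +/- 1.96 and +/- 3.29) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this analysis: the function names are hypothetical, the sample is made up, and the standard-error formulas are one common convention, which may differ from the one used to obtain the scores reported here.

```python
import math

def standardised_skew_kurtosis(values):
    """Return (skewness / SE, excess kurtosis / SE) for a sample of n > 3 values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3.0
    # Bias-adjusted sample skewness and excess kurtosis.
    skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)
    # Standard errors of those estimates under normality.
    se_skew = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

def share_outside(values, bound):
    """Fraction of z-scores (using the sample SD) with absolute value above bound."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sum(abs((v - mean) / sd) > bound for v in values) / n

# Hypothetical blood pressure readings, not the Cleveland data.
sample = [110, 118, 120, 124, 128, 130, 130, 134, 138, 140, 150, 200]
z_skew, z_kurt = standardised_skew_kurtosis(sample)
print(z_skew, z_kurt, share_outside(sample, 1.96), share_outside(sample, 3.29))
```

A standardised score outside +/- 2 flags the variable for the second screen, in which the shares outside +/- 1.96 and +/- 3.29 are compared with 5% and 1% respectively.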

Given that the data for both age and resting blood pressure can be considered approximately normal, a Pearson’s correlation test was used to test the hypothesis that there is a relationship between age and resting blood pressure.

2.1.2 Testing hypothesis 1

The relationship between patient age and patient resting blood pressure (trestbps) was investigated using Pearson’s correlation test. A small, or “weak”, but statistically significant positive correlation was found (r = 0.28, n = 301, p<.001). The relationship between the two variables is shown in Figure 6.

Figure 6: Scatterplot of Patient Age and Resting Blood Pressure (trestbps)
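As a rough cross-check of the significance reported for this correlation, the p-value can be approximated with Fisher's z-transformation. The Python sketch below is an illustration only: it uses the r and n reported above, it relies on a normal approximation rather than the exact t-based test, and the function name is hypothetical.

```python
import math

def pearson_p_value_fisher(r, n):
    """Approximate two-sided p-value for H0: rho = 0 using Fisher's z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

# r and n as reported for age versus resting blood pressure.
p_age_bp = pearson_p_value_fisher(0.28, 301)
print(p_age_bp < 0.001)
```

The approximation agrees with the reported result that p is below .001.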

2.2 Relationship between patient’s serum cholesterol level and maximum heart rate achieved

H0: There is no relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.
HA: There is a relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.

2.2.1 Inspecting serum cholesterol (chol) and maximum heart rate (thalach)

The serum cholesterol (chol) variable was tested for normality. Visual inspection of the histogram (Figure 7) and qqplot (Figure 8) indicates that there is kurtosis present, and a right skew in the data. Subsequently, standardised scores for both kurtosis and skewness were calculated. The score for kurtosis (15.96) and the score for skewness (8.07) were both outside the acceptable range of +/- 2 proposed by Curran, West, and Finch (1996). However, over 95% of the standardised scores fall within the range of +/- 1.96 (3.63% fall outside +/- 1.96), and over 99% of standardised scores fall within the bounds of +/- 3.29 (0.33% fall outside +/- 3.29). Following the guidance of Field, Miles and Field (2013), the data is therefore considered to approximate a normal distribution (m=246.69, sd=51.78, n=303).

Figure 7: Histogram for Patient serum cholesterol

Figure 8: QQPlot for Patient serum cholesterol

The skewness in the data appears to be attributable to a single outlying value, as seen in the histogram (Figure 7). This value (564 mg/dl, over 100 mg/dl higher than the next highest value) was removed from the cholesterol variable to minimise its influence on subsequent analyses.
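The removal rule described above can be sketched in Python as follows. This is a hypothetical illustration: the function name and the sample values are made up, and only the value 564 and the 100 mg/dl gap are taken from the text.

```python
def drop_isolated_maximum(values, min_gap):
    """Drop the largest value if it exceeds the runner-up by more than min_gap."""
    top, runner_up = sorted(values, reverse=True)[:2]
    if top - runner_up > min_gap:
        return [v for v in values if v != top]
    return list(values)

# Hypothetical cholesterol readings (mg/dl); only 564 comes from the report.
chol = [204, 233, 564, 250, 286, 354]
print(drop_isolated_maximum(chol, 100))
```

The remaining values keep their original order, so the rows stay aligned with the other variables once the same record is dropped from them.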

The maximum heart rate (thalach) variable was tested for normality. Visual inspection of the histogram (Figure 9) and qqplot (Figure 10) shows a slight left skew in the data. Subsequently, standardised scores for both kurtosis and skewness were calculated. The standardised score for kurtosis (-0.19) was within the acceptable range of +/- 2, but the standardised score for skewness (-3.82) was outside it. However, over 95% of the standardised scores fall within the range of +/- 1.96 (3.96% fall outside +/- 1.96), and over 99% of standardised scores fall within the bounds of +/- 3.29 (0.33% fall outside +/- 3.29). Following the guidance of Field, Miles and Field (2013), the data is therefore considered to approximate a normal distribution (m=149.61, sd=22.88, n=303).

Figure 9: Histogram for Maximum heart rate (bpm)

Figure 10: QQPlot for Maximum heart rate (bpm)

As the data for both serum cholesterol level and maximum heart rate achieved can be considered approximately normal, a Pearson’s correlation test was used to test the hypothesis that there is a relationship between serum cholesterol level and maximum heart rate achieved.

2.2.2 Testing hypothesis 2

The relationship between patient serum cholesterol level (chol) and maximum heart rate achieved (thalach) was investigated using Pearson’s correlation test. No significant correlation between the two variables was found (r = -0.01, n = 300, p = 0.82). The relationship between the two variables is shown in Figure 11.

Figure 11: Scatterplot of patient maximum heart rate and serum cholesterol

2.3 Difference in serum cholesterol levels (chol) for patients with or without heart disease

H0: There is no difference in serum cholesterol level between patients who have heart disease and patients who do not.
HA: There is a difference in serum cholesterol level between patients who have heart disease and patients who do not.

2.3.1 Inspecting cholesterol and heart disease (target)

The variable “chol” was inspected in Section 2.2.1, and the distribution was shown to approximate a normal distribution.

165 patients in this sample have heart disease, and 138 do not (Figure 12). There is no missing data in this variable.

Figure 12: Proportions of Patients With and Without Heart Disease

The serum cholesterol level data can be considered to approximate a normal distribution (see section 2.2.1). To assess whether the cholesterol data has homogeneity of variance across the two groups (patients with and without heart disease), Levene’s test for homogeneity of variance was conducted. The results from this test (Df = 301, F = 0.72, p = 0.398) indicate that there is homogeneity of variance in the serum cholesterol variable between the two groups, as the p value is greater than 0.05.

As the data for serum cholesterol can be considered approximately normal, and heart disease status defines two unrelated groups with homogeneous variance, an independent t-test was used to test this hypothesis.

2.3.2 Testing hypothesis 3

Figure 13: Serum cholesterol levels in patients with and without heart disease

The serum cholesterol levels for patients with and without heart disease are displayed as boxplots, with the mean value indicated by a triangle (Figure 13).

An independent-samples t-test was conducted to compare the serum cholesterol levels for patients with and without heart disease. No significant difference in serum cholesterol levels was found between patients without heart disease (M = 251.09, SD = 49.45, n = 138) and patients with heart disease (M = 241.06, SD = 47.39, n = 164), t(300) = 1.795, p = 0.07. The effect size, as measured by Cohen’s d, was 0.207, indicating that the difference between the two groups of patients was small (Sullivan & Feinn, 2012) as well as not statistically significant.
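The t statistic and Cohen’s d can be reproduced from the group summaries quoted above. The Python sketch below is an illustration rather than the code used in this analysis: it assumes the equal-variance (pooled) form of the t-test and a pooled-SD definition of Cohen’s d, and the function name is hypothetical.

```python
import math

def pooled_t_and_cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Equal-variance t statistic, its degrees of freedom, and Cohen's d."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    t = (m1 - m2) / (pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2))
    d = (m1 - m2) / pooled_sd
    return t, df, d

# Group summaries as reported: without heart disease first, then with.
t_stat, df, cohens_d = pooled_t_and_cohens_d(251.09, 49.45, 138, 241.06, 47.39, 164)
print(round(t_stat, 2), df, round(cohens_d, 2))
```

Because the inputs are rounded summary statistics, the result matches the reported t(300) = 1.795 and d = 0.207 only to about two decimal places.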

2.4 Differences in resting blood pressure in patients with different types of chest pain.

H0: There are no differences in resting blood pressure (trestbps) between patients with different types of chest pain (cp).

HA: There are differences in resting blood pressure between patients with different types of chest pain.

2.4.1 Inspection of chest pain and trestbps

143 patients experience typical angina type chest pain, 50 experience atypical angina type chest pain, 86 experience non-anginal pain, and 24 experience no chest pain (Figure 14).

##     item group1 vars   n     mean       sd median  trimmed     mad min max
## X11    1      0    1 143 132.0210 18.03614  130.0 130.5913 14.8260 100 200
## X12    2      1    1  50 128.4000 15.83718  128.0 126.7250 11.8608 101 192
## X13    3      2    1  86 130.2907 16.54859  130.0 129.9286 14.8260  94 180
## X14    4      3    1  24 141.5833 19.45992  142.5 141.5000 24.4629 110 178
##     range         skew   kurtosis       se
## X11   100  0.776930106  0.9586154 1.508258
## X12    91  1.433689629  3.7196530 2.239716
## X13    86  0.287317104  0.1918571 1.784480
## X14    68 -0.002877795 -1.1057784 3.972239

Figure 14: Proportions of Patients With Different Types of Chest Pain

The resting blood pressure data can be considered to approximate a normal distribution (see section 2.1.1). To assess whether the resting blood pressure variable has homogeneity of variance across the four chest pain groups, Bartlett’s test for homogeneity of variances was conducted. Though the sample sizes for each of the four groups are not equal, the variances between the groups are approximately equal according to the results of Bartlett’s test (K-squared = 2.2, df = 3, p = 0.53).
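Bartlett’s statistic depends only on the group sizes and variances, so it can be recomputed from the descriptive output above. The Python sketch below is an illustration, not the code used in this analysis; the chi-square tail formula in it is a closed form that holds only for 3 degrees of freedom.

```python
import math

# (n, sd) per chest pain group, copied from the descriptive output above.
groups = [(143, 18.03614), (50, 15.83718), (86, 16.54859), (24, 19.45992)]

k = len(groups)
n_total = sum(n for n, _ in groups)
pooled_var = sum((n - 1) * sd ** 2 for n, sd in groups) / (n_total - k)

numerator = (n_total - k) * math.log(pooled_var) - sum(
    (n - 1) * math.log(sd ** 2) for n, sd in groups
)
correction = 1.0 + (
    sum(1.0 / (n - 1) for n, _ in groups) - 1.0 / (n_total - k)
) / (3.0 * (k - 1))
k_squared = numerator / correction

# Upper tail of the chi-square distribution; valid only for df = 3.
p_value = math.erfc(math.sqrt(k_squared / 2.0)) + math.sqrt(
    2.0 * k_squared / math.pi
) * math.exp(-k_squared / 2.0)
print(round(k_squared, 2), round(p_value, 2))
```

A p-value above 0.05 means the hypothesis of equal variances is not rejected, which is the condition required for the ANOVA that follows.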

As the data for resting blood pressure can be considered approximately normal, and chest pain type defines four unrelated groups with homogeneous variance, an independent samples ANOVA was used to test this hypothesis, with Tukey’s Honest Significant Difference (HSD) test as a post-hoc test.

2.4.2 Testing hypothesis 4

Figure 15: Resting blood pressure of patients with different types of chest pain

The resting blood pressure (trestbps) of patients with different types of chest pain is displayed as boxplots, with the mean value indicated by a triangle. Significant differences between pairs of groups are indicated with an asterisk, while the letters “ns” indicate no significant difference between pairs (Figure 15).

A one-way between-groups analysis of variance (ANOVA) was conducted to explore differences in resting blood pressure between patients experiencing different types of chest pain. Patients were divided into four groups: typical angina, atypical angina, non-anginal pain, and asymptomatic (no pain). The analysis revealed a significant effect of chest pain type on resting blood pressure (F(3, 299) = 3.39, p = 0.0185). The effect size, calculated using eta squared, was 0.03, which indicates a small effect size.

Post-hoc comparison using the Tukey HSD test indicated that the mean resting blood pressure for asymptomatic patients (M = 141.58, SD = 19.46) was significantly different from that of patients with both non-anginal pain (M = 130.29, SD = 16.55), and atypical angina (M = 128.4, SD = 15.84). No significant differences between other pairs of groups were detected.
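The F statistic and eta squared for this ANOVA can be recovered from the group sizes, means, and standard deviations in the descriptive output of Section 2.4.1. The Python sketch below is an illustration rather than the code used in this analysis, and it does not compute the p-value or the Tukey HSD comparisons.

```python
# (n, mean, sd) per chest pain group, copied from the descriptive output.
groups = [
    (143, 132.0210, 18.03614),  # typical angina
    (50, 128.4000, 15.83718),   # atypical angina
    (86, 130.2907, 16.54859),   # non-anginal pain
    (24, 141.5833, 19.45992),   # asymptomatic
]

k = len(groups)
n_total = sum(n for n, _, _ in groups)
grand_mean = sum(n * m for n, m, _ in groups) / n_total

# Between-group and within-group sums of squares.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)

df_between, df_within = k - 1, n_total - k
f_stat = (ss_between / df_between) / (ss_within / df_within)
eta_squared = ss_between / (ss_between + ss_within)
print(df_between, df_within, round(f_stat, 2), round(eta_squared, 2))
```

Eta squared is the share of total variation in resting blood pressure that lies between the chest pain groups, which is why a significant F can still correspond to a small effect.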

2.5 Relationship between a patient’s thallium heart rate test results and the presence of heart disease

H0: There is no relationship between a patient’s thallium heart rate test results and the presence of heart disease.

HA: There is a relationship between a patient’s thallium heart rate test results and the presence of heart disease.

2.5.1 Inspection of thallium heart rate test results and target variable

Table 2 shows the numbers of patients with and without heart disease, in each category of thallium heart rate test result (1-3).

Thallium heart rate test results and presence of heart disease
Thal result	Heart Disease: YES	Heart Disease: NO
1	7	12
2	129	36
3	28	89

Table 2: Cross-tabulated counts of heart disease status for each thallium heart rate test result: 1 = normal blood flow, 2 = fixed defect, 3 = reversible defect.

The observed and expected values for each of the thallium test results for patients with and without heart disease are displayed.

ctest$expected  # expected frequencies

##     target
## thal         0        1
##    1  8.647841 10.35216
##    2 75.099668 89.90033
##    3 53.252492 63.74751

ctest$observed  # observed frequencies

##     target
## thal   0   1
##    1  12   7
##    2  36 129
##    3  89  28

A Chi-Square test for independence indicated a strong, significant association between thallium heart rate test results and heart disease status in patients (χ²(2, n = 301) = 83.79, p < 0.001, Cramer’s V = 0.53).
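The test statistic and Cramer’s V can be recomputed from the observed frequencies shown above. The Python sketch below is an illustration rather than the code used in this analysis; the p-value line uses a closed form that holds only for 2 degrees of freedom.

```python
import math

# Observed counts: rows are thal 1 to 3, columns are target 0 and target 1.
observed = [[12, 7], [36, 129], [89, 28]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = math.exp(-chi_sq / 2.0)  # chi-square upper tail, valid only for df = 2
cramers_v = math.sqrt(chi_sq / (n * (min(len(observed), len(observed[0])) - 1)))
print(n, df, round(chi_sq, 2), round(cramers_v, 2))
```

The observed counts sum to 301, the number of patients with a recorded thallium test result.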

3. Conclusions

As a result of the investigations conducted in section 2, the following conclusions have been drawn:

Pair 1
- H0: There is no relationship between a patient’s age and their resting blood pressure (trestbps).
- HA: There is a relationship between a patient’s age and their resting blood pressure.
- Conclusion: A small, positive, statistically significant correlation was found between a patient’s age and their resting blood pressure. Therefore, there is sufficient evidence to reject the null hypothesis in favour of the alternate hypothesis. Results indicate that an increase in patient age is associated with higher blood pressure levels.
Pair 2
- H0: There is no relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.
- HA: There is a relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve.
- Conclusion: There is no statistically significant relationship between a patient’s serum cholesterol level and the maximum heart rate they achieve. There is not sufficient evidence to reject the null hypothesis in favour of the alternate hypothesis. It appears that there is no association between cholesterol level and maximum heart rate in these patients.
Pair 3
- H0: There is no difference in serum cholesterol level between patients who have heart disease and patients who do not.
- HA: There is a difference in serum cholesterol level between patients who have heart disease and patients who do not.
- Conclusion: No statistically significant difference was found in the serum cholesterol levels for patients with and without heart disease. There is therefore not sufficient evidence to support rejecting the null hypothesis in favour of the alternate hypothesis.
Pair 4
- H0: There are no differences in resting blood pressure (trestbps) between patients with different types of chest pain (cp).
- HA: There are differences in resting blood pressure between patients with different types of chest pain.
- Conclusion: A statistically significant difference in resting blood pressure was found for patients with different types of chest pain. This difference was found to exist between asymptomatic patients and patients with atypical angina, and asymptomatic patients and patients with non-anginal pain. The difference detected, furthermore, was small, as indicated by an eta squared value of 0.03. Therefore, there is evidence to support rejecting the null hypothesis in favour of the alternate hypothesis, but the small effect size indicates that rejecting the null hypothesis should be treated with caution. Further data collection by adding an additional cohort of patients to the data could further clarify the effect chest pain has on patient blood pressure.
Pair 5
- H0: There is no relationship between a patient’s thallium heart rate test results and the presence of heart disease.
- HA: There is a relationship between a patient’s thallium heart rate test results and the presence of heart disease.
- Conclusion: A strong, statistically significant association was found between a patient’s thallium heart rate test result and their heart disease status. There is therefore evidence to reject the null hypothesis, and the result indicates that patients with heart disease tend to have certain thallium heart rate test results more often than those without heart disease.

4. References

Brown JC, Gerhardt TE, Kwon E. Risk Factors for Coronary Artery Disease. [Updated 2023 Jan 23]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK554410/
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16.
Field, A., Field, Z., & Miles, J. (2012). Discovering statistics using IBM SPSS statistics. Sage Publications Limited.
Janosi, Andras, et al. “Heart Disease.” UCI Machine Learning Repository, 1989, https://doi.org/10.24432/C52P4X.
Sullivan, G. M., & Feinn, R. (2012). Using Effect Size-or Why the P Value Is Not Enough. Journal of graduate medical education, 4(3), 279–282. https://doi.org/10.4300/JGME-D-12-00156.1
Tabachnick and Fidell, Using Multivariate Statistics, 6th Edition, Pearson.

Heart Disease Statistical Analysis

Grace Pender