

REVIEW ARTICLE 

Year: 2014 | Volume: 35 | Issue: 2 | Page: 119-123


Some basic aspects of statistical methods and sample size determination in health science research
VS Binu^{1}, Shreemathi S Mayya^{1}, Murali Dhar^{2}
^{1} Department of Statistics, Manipal University, Manipal, Karnataka, India ^{2} Department of Population Policies and Programmes, International Institute for Population Sciences, Deonar, Mumbai, Maharashtra, India
Date of Web Publication: 5-Dec-2014
Correspondence Address: Murali Dhar, Asso. Prof., Department of Population Policies and Programmes, International Institute for Population Sciences, Govandi Station Road, Deonar, Mumbai - 400 088, Maharashtra, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/0974-8520.146202
Abstract   
A health science researcher may sometimes wonder, "Why are statistical methods so important in research?" The simple answer is that statistical methods are used throughout a study: in planning, designing, collecting data, analyzing, drawing meaningful interpretations and reporting the findings. Hence, it is important that a researcher knows at least the basic statistical concepts used at the various stages of a research study. This helps the researcher conduct a well-designed study leading to valid and reliable results that can be generalized to the population. A well-designed study has fewer biases, which in turn gives precise, valid and reliable results. Many statistical methods and tests are used at the various stages of research. In this communication, we discuss the overall importance of statistical considerations in medical research, with the main emphasis on estimating the minimum sample size for different study objectives.
Keywords: Applications, health sciences research, sample size, statistics
How to cite this article: Binu V S, Mayya SS, Dhar M. Some basic aspects of statistical methods and sample size determination in health science research. AYU 2014;35:119-23
How to cite this URL: Binu V S, Mayya SS, Dhar M. Some basic aspects of statistical methods and sample size determination in health science research. AYU [serial online] 2014 [cited 2020 Feb 28];35:119-23. Available from: http://www.ayujournal.org/text.asp?2014/35/2/119/146202
Introduction   
Statistics is a branch of science concerned with (i) the collection, organization, summarization and analysis of data and (ii) the drawing of inferences about a whole body of data when only a part of the data is observed. ^{[1]} Statistics has a role throughout a study, from planning, designing and collecting data to analyzing the data and drawing meaningful interpretations from it. Many health science researchers may wonder, "Why are statistical methods so important in research?" The answer is that "bad statistics lead to bad research and bad research is unethical." ^{[2]} Statistical methods revolutionized science in the 20th century, ^{[3]} with a vast majority of advanced methodological developments taking place during this period. Further, the advent of high-end computers toward the end of the last century enabled researchers to apply advanced statistical methods with computational ease. This ease of computation has, at least to some extent, changed the course of analytical considerations. For example, sophisticated multivariate methods are no longer considered difficult to apply from a computational point of view. Today, statistics is an indispensable tool in every field of health science research, whether it is Medicine, Ayurveda, Pharmacy, Dentistry or other allied health sciences. Statistics also helps clinicians extract vital information from empirical data, which ultimately leads to improved patient care. Statistical concepts need to be considered throughout a study, from planning to the final reporting stage. This article provides a brief overview of the statistical methods used at various stages of a research study, with the main emphasis on estimation of the minimum sample size for various types of objectives.
Role of Statistics in Research Studies   
The first step in any research study is to clearly state the aim of the study, followed by its objectives, which are focused on achieving the aim. Many use the terms aim and objective interchangeably, although the two have clearly distinct meanings. The aim of a study is a general statement of the main, broad study question and is therefore not measurable. The objectives of a research study, on the other hand, should be SMART (Specific, Measurable, Achievable, Relevant and Time-bound) and limited in number. Before stating the objectives, the researcher should know the types of variables and/or attributes being assessed in the study, especially the exposure and outcome measurements and, in analytical studies, the distinction between the two. This is required to ensure that the objectives have the characteristics stated above.
The second step is to choose the appropriate study design for meeting the stated objectives. The choice depends on many factors, such as the study question/aim, the availability of funds and other logistic resources, and the type of study subjects/variables.
After selecting the study design, the third step is to define the target population and the method of selecting study participants from this population. Statistical techniques called sampling methods are used for selecting subjects from a study population to form the sample. There are broadly two types of sampling methods, namely probability (random) sampling and non-probability (non-random) sampling. Statistical methods for inference have been developed on the assumption that study participants are randomly selected, i.e., the sampling method used should be a probability sampling method. Even within probability sampling, there are a number of ways of selecting a sample randomly.
The choice of probability sampling method to be used in a study depends on the objective of the study, heterogeneity of characteristics of individuals in the study population and also the feasibility of getting study participants in terms of cost, time and manpower.
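As an illustration of two probability sampling methods, the following minimal Python sketch draws a simple random sample and a proportionally allocated stratified random sample from a sampling frame. The function names, the frame of 1000 patient IDs and the urban/rural strata are all hypothetical and serve only to show the mechanics.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement, each unit equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Draw the same sampling fraction independently within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical sampling frame: 1000 patient IDs, 300 urban and 700 rural.
frame = [(i, "urban" if i < 300 else "rural") for i in range(1000)]

srs = simple_random_sample(frame, 100, seed=1)
strat = stratified_sample(frame, lambda unit: unit[1], 0.10, seed=1)
```

With a heterogeneous population, the stratified draw guarantees that each stratum is represented in proportion to its size, whereas the simple random sample achieves this only on average.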
Once the study design and sampling technique have been finalized, the fourth step in a study is to estimate the sample size. This aspect is dealt with in detail in the present communication.
The fifth step in a research study is to decide on the data collection tools to be used for collecting relevant information from the selected participants. Some studies require the researcher to develop new tools for data collection. Statistical methods are used for the development and validation of research tools. Reliability coefficients such as Cronbach's alpha, the intraclass correlation coefficient and the split-half coefficient are used for measuring the reliability of a questionnaire. Another aspect of tool preparation is validity, which requires examining content, criterion and face validity. Once data collection is over, the data have to undergo quality checking, coding and computer entry.
The sixth step in a research study is to perform the appropriate statistical analysis. The first stage of data analysis is to compute descriptive statistics for all important variables in the study. Descriptive statistics summarize various aspects of the data, giving details about the selected sample. ^{[4]} To summarize the data, we use statistical summary measures such as the mean, median, standard deviation, interquartile range and percentages, depending upon the type of variable, the measurement scale used and the variability in each of the variables. Descriptive statistics are followed by inferential statistics, which deal with generalization of the sample results to the population. Confidence intervals (CIs) are presented to take account of the sampling error involved in the estimates, and tests of significance are applied to find whether the results observed in a sample are due to chance or not. There are many statistical tests, and the choice depends on many factors such as the objective of the study, the type of variables and measurement scales used, the sample size, the number of groups to be compared, the number of variables and the distribution of the outcome variable. ^{[5]} Statistical hypothesis tests are broadly classified into two categories, based on whether or not they require the assumption that the variable under study is normally distributed. Parametric tests have been built on the assumption that the variable under consideration follows the normal distribution. Nonparametric tests, on the other hand, do not require any distributional assumption about the variable under consideration in the study population. Hence, nonparametric tests are called distribution-free tests. However, one should know that the ease of application of nonparametric tests comes at the cost of sacrificing the power of the study, a main concern in clinical studies.
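The descriptive summary measures and the CI mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed analysis: the function name `describe` and the hemoglobin values are hypothetical, and the CI uses the normal approximation (for small samples a t-distribution quantile would usually replace the normal one).

```python
import statistics
from statistics import NormalDist

def describe(values, confidence=0.95):
    """Summary measures plus a normal-approximation CI for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)                       # sample SD (n - 1 denominator)
    q1, median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    half_width = z * sd / n ** 0.5                      # z times the standard error
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "median": median,
        "iqr": q3 - q1,
        "ci": (mean - half_width, mean + half_width),
    }

# Hypothetical hemoglobin values (g/dL), invented for illustration only.
hemoglobin = [11.2, 12.5, 13.1, 12.8, 14.0, 11.9, 13.4, 12.2, 13.8, 12.6]
summary = describe(hemoglobin)
```

The mean and standard deviation suit approximately symmetric data, while the median and interquartile range are the usual choice for skewed variables.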
The final step in a research study is to communicate the results and interpretations using appropriate figures and tables. A table of appropriate summary measures for the variables in the study, with the number of subjects in each category, is necessary to describe the sampled data. Reporting a CI for every estimated population parameter has become a mandatory prerequisite in modern research. Several authors have discussed the principles and reporting of CIs. ^{[6],[7],[8],[9],[10],[11],[12],[13],[14]} It is recommended to report both the CI and the exact P value instead of reporting only P < 0.05 or P < 0.01. The next section discusses the determination of the minimum sample size required for studies with common types of objectives.
Estimation of Minimum Sample Size Required for a Study   
The problem of sample size estimation is broadly of two types, namely (a) sample size for an estimation study and (b) sample size for a hypothesis testing study, i.e., a comparison study. In an estimation study, the researcher is interested in estimating the magnitude of one or more characteristics of the population, called parameter(s), for example, the mean hemoglobin level or the prevalence of arthritis. In hypothesis testing studies, the investigators are interested in comparing a characteristic of one population at two or more time points, or a characteristic of two or more populations, for example, comparing the prevalence of arthritis before and after administration of some intervention or between two populations. The objective of calculating the sample size in an estimation study is to estimate the value of the parameter under study with a prefixed precision and level of confidence. If a researcher wants the estimate to be more precise, a larger number of subjects must be selected, i.e., as precision increases (or the margin of error decreases) the minimum sample size required increases. Similarly, the sample size increases with the level of confidence. For example, the sample size required for estimating a parameter with a 99% confidence level is larger than that required for a 95% confidence level. In hypothesis testing studies, the objective of sample size calculation is to achieve a desired power for detecting a clinically or scientifically meaningful difference at a prefixed level of significance. ^{[15]} The power of a study is the probability of rejecting a null hypothesis that is false. The level of significance is the threshold set on the probability of rejecting a null hypothesis that is true. In this section, we see how one can find the minimum sample size for (a) estimating a population mean, (b) estimating a population proportion, (c) testing the equality of two means and (d) testing the equality of two proportions.
Sample size for estimating a population mean
The sample size formula for estimating a population mean is given by
$$n = \frac{Z_{1-\alpha/2}^{2}\,\sigma^{2}}{d^{2}}$$
where $Z_{1-\alpha/2}$ is the value obtained from the standard normal distribution table for the 100(1 - α)% confidence level. Its value is 1.96 for the 95% confidence level and 2.58 for the 99% confidence level.
σ is the population standard deviation, which is unknown most of the time. The value of σ can be taken from a similar published study or from a pilot study. For the pilot study, there is no need for sample size estimation; one can do a pilot study with the number of subjects available. d is the absolute allowable error (or precision) in the estimation.
Thus, the minimum sample size (n) required for estimating a population mean increases with the level of confidence and with the standard deviation, and decreases as the absolute error allowed in the estimation increases. More precisely, n is proportional to the square of the standard deviation and inversely proportional to the square of the allowable error.
For example, an Ayurveda physician wants to estimate the average age of patients visiting his clinic. If the allowable error in estimating the average age is ±2 years, the confidence level is 95% and the standard deviation of the age of patients visiting the clinic is assumed to be 8 years, the minimum sample size required for his study is obtained as
$$n = \frac{(1.96)^{2}\,(8)^{2}}{(2)^{2}} = 61.47 \approx 62$$
Hence, the physician needs a minimum of 62 patients.
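The calculation for this example can be checked with a short Python sketch. It assumes the usual normal-approximation formula, n = (Z × standard deviation / d) squared, rounded up to the next whole subject; the function name `n_for_mean` is hypothetical, and the Z value is taken from the exact normal quantile rather than the rounded table value.

```python
import math
from statistics import NormalDist

def n_for_mean(sd, d, confidence=0.95):
    """Minimum n to estimate a mean within +/- d at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil((z * sd / d) ** 2)                 # round up to whole subjects

# Clinic example from the text: SD = 8 years, allowable error = 2 years.
n_95 = n_for_mean(sd=8, d=2)                    # 95% confidence
n_99 = n_for_mean(sd=8, d=2, confidence=0.99)   # 99% confidence needs more subjects
```

Halving the allowable error d quadruples the required sample size, because d enters the formula squared.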
Sample size for estimating a population proportion (or prevalence)
The sample size formula for estimating a population proportion is given by
$$n = \frac{Z_{1-\alpha/2}^{2}\,(1-P)}{\varepsilon^{2}\,P}$$
where $Z_{1-\alpha/2}$ is as defined earlier, P is the anticipated proportion of the condition in the study population and ε is the relative precision. An estimate of the anticipated proportion in the study population can be obtained from previous studies conducted in the same population or from a pilot study.
Thus, for a given relative precision, the minimum sample size n required for estimating the proportion of a rare condition in the study population is large compared with that for a common condition. Furthermore, the sample size decreases as the value of the relative precision ε increases.
For example, a survey was planned to assess the prevalence of current Ayurveda use among an urban adult population in India. It was anticipated that the current prevalence of Ayurveda use in the study population was 30%. The minimum sample size required for this study for a relative precision of 10% and a confidence level of 95% is given by
$$n = \frac{(1.96)^{2}\,(1-0.30)}{(0.10)^{2}\,(0.30)} = 896.4 \approx 897$$
Hence, the above study requires a minimum of 897 participants to estimate the prevalence of Ayurveda use in the study population with relative precision of 10% and confidence level of 95%.
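A minimal Python sketch of this calculation follows, assuming the relative-precision formula n = Z squared × (1 - P) / (relative precision squared × P). The function name `n_for_proportion` is hypothetical. Note that a relative precision of 10% on a prevalence of 30% corresponds to an absolute precision of 3 percentage points.

```python
import math
from statistics import NormalDist

def n_for_proportion(p, rel_precision, confidence=0.95):
    """Minimum n to estimate a proportion p within rel_precision * p."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * (1 - p) / (rel_precision ** 2 * p))

# Survey example from the text: anticipated prevalence 30%, relative precision 10%.
n_survey = n_for_proportion(p=0.30, rel_precision=0.10)

# A rarer condition (5%) at the same relative precision needs far more subjects.
n_rare = n_for_proportion(p=0.05, rel_precision=0.10)
```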
Sample size for testing equality of two population means
If the primary objective of a study is to test the null hypothesis of equality of the means of two independent populations, then the formula for estimating the sample size in each study group is given by
$$n = \frac{2\,S^{2}\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^{2}}{d^{2}}$$
where n is the minimum sample size required in each group (equal allocation, and hence the total sample size is 2n), $Z_{1-\alpha/2}$ is the two-tailed standard normal distribution table value for the 100α% level of significance and $Z_{1-\beta}$ is the standard normal distribution table value for 100β% Type II error, i.e., 100(1 - β)% power of the study. Usually, the power of a study is selected as 80% (β = 0.2) or 90% (β = 0.1), for which the value of $Z_{1-\beta}$ is 0.842 and 1.282, respectively.
S is the pooled standard deviation (or the average of the two population standard deviations), which can be obtained from previous literature or from a pilot study.
The most important component in sample size estimation for hypothesis testing is fixing the minimum clinically or scientifically meaningful difference between two population means, which is given by d in the above equation. The researcher has to fix the value of d based on clinical or scientific judgment and not based on results of the pilot study or previous literature. The clinically significant difference should be a difference that makes some real difference. Thus, the minimum sample size required for testing the equality of two population means depends on:
 The level of significance (inversely proportional)
 Power of the study (directly proportional)
 Pooled standard deviation of the variable under study (directly proportional)
 The minimum clinically significant difference (inversely proportional).
The above formula is used for selecting an equal number of subjects in each study group. Sometimes the investigators want a higher number of subjects in the control/standard drug group. For example, in a clinical trial the research team may want more subjects to be recruited in the standard treatment arm than in the new treatment arm. If the sample size $n_c$ in the standard treatment group is to be k times the size $n_t$ of the new treatment group, i.e., $n_c = k\,n_t$, then $n_t$ is given by
$$n_t = \left(\frac{k+1}{k}\right)\frac{S^{2}\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^{2}}{d^{2}}$$
Thus, the total number of subjects required for a clinical trial in which the allocation to the treatment and control groups is in the ratio 1:2 is 12.5% more than that required for 1:1 allocation. Similarly, a study with 1:3 allocation requires a total sample size about 33% larger than that required for 1:1 allocation. ^{[16]} The sample size required for comparison of more than two population means depends on the number of groups to be compared, in addition to all the above-mentioned factors. In this case, the value of $Z_{1-\alpha/2}$ has to be adjusted for the number of comparisons made between the study populations. Hence, the sample size in this situation increases with the number of group comparisons.
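The 12.5% and 33% figures can be verified with a small sketch. It assumes that, for 1:k allocation, the new treatment group needs (k + 1)/k times the common factor S²(Z values summed)²/d², so that the total (treatment plus k times treatment) relative to the 1:1 total reduces to (k + 1)² / (4k). The function name `total_inflation` is hypothetical.

```python
def total_inflation(k):
    """Total sample size under 1:k allocation relative to 1:1 allocation.

    With common factor A = S^2 (Z_alpha + Z_beta)^2 / d^2:
      1:k total = n_t * (1 + k) = ((k + 1) / k) * A * (k + 1)
      1:1 total = 2 * (2 * A)   = 4 * A
    so the ratio is (k + 1)^2 / (4 * k).
    """
    return (k + 1) ** 2 / (4 * k)

ratio_1_to_2 = total_inflation(2)  # 1.125, i.e. 12.5% more subjects in total
ratio_1_to_3 = total_inflation(3)  # about 1.333, i.e. about 33% more
```

The ratio equals 1 only when k = 1, which is why equal allocation is the most efficient choice when both arms cost the same to recruit.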
For example, a randomized controlled trial has been planned to compare the efficacy of an Ayurvedic medication with a placebo in controlling thyroid stimulating hormone (TSH) levels among subjects with hypothyroidism. The researchers consider a difference of 3 mIU/L in TSH level between the two groups as clinically important, and they want 80% power to detect this difference at the 5% level of significance. A pilot study showed the pooled standard deviation of TSH levels to be 5 mIU/L. Then, the minimum number of subjects required in each group is computed as
$$n = \frac{2\,(5)^{2}\,(1.96 + 0.842)^{2}}{(3)^{2}} = 43.62 \approx 44$$
Hence, a minimum of 44 subjects is required in each group, i.e., 88 in total.
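The same calculation can be sketched in Python, assuming the equal-allocation formula n = 2 S² (Z for significance + Z for power)² / d² per group. The function name `n_per_group_two_means` is hypothetical, and exact normal quantiles are used in place of rounded table values.

```python
import math
from statistics import NormalDist

def n_per_group_two_means(sd, d, alpha=0.05, power=0.80):
    """Minimum n per group to detect a mean difference d (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for 80% power
    return math.ceil(2 * (sd * (z_alpha + z_beta) / d) ** 2)

# TSH example from the text: pooled SD = 5 mIU/L, meaningful difference = 3 mIU/L.
n_80 = n_per_group_two_means(sd=5, d=3)              # 80% power
n_90 = n_per_group_two_means(sd=5, d=3, power=0.90)  # 90% power needs more subjects
```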
Sample size for testing equality of two population proportions
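The sample size required for testing the null hypothesis of equal proportions of a particular condition under study in two independent populations is given by
$$n = \frac{\left[Z_{1-\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} + Z_{1-\beta}\sqrt{P_1(1-P_1) + P_2(1-P_2)}\right]^{2}}{(P_1 - P_2)^{2}}$$
Here $\bar{P} = (P_1 + P_2)/2$ denotes the average of the two anticipated proportions, and $Z_{1-\alpha/2}$ and $Z_{1-\beta}$ are the standard normal values for the level of significance and the power.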
Here, n is the minimum number of subjects required in each group, and $P_1$ and $P_2$ are the proportions of the disease or condition expected in the two independent populations. In the above equation, the denominator term $P_1 - P_2$ is the minimum clinically significant difference between the two proportions and, as mentioned earlier, the researcher has to fix it. Once the researcher has fixed the minimum clinically significant difference and knows either $P_1$ or $P_2$, the proportion in the other group can be obtained.
For example, it was planned to conduct a randomized controlled trial to compare the percentage survival 5 years after diagnosis of breast cancer between two treatment groups. One half of the study participants with confirmed breast cancer will be randomly allocated to the intervention (new treatment) group and the other half to the control (standard treatment) group. The investigators consider a difference of 6% between the two groups as clinically significant. It is known that the percentage survival after 5 years in the standard treatment group is 70%. One would then expect a percentage survival of 76% (or 64%) in the intervention group. The minimum sample size required in each group at 80% power and 5% level of significance is then obtained (assuming no loss to follow-up).
Approximately 860 breast cancer patients are required in each group to detect a clinically significant difference of 6% between the two treatment groups at 80% power and 5% level of significance. In the above example, if the clinically significant difference is fixed at 5% (i.e., in the intervention group we expect 75% survival after 5 years), then the minimum number of subjects required in each group is 1251.
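The figures above can be approximately reproduced with the following sketch. It assumes one widely used normal-approximation formula for comparing two proportions, in which the significance term uses the average proportion (P1 + P2)/2 and the power term uses the two separate proportions; this choice of formula is an assumption made for illustration, and other variants give slightly different numbers. The function name `n_per_group_two_proportions` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Minimum n per group to detect p1 versus p2 (two-sided test).

    Assumed formula (normal approximation, equal allocation):
      n = [z_a * sqrt(2 * pbar * (1 - pbar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))]^2 / (p1 - p2)^2
    with pbar = (p1 + p2) / 2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    term_null = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil((term_null + term_alt) ** 2 / (p1 - p2) ** 2)

# Breast cancer example from the text: 70% survival on standard treatment.
n_diff_6 = n_per_group_two_proportions(0.70, 0.76)  # 6% difference
n_diff_5 = n_per_group_two_proportions(0.70, 0.75)  # 5% difference
```

Under these assumptions the sketch returns 859 and 1251 subjects per group, in line with the approximate figures of 860 and 1251 quoted above. Reducing the difference to be detected from 6% to 5% raises the requirement by nearly half.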
The four sample size formulas given above assume that the sample is selected using simple random sampling and that the sampling distribution of the estimate is normal. Many more sample size formulae are available in the literature for other situations; for example, if one wants to estimate a correlation coefficient or test a regression coefficient, then a different set of formulas is needed. Thus, the sample size formula to be used in a study depends on:
 The main objective of the study
 The primary outcome variable
 The study design used
 The sampling technique used and
 The summary statistics and the type of statistical analysis (estimation or testing of hypothesis) used for the main objective in the study.
There is an abundance of literature available regarding the choice and principles of sample size estimation. ^{[16],[17],[18],[19],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31],[32]} Furthermore, many statistical packages provide estimates of the minimum sample size. However, the researcher must supply the relevant inputs in order to get the correct number, and to do that the researcher should have a complete conceptual understanding of how the minimum sample size is determined.
Conclusion   
Statistical methods are applied in almost all stages of a research study from the planning and design stage to the final reporting of findings. Utmost care must be taken in choosing the right statistical method at each stage of a study. Sample size estimation is very important and choosing the right formula is crucial in all types of research studies. The selection of appropriate sample size formula depends on the primary objective of the study, the type of outcome variable, study design used, the statistical analysis planned, number of groups in the study and sampling technique to be adopted. The number of subjects to be recruited in a study depends on the power of the study, the precision of the estimated value, level of significance or confidence level, clinically significant difference and also other constraints such as money, manpower, availability of subjects and time in particular and feasibility in general. There are many statistical tests and the choice of appropriate analysis depends on the objective of the study, types of the outcome variable, number of variables, sample size, number of groups in the study, whether the groups are related or not and also on distributional assumptions.
References   
1. Daniel WW. Biostatistics: A Foundation for Analysis in Health Sciences. 7th ed. Singapore, Asia: John Wiley and Sons Pte. Ltd.; 2004.
2. Bland M. An Introduction to Medical Statistics. 2nd ed. England: Oxford University Press; 1995.
3. Applegate KE, Crewson PE. Statistical literacy. Radiology 2004;230:613-4.
4. Larson MG. Descriptive statistics and graphical displays. Circulation 2006;114:76-81.
5. McCluskey A, Lalkhen GA. Statistics III: Probability and statistical tests. Contin Educ Anaest Crit Care Pain 2007;7:167-70.
6. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ 2004;171:611-5.
7. Curran-Everett D. Explorations in statistics: Confidence intervals. Adv Physiol Educ 2009;33:87-90.
8. du Prel JB, Hommel G, Röhrig B, Blettner M. Confidence interval or P-value?: Part 4 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2009;106:335-9.
9. Gardner MJ, Altman DG. Confidence intervals rather than P-values: Estimation rather than hypothesis testing. Br Med J (Clin Res Ed) 1986;292:746-50.
10. Poole C. Low P-values or narrow confidence intervals: Which are more durable? Epidemiology 2001;12:291-4.
11. Rigby AS. Getting past the statistical referee: Moving away from P-values and towards interval estimation. Health Educ Res 1999;14:713-5.
12. Potter RH. Significance level and confidence interval. J Dent Res 1994;73:494-6.
13. Evans SJ, Mills P, Dawson J. The end of the P value? Br Heart J 1988;60:177-80.
14. Attia A. Why should researchers report the confidence interval in modern research? Middle East Fertil Soc J 2005;10:78-81.
15. Chow CS, Wang H, Shao J. Sample Size Calculation in Clinical Research. USA: Chapman and Hall/CRC Press; 2003.
16. Wittes J. Sample size calculations for randomized controlled trials. Epidemiol Rev 2002;24:39-53.
17. Endacott R, Botti M. Clinical research 3: Sample selection. Accid Emerg Nurs 2007;15:234-8.
18. Neely JG, Karni RJ, Engel SH, Fraley PL, Nussenbaum B, Paniello RC. Practical guides to understanding sample size and minimal clinically important difference (MCID). Otolaryngol Head Neck Surg 2007;136:14-8.
19. Devane D, Begley CM, Clarke M. How many do I need? Basic principles of sample size estimation. J Adv Nurs 2004;47:297-302.
20. Columb OM, Stevens A. Power analysis and sample size calculations. Curr Anaesth Crit Care 2008;19:12-4.
21. Brasher PM, Brant RF. Sample size calculations in randomized trials: Common pitfalls. Can J Anaesth 2007;54:103-6.
22. Schulz KF, Grimes DA. Sample size calculations in randomised trials: Mandatory and mystical. Lancet 2005;365:1348-53.
23. Jones SR, Carley S, Harrison M. An introduction to power and sample size estimation. Emerg Med J 2003;20:453-8.
24. Donner A. Approaches to sample size estimation in the design of clinical trials - A review. Stat Med 1984;3:199-214.
25. Lwanga KS, Lemeshow S. Sample Size Determination in Health Studies: A Practical Manual. Geneva: WHO; 1992.
26. Campbell MJ, Julious SA, Altman DG. Estimating sample sizes for binary, ordered categorical, and continuous outcomes in two group comparisons. BMJ 1995;311:1145-8.
27. Parker AR, Bermman GN. Sample size: More than calculations. Am Stat 2003;57:166-70.
28. Lenth VR. Some practical guidelines for effective sample size determination. Am Stat 2001;55:187-93.
29. Eng J. Sample size estimation: How many individuals should be studied? Radiology 2003;227:309-13.
30. Florey CD. Sample size for beginners. BMJ 1993;306:1181-4.
31. Woodward M. Sample size, power and minimum detectable relative risk in medical studies. J R Stat Soc Series D Stat 1992;41:185-96.
32. Bacchetti P, Wolf LE, Segal MR, McCulloch CE. Ethics and sample size. Am J Epidemiol 2005;161:105-10.
