Issues in Data Collection and Statistical Data Analysis

In this blog post, we discuss issues in data collection and statistical data analysis.

Introduction

Mathematics, Physics, Chemistry, Zoology and Botany are regarded as the basic sciences, and of these Mathematics is often called the ‘language of science’. Indian students study these basic sciences from the Vth standard onwards. In the 10+2+3 educational system, students can choose a faculty after the Xth standard and a specialization after the XIIth standard. Some elementary concepts of Statistics are included in the XIth and XIIth standard curricula of the commerce and science faculties, and rigorous instruction following a prescribed curriculum is available at some universities for graduation in the science faculty. Statistics is usually referred to as a ‘service science’. Its most common definition is “an art as well as a science of data collection, data presentation, data analysis and interpretation of results”.

Thus, according to this definition, data analysts or data scientists are expected to have the skill and the scientific approach needed for the four steps it names, viz.:

1) Data collection
2) Data presentation
3) Data analysis and
4) Interpretation of results.

The present Statistics curriculum, and the instructional training based on it, focuses heavily on data presentation (classification, tabulation, graphs, charts, diagrams, plots).

The practical course curriculum in Statistics also gives exposure to analyzing secondary data with the most commonly used statistical tools (measures of central tendency, measures of dispersion, correlation and regression, testing of hypotheses, design of experiments, sampling, and fitting of probability distributions, including computation of probabilities).

The output of the analysis is summarized as conclusions, which are then briefly interpreted. In short, the statistics curriculum in general neglects the very first step, data collection, and provides limited avenues for data analysis. Kenett and Thyregod (2006) outlined the statistical consulting cycle and elaborated on the steps of problem elicitation, data collection, formulation of findings and presentation of findings, but they did not discuss data analysis. The present paper is an attempt to discuss various issues in data collection and data analysis. Prepared in cooperation with Elinext software development company.

In the complete process of statistical consultancy, data collection is the most basic and fundamental step. Relevant, consistent, accurate, complete and up-to-date data is the backbone of all further steps. The process of data collection, and the instrument used for it, differs across research domains. Survey methodology with a questionnaire is the most common approach to data collection in the social sciences, management sciences, agricultural extension, marketing and media research.

Similarly, data analysis is a core part of statistical consultancy.
Data analysis is a three-tier system:

1) Exploratory
2) Explanatory and
3) Confirmatory data analysis.

To reach more meaningful, value-added conclusions, the data analyst has the option of analyzing the data with multivariate statistical tools. Conceptual background, scientific awareness and software literacy are the essential qualities of a data analyst or data scientist. The following sections briefly discuss the important issues that may arise in the overall statistical consultancy.

1. Important issues in data collection

The important issues involved in data collection are as follows.

Deciding the possible target population units and population size.

Usually, in survey-based research such as that in management and the social sciences, the population is not closed and may be highly scattered. Population units are difficult to identify, and the population size is usually unknown and constantly changing. Secondary sources (books, annual reports, project reports, journals, encyclopedias, magazines, the internet, registers and paper clippings) may be tapped to get some inputs on this issue.

Cost of sampling, time required for sampling.

Sampling as such does not follow any typical pattern in survey-based studies. Reaching the sampled units may be uncertain, and even when they are reached, their response behavior is unpredictable. It is therefore quite difficult to estimate the sampling cost and to schedule the time required. In such cases, the researcher is advised to seek guidance and expert opinion from senior researchers or the research guide.

Choice of sampling method.

Theoretically, every sampling method has a well-defined environment, with its own constraints or boundaries. For example, simple random sampling is more suitable when the population is homogeneous. But the decision about whether a population is homogeneous or heterogeneous is highly subjective; there is no statistical instrument that declares a given population to be homogeneous. Thus, the choice of any sampling method has certain merits and demerits in a given situation. The overall gist of the literature review is useful for making an appropriate choice of sampling method.

Decision about sample size.

Every sample size formula is based on parameters such as population size, variability in the population, level of accuracy, cost of sampling (direct and indirect) and time required for sampling. It is quite difficult either to know the exact values of these parameters or to predict them; therefore, deciding the sample size is a crucial issue.
Looking into these constraints, Krejcie and Morgan (1970) have given a ready reckoner table of sample sizes for use when the population size is known.
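The table of Krejcie and Morgan (1970) is generated from a closed-form formula, s = X²NP(1−P) / (d²(N−1) + X²P(1−P)). The sketch below evaluates it with the values usually quoted for the table (X² = 3.841 for 95% confidence, P = 0.5, d = 0.05). The function name is hypothetical, and rounding to the nearest whole number is an assumption made here for illustration.

```python
def krejcie_morgan_sample_size(population_size, chi_square=3.841,
                               proportion=0.5, margin=0.05):
    """Sample size for a known, finite population (Krejcie and Morgan formula).

    chi_square : chi-square value for 1 degree of freedom at the chosen
                 confidence level (3.841 corresponds to 95%).
    proportion : assumed population proportion (0.5 gives the largest sample).
    margin     : acceptable margin of error, as a proportion.
    """
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    spread = chi_square * proportion * (1 - proportion)
    numerator = spread * population_size
    denominator = margin ** 2 * (population_size - 1) + spread
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(numerator / denominator)


if __name__ == "__main__":
    for n in (100, 500, 1000):
        print(n, krejcie_morgan_sample_size(n))
```

Because the required sample grows much more slowly than the population, even a rough estimate of the population size is usually enough to use the formula.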

Data collection instrument

In survey-based research, the personal interview and the questionnaire are usually the two major instruments of data collection. The mechanism of a personal interview differs from that of a questionnaire, but in most surveys a suitable blend of the two is applied. Moreover, this blend differs from respondent to respondent, so the approach carries the merits and demerits of both. It is therefore suggested that the researcher should understand the basic difference between a personal interview and a questionnaire, so as to adopt the proper technique for data collection.

Salient features of questionnaire

The most important features of any questionnaire are reliability, validity, objectivity, sensitivity and specificity. In psychology, questionnaires, usually known as tests, are developed for global studies and are therefore normally standardized through repeated application. The psychologist developing a test takes proper care to ensure its reliability, validity, objectivity, sensitivity and specificity. In survey-based studies, by contrast, research topics are normally case-study oriented and may have less global applicability, so these features receive less care. In fact, the researcher must give due attention to quantifying these features and thereby ensure a standardized format for the proposed questionnaire.
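As one way of quantifying reliability, the sketch below computes Cronbach's alpha, a widely used internal-consistency coefficient. The choice of this coefficient is an assumption made here for illustration, since the text above does not name a specific measure. The function name and the pilot scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a set of questionnaire items.

    item_scores: list of items, each a list of numeric scores with one
    entry per respondent (all items answered by the same respondents).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are needed")
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_variance_sum = sum(variance(scores) for scores in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


if __name__ == "__main__":
    # Hypothetical pilot data: 3 items, 5 respondents, 5-point scale.
    pilot = [
        [4, 3, 5, 2, 4],
        [5, 3, 4, 2, 4],
        [4, 2, 5, 3, 5],
    ]
    print(round(cronbach_alpha(pilot), 3))
```

Values closer to 1 indicate that the items move together; what counts as acceptable depends on the research domain.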

Language of questionnaire, respondent and communication.

In many research studies, the language of the research proposal, research project and final research report is English, which differs from the local language. In such research, data may have to be collected from a target population that does not know the language of the research. In such cases, researchers have to draft the questionnaire in the research language first and then translate it accurately into the local language; during the field survey, the local language is used for communication and hence for data collection.

Possibility of non-response and chances of errors.

In survey-based research, when data is collected through a questionnaire, the issue in the researcher's mind may not be conveyed properly to respondents, which may bring non-response or errors. Misunderstood or incomplete questions may also generate non-response or errors. In many closed-ended questions the options are ambiguous, irrelevant or overlapping, which adds to non-response or error. Questions relating to religious beliefs, religious traditions, emotions or income may also bring non-response or errors. To minimize the rate of non-response or error, the researcher should draft a questionnaire without any ambiguity: each question must be clear in meaning, the questions must follow a logical sequence, and the options must be well defined, unambiguous and non-overlapping. Sensitive questions must be included with some skill. In short, taking utmost care of the salient features of the questionnaire may help to reduce the rate of non-response. It is also essential to build the respondent's confidence by declaring that the data collected will be kept confidential.

Sampling experiences.

In the social sciences, data is usually collected from individuals or households. Normally, the research investigator plans to contact the target individuals during the daytime, say between 8 a.m. and 6 p.m. In the early part of this time span individuals are busy getting ready for their business or service, so they may not be able to spare time for the investigator. After 11 a.m. they are away from home, so the investigator may meet only elder family members, children or women. It is a common observation that, in the absence of the responsible male of the family, women avoid giving information. Even if a male respondent is available, there are chances of delay in response, for which regular follow-up becomes an essential part of the survey. If the survey concerns sensitive issues, there is a common tendency towards either non-response or incorrect response. In many cases it is also observed that respondents raise too many questions and doubts. If the survey is related to women, in some cases a male family member responds on behalf of the female respondent. In management science studies, if the target population is from the industrial sector, a proper response may not be received on the grounds of confidentiality. If the approval mechanism is vertically hierarchical, that too makes the survey difficult to conduct.

Screening and scrutinizing the collected questionnaire.

In most surveys, respondents may leave some questions unanswered; this is called non-response. Sometimes respondents give multiple responses to a question that expects exactly one, which creates confusion or misunderstanding. A good questionnaire includes questions that provide a counter-check on responses, and sometimes this counter-check indicates that a response is irrelevant. In such cases the researcher should screen and scrutinize the questionnaires for relevance and completeness. It is further suggested that the researcher should decide the sample size taking such non-response or irrelevant response into account; additional reading during the literature review may help with this decision. This step becomes less important if utmost care is taken of the salient features of the questionnaire.
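The screening step can be sketched in a few lines. The example below flags unanswered and multiply answered single-choice questions in one returned questionnaire, and inflates a target sample size for an expected share of unusable returns. All function names, the data layout and the n / (1 − r) adjustment are assumptions made for illustration, not a prescribed procedure.

```python
import math


def screen_response(answers, single_choice_questions):
    """List the problems found in one returned questionnaire.

    answers: dict mapping a question id to the list of options the
    respondent marked (an empty list or a missing key means unanswered).
    """
    problems = []
    for question in single_choice_questions:
        selected = answers.get(question, [])
        if not selected:
            problems.append((question, "non-response"))
        elif len(selected) > 1:
            problems.append((question, "multiple response"))
    return problems


def inflate_sample_size(required_usable, expected_unusable_rate):
    """Number of questionnaires to distribute so that, after discarding the
    expected share of unusable returns, about required_usable remain."""
    if not 0 <= expected_unusable_rate < 1:
        raise ValueError("expected_unusable_rate must be in [0, 1)")
    return math.ceil(required_usable / (1 - expected_unusable_rate))


if __name__ == "__main__":
    returned = {"Q1": ["a"], "Q2": [], "Q3": ["b", "c"]}
    print(screen_response(returned, ["Q1", "Q2", "Q3", "Q4"]))
    print(inflate_sample_size(300, 0.25))
```

The unusable rate itself has to come from a pilot survey or from comparable studies in the literature.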

Standardizing the units of measurement (if required).

Usually either the CGS or the MKS system of measurement is applied. But during the survey, respondents may report values arbitrarily, in mixed units, by some analog method, or in local terms. After collecting the data from all respondents, it is necessary to convert such haphazardly reported values into standard, common units of measurement. It is also suggested to follow universal rules or guidelines for such standardization. For larger numerical figures, one should adopt the accounting number system for easy understanding.
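As a small illustration of unit standardization, the sketch below converts weights reported in mixed units into kilograms. The conversion table and function name are hypothetical; a real survey would list whatever units, including local ones, actually appear in the responses.

```python
# Conversion factors to kilograms for the units assumed in this example.
TO_KILOGRAMS = {
    "g": 0.001,
    "kg": 1.0,
    "quintal": 100.0,
    "tonne": 1000.0,
}


def to_kilograms(value, unit):
    """Convert a reported weight to kilograms, rejecting unknown units."""
    key = unit.strip().lower()
    if key not in TO_KILOGRAMS:
        # Unknown or local units are flagged rather than guessed.
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_KILOGRAMS[key]


if __name__ == "__main__":
    reported = [(2, "Quintal"), (500, "g"), (1.5, "tonne"), (40, "kg")]
    print([to_kilograms(value, unit) for value, unit in reported])
```

Raising an error for an unrecognized unit keeps a local term from silently entering the data as if it were already standardized.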

Coding structure of data items.

Coding and decoding is the systematic practice of developing unique, optimally sized codes for the features of the data. Coding helps in data entry and in data analysis, and decoding helps to create software-based tables and graphs. Considering the importance of the coding structure, the researcher is advised to adopt a well-defined one. Before coding, the researcher must give thought to the software that will be used for analysis and visualization.
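A coding structure can be as simple as a codebook that maps each category label to a short numeric code, with a reserved code for missing answers. The codebook, the missing-value code 9 and the function names below are hypothetical choices for illustration.

```python
# Hypothetical codebook for one questionnaire item.
EDUCATION_CODES = {"primary": 1, "secondary": 2, "graduate": 3}
MISSING_CODE = 9  # reserved code for unanswered items (an assumption)


def encode(label, codebook, missing_code=MISSING_CODE):
    """Turn a category label into its numeric code for data entry."""
    if label is None or not label.strip():
        return missing_code
    return codebook[label.strip().lower()]


def decode(code, codebook, missing_code=MISSING_CODE):
    """Turn a numeric code back into its label for tables and graphs."""
    if code == missing_code:
        return None
    labels_by_code = {value: key for key, value in codebook.items()}
    return labels_by_code[code]


if __name__ == "__main__":
    raw = ["Graduate ", "primary", "", "Secondary"]
    coded = [encode(label, EDUCATION_CODES) for label in raw]
    print(coded)
    print([decode(code, EDUCATION_CODES) for code in coded])
```

Keeping the codebook in one place makes it easy to carry the same labels into whichever analysis software is chosen later.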

Choice of data entry worksheet

A wide range of worksheets or spreadsheets is available for data entry, including MS-Excel, MINITAB, MS-Access, MYSTAT, SYSTAT, data frames in R, and SPSS. Every spreadsheet has its own merits and demerits. For example, in MS-Excel every cell may have a different data type, whereas in MS-Access one can fix the data type of a column as a whole. In MINITAB a column created with text-type data is not useful for analysis, whereas SPSS provides a variety of options for declaring the data type of columns. Further, SPSS has an excellent ability to analyze and visualize categorical data. Thus, a comparative study of these different spreadsheets will definitely help the researcher to select an appropriate one for data entry.

Data validation and auditing.

Once the data is entered in the selected spreadsheet, it is necessary to carry out data validation before data analysis. This includes checking that every column contains data of the appropriate type and within the appropriate range. Data validation also ensures the correct entry of categorical data: cases of categorical data entered in a numerical column, and vice versa, must be rectified. Cases of missing data may also be revisited through the process of data validation.
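The checks described above (type, range, category codes and missing values) can be sketched as one rule per column. The column names, the age range of 0 to 120 and the education codes are assumptions chosen only for this example.

```python
def is_valid_age(value):
    """Age must be a whole number in a plausible range (assumed 0 to 120)."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 120


def is_valid_education_code(value):
    """Education must be one of the codes in the (hypothetical) codebook."""
    return value in {1, 2, 3}


def validate_rows(rows, rules):
    """Return (row number, column, problem) for every failed check.

    rows : list of dicts, one per respondent.
    rules: dict mapping a column name to a function that returns True
           when a value is acceptable.
    """
    issues = []
    for row_number, row in enumerate(rows, start=1):
        for column, is_valid in rules.items():
            value = row.get(column)
            if value is None or value == "":
                issues.append((row_number, column, "missing"))
            elif not is_valid(value):
                issues.append((row_number, column, "invalid"))
    return issues


if __name__ == "__main__":
    rules = {"age": is_valid_age, "education": is_valid_education_code}
    rows = [
        {"age": 34, "education": 1},
        {"age": 340, "education": 2},         # out-of-range age
        {"age": "graduate", "education": 27}, # text in a numeric column
        {"age": None, "education": 1},        # missing age
    ]
    for issue in validate_rows(rows, rules):
        print(issue)
```

The list of issues is best resolved by going back to the paper questionnaire, not by correcting values from memory.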

2. Important issues in data analysis and interpretation

The important issues involved in data analysis are as follows.

Choice of statistical software (MS-Excel, Minitab, SPSS, SAS, SYSTAT, R, MyStat).

Although there are many statistical software packages, each has its own importance in the world of statistical analysis, and each has certain limitations. For example, MS-Excel is commonly available, including as a mobile app, and is easy to operate, yet it provides limited statistical utilities. Further, it expects summarized data for data visualization, and it gives statistical output on the same sheet as the data. MINITAB, on the other hand, provides a separate window for output and a separate display for data visualization. In MINITAB the default data visualization is in black and white; the user can apply color and pattern to graphs and charts to make them more readable. SPSS, a package specially designed for statistical analysis of data from the social sciences, has many advantages. It provides a separate worksheet, a separate output file and a large number of decorative features for graphs and charts. The output file can be exported to .doc or .pdf format, and graphs and charts can be saved in different image formats. It can also attach titles to tables, graphs, charts and diagrams. The output tables are neat and suitable for inclusion in reports as they are. A remarkable feature of this software is that it takes proper care of missing data and reports on it. Many other packages similarly have their own merits and demerits. Thus, taking into account the size of the data, its dimensionality and the complexity of its structure, the researcher has to decide on the appropriate software for data analysis.

Importing data from data entry worksheet to software.

As many software packages hold data in worksheet or spreadsheet form, it is necessary to have the skill of importing data from one platform to another. This saves time and allows the data to be analyzed efficiently with different software.

Labeling, scaling and size of data items.

SPSS in particular gives options for labeling, scaling, justification, coding and decoding, and defining the size of each data item. This is a very powerful feature of SPSS that generates neat and readable tables. These tables and charts make it easy to interpret and discuss the hidden features of the data.

Rearranging the data for particular analysis as per software requirement.

This is a typical requirement, mostly in design of experiments. The analyst may have k treatments to compare, and hence k columns of data, but software such as MINITAB or SPSS needs the data rearranged into just two columns, one for the treatment code and the other for the observation. Thus, rearranging the data in particular cases is another skill-based requirement in statistical consultancy.
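The rearrangement from k treatment columns to two columns can be sketched as follows. The function name and the yield figures are hypothetical, and treatments are allowed to have unequal numbers of observations.

```python
def wide_to_long(wide):
    """Rearrange k treatment columns into (treatment code, observation) rows.

    wide: dict mapping a treatment code to its list of observations.
    """
    long_rows = []
    for treatment, observations in wide.items():
        for observation in observations:
            long_rows.append((treatment, observation))
    return long_rows


if __name__ == "__main__":
    # Hypothetical yields under k = 3 treatments.
    yields = {
        "A": [20.1, 19.8, 21.0],
        "B": [22.4, 23.1],
        "C": [18.7, 19.2, 18.9],
    }
    for treatment, observation in wide_to_long(yields):
        print(treatment, observation)
```

The two-column layout also copes naturally with unequal replication, which a k-column worksheet can only represent with empty cells.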

Research objective based choice of statistical analysis tool.

The large number of parametric and non-parametric statistical analysis tools is like a shelf of medicines: the practitioner needs a diagnostic kind of skill to apply the tool that best suits the research objective. The derivation of many statistical tools is based on certain assumptions, and the practitioner is expected to test the validity of these assumptions before making a choice. In the parametric situation in particular, the common assumption is normality. It can be checked with a P-P or Q-Q plot, which also helps to understand the nature of non-normal data, so that the practitioner can apply a suitable transformation to bring the data closer to normality. Thus, matching the statistical tool to the data at hand provides consistent outputs that relate properly to the research objective.
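The points of a normal Q-Q plot can be computed without any plotting library. The sketch below pairs the sorted data with standard normal quantiles and summarizes the straightness of the plot by a correlation coefficient. The plotting positions (i − 0.5)/n, the correlation summary and the log transformation are assumptions chosen for illustration, and the data are made up.

```python
import math
from statistics import NormalDist, mean


def qq_points(data):
    """Pairs (theoretical normal quantile, ordered observation)."""
    n = len(data)
    ordered = sorted(data)
    standard_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention.
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def qq_correlation(data):
    """Correlation between the two axes of the Q-Q plot.

    Values near 1 mean the points lie close to a straight line,
    which is what normally distributed data would produce.
    """
    points = qq_points(data)
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical right-skewed data, e.g. household incomes in thousands.
    incomes = [1, 2, 2, 3, 4, 6, 9, 15, 30, 80]
    logged = [math.log(value) for value in incomes]
    print("raw:", round(qq_correlation(incomes), 3))
    print("log:", round(qq_correlation(logged), 3))
```

A higher value after transformation suggests the transformed data are closer to normal, but the plot itself should still be inspected before choosing a parametric tool.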

Selecting appropriate options available in software toolbox.

Many advanced statistical toolboxes provide a variety of options for a particular tool. For example, in descriptive statistics the available options include mean, mode, median, minimum, maximum, range, quartiles, deciles, percentiles, quantiles, standard deviation, variance, standard error, skewness, kurtosis, arrangement in increasing or decreasing order, and many others. Similarly, in correlation analysis the available choices are Pearson’s coefficient of correlation, Spearman’s coefficient of correlation and Kendall’s tau. Thus, the availability of various options brings the responsibility of choosing the appropriate one.
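To see why the choice of option matters, the sketch below computes Pearson's and Spearman's coefficients for the same data. Spearman's coefficient is obtained here as Pearson's coefficient of the ranks, with tied values given their average rank. The function names and data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    position = 0
    while position < len(order):
        end = position
        # Extend the block while the next value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[position]]:
            end += 1
        shared_rank = (position + end) / 2 + 1
        for j in range(position, end + 1):
            ranks[order[j]] = shared_rank
        position = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # A monotone but non-linear relationship.
    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]
    print("Pearson :", round(pearson(x, y), 3))
    print("Spearman:", round(spearman(x, y), 3))
```

Pearson's coefficient measures linear association, while Spearman's measures monotone association, so the two can differ on the same data and the research objective should decide which one is reported.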

Reading and interpreting the output

The last and most crucial step in data analysis is the interpretation of results. Many software packages provide very extensive output in the form of tables, diagrams, charts and graphs. The researcher is expected to read the complete output thoroughly and extract the crux from the voluminous output. Interpretation is also very important for relating the findings to the objectives of the study. In the end, however complex the research is, the researcher expresses it in the form of a story, in which interpretation helps to give appropriate colors to the different role players.

References

Kenett, R., Thyregod, P., “Aspects of statistical consulting not taught by academia”, Statistica Neerlandica, 2006, 60(3), 396-411.
Krejcie, R. V., Morgan, D. W., “Determining Sample Size for Research Activities”, Educational and Psychological Measurement, 1970, 30, 607-610.