## Statistics Research Principles and Terminologies

*A review of fundamental research principles, terminologies, and the language of statistics.*

In my previous blog, I intuitively introduced the world of statistics in the simplest language possible. I explained what statistics is, its importance, and its use cases with examples. We discovered why anecdotal evidence is not a good basis for drawing conclusions about a population. Although the previous blog is not a prerequisite, check it out if you haven’t already. In this blog, I will unravel the research principles, language, and terminologies from the world of statistics. So, without further ado, let’s get started.

**Population and Sample**

You’re finally ready to get your hands dirty in statistics. So, you decide to find out the average household income of middle-class families in your country. The data you collect would include each and every middle-class family in your country; the middle-class category should be properly defined, and it should contain families of different ethnicities, races, religions, creeds, etc. Because this data contains every member of the group you’re interested in, it is considered a **population**. Thus, a population comprises all the members of the group you are interested in. The number of items in a population can range from one to billions.

If you instead select a few of the middle-class families and use them to infer (conclude) the average household income of the population, on the assumption that your selection generalizes well to the population, you are working with a **sample**. Thus, a sample is a subset of a larger population. Statistics is the study of sample data: its analysis, representation, and interpretation.

Why would you want to work on a sample instead of a population? Because it is often not possible to collect data on every member of a population. A sample, however, is much easier to reach: it is less time-consuming and less costly to study. We prefer drawing samples over analyzing an entire population purely because of time and resources.

**Parameters and Statistics**

The difference between parameters and statistics might be new for many as it was for me. If you calculate the average household income of your population data, what you generate is known as a **parameter**. A parameter is a value generated from, or applied to, a population.

If you calculate the same average for your sample data, however, you get a **statistic**. Statistics are values derived from sample data. Thus, depending on whether we apply a function such as the mean to a population or to a sample, what we get is a parameter or a statistic, respectively.
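The parameter/statistic distinction can be made concrete with a short sketch. The incomes below are simulated, made-up numbers, used only to show that the same calculation yields a parameter when applied to the whole population and a statistic when applied to a sample:

```python
import random
import statistics

# Hypothetical population: yearly household incomes (in thousands) of
# every middle-class family of interest. Values are simulated, not real data.
random.seed(42)
population = [random.gauss(mu=60, sigma=12) for _ in range(100_000)]

# A parameter: a value computed from the entire population.
population_mean = statistics.mean(population)

# A statistic: the same calculation applied to a sample of the population.
sample = random.sample(population, k=500)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

With a reasonably large random sample, the statistic lands close to the parameter, which is exactly why samples are useful.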

**Descriptive and Inferential Statistics**

Suppose we collected a sample of the yearly incomes of data scientists in the USA. When we’re only interested in describing the characteristics of our data (sample or population), we resort to **descriptive statistics**. Descriptive statistics apply only to the members of the sample or population from which data have been collected. We can calculate the mean or median salary of a data scientist, measure the level of variation in salaries, and perhaps generate a distribution plot. In descriptive statistics, we are concerned only with the data in hand.
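A minimal sketch of descriptive statistics, using made-up salary figures (the numbers are purely illustrative, not real survey data):

```python
import statistics

# Hypothetical sample: yearly salaries (in USD) of eight data scientists.
salaries = [95_000, 110_000, 120_000, 87_000, 150_000, 102_000, 98_000, 130_000]

# Descriptive statistics summarize only the data we actually collected.
mean_salary = statistics.mean(salaries)      # average salary
median_salary = statistics.median(salaries)  # middle value when sorted
spread = statistics.stdev(salaries)          # level of variation in salaries

print("mean:  ", mean_salary)    # 111500
print("median:", median_salary)  # 106000.0
print("stdev: ", round(spread))
```

Nothing here says anything about data scientists outside this sample; that leap is the job of inferential statistics.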

Can we generalize the result of our sample to the larger population? Yes! That’s where **inferential statistics** comes into the picture. Inferential statistics refer to the use of sample data to reach some conclusion (i.e., make some inference) about the characteristics of the larger population that the sample is supposed to represent. To make the leap from sample data to inferences about a population, one must be very clear about whether the sample accurately represents the population. Thus, an important first step is to clearly define the population that the sample is alleged to represent.

**Sampling Methods**

Sampling is the process of drawing a sample from its population. We have a number of methods at our disposal for selecting samples. I’ll intuitively explain three of the most popular ones.

Scenario #1: I want to conduct a survey to determine how satisfied the students of a particular university are. I walk into the university and find an overwhelming number of students all around the campus. They, along with absent students and students enrolled in distance learning, represent my population. There’s absolutely no way I can reach them all, so I obtain the university’s enrollment list and randomly pick 120 students to participate in my survey. This is an example of **random sampling**. In this process, every member of a population has an equal chance of being selected into the sample. A major benefit of random sampling is that any differences between the sample and the population from which it is selected are due not to systematic bias in the selection process but to chance. A larger sample also tends to represent the population better.
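The scenario above can be sketched in a few lines. The roster of 5,000 student IDs is a made-up stand-in for the real enrollment list; `random.sample` gives every ID an equal chance of selection:

```python
import random

# Hypothetical roster: every enrolled student gets an ID (made-up population).
student_ids = list(range(1, 5001))  # 5,000 students in the population

random.seed(7)
# random.sample selects without replacement, and every student has an
# equal chance of being picked, so any sample/population differences
# are due to chance rather than selection bias.
survey_sample = random.sample(student_ids, k=120)

print(len(survey_sample))       # 120 participants
print(len(set(survey_sample)))  # 120 (no student is picked twice)
```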

Scenario #2: Suppose 35% of the middle-class families in my population (say, of India) earn their yearly income through a business; I would then try to match that percentage of business-owning families in my sample. Similarly, if 10% of the population have more than 6 members in their family, 10% of my sample should have more than 6 members in their family. This is known as **representative sampling**, where I purposely select cases so that my sample matches the larger population on specific characteristics. This is costly and time-consuming but increases my chances of being able to generalize the results from my sample to the population.
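One way to sketch representative sampling in code is to sample each group separately in proportion to its share of the population. The population below is a made-up list of 10,000 families tagged by income source, matching the 35% figure from the scenario; `representative_sample` is a hypothetical helper, not a standard library function:

```python
import random

# Hypothetical population: 35% of families earn through a business.
random.seed(0)
population = (["business"] * 3500) + (["other"] * 6500)

def representative_sample(pop, size):
    """Draw a sample whose group proportions match the population's."""
    groups = {g: [x for x in pop if x == g] for g in set(pop)}
    sample = []
    for g, members in groups.items():
        # Each group's share of the sample matches its share of the population.
        k = round(size * len(members) / len(pop))
        sample.extend(random.sample(members, k))
    return sample

sample = representative_sample(population, size=200)
print(sample.count("business"))  # 70, i.e. 35% of 200, matching the population
```

Within each group the members are still chosen at random; only the group proportions are fixed in advance.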

Scenario #3: I want to conduct a statistical study on the fitness level of 10th-grade students, so I select a sample of 200 students from the high school nearest to my residence. This method of selecting samples is called **convenience sampling**. In this method, the researcher generally selects participants on the basis of proximity, ease of access, and willingness to participate. It is not a bad method at all, provided my sample does not differ from my population of interest in ways that influence the outcome of the study. It is definitely less time-consuming and more convenient.

**Types of Variables**

Given data on the 10th-grade students of a country, a student’s age, height, weight, gender, attitude about school, etc. are all known as **variables**. Anything that can be codified and contains more than a single score (value) is a variable. A **constant**, in contrast, takes only a single score. For example, “age” would be treated as a constant if it were 15 years for every student in our sample data. A variable can be **quantitative** (**continuous**) or **qualitative** (**categorical**). For example, “height”, “weight”, and “age” are quantitative variables, whereas “gender” and “attitude about school” are qualitative variables. A quantitative variable indicates some sort of amount, and you can perform mathematical operations on it, such as taking the mean. A qualitative variable, on the other hand, does not indicate more or less of a certain quality. For example, the “gender” variable contains two scores/values, “male” and “female”. One score is not more or less than the other, and you cannot apply mathematical operations to them. The scores represent only a qualitative difference.

A **dichotomous variable** is a qualitative variable with only two different scores (e.g., “gender” variable).

**Scales of Measurement**

There are four different scales of measurement for variables in statistics. A **nominally scaled variable** is one in which the labels used to identify the different levels of the variable carry no weight or numeric value. From our earlier example of sample data, “gender” is a nominal variable. Even if its scores, “male” and “female”, are encoded as 0 and 1 for conducting statistics with computer software, a value of 1 does not indicate a higher score than a value of 0. They are simply labels assigned to each group.
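Encoding a nominal variable for software looks like this; the 0/1 coding below is an arbitrary choice made for illustration, and the numbers function as labels, not amounts:

```python
# Arbitrary label-to-number mapping for a nominal variable.
codes = {"male": 0, "female": 1}
genders = ["female", "male", "female", "female"]

encoded = [codes[g] for g in genders]
print(encoded)  # [1, 0, 1, 1] -- 1 is not "more" than 0, just a different label
```

Averaging these codes would be meaningless as a "typical gender", which is exactly the point of the nominal scale.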

Suppose I collected fan ratings of Avengers Endgame on a scale of 1 to 10, where 1 represents completely dissatisfied and 10 represents completely satisfied. A score of 9 tells me that a fan enjoyed the movie far more than someone who rated it 3. The scores do have weight; they just don’t tell me the measurable difference in satisfaction between a rating of 9 and a rating of 10. Such variables are known as **ordinal variables**. This type of variable fails to answer ‘how much greater or less is one score than another (in terms of a measurable quantity)?’.

The third and fourth scales of measurement are **interval** and **ratio** scales. They contain information about both relative value and distance. Take “height”: if one member of my sample is 170 cm tall, another is 173 cm tall, and a third is 166 cm tall, I know who is tallest and how much taller or shorter each member of my sample is in relation to the others. Whenever a variable is measured using a scale of equal intervals, it falls into one of these two groups. The difference between interval and ratio scales comes into the picture when we consider how they treat a zero value. On an interval scale, zero does not mean “nothing”. For example, the “year” variable may contain a year zero, which still has a meaning with regard to time. The same goes for temperature in degrees Celsius: zero degrees is not “nothing” with regard to temperature. A ratio scale, in contrast, has a true zero that means “nothing”. For example, for the “weight” variable in kilos, zero kilos corresponds to “nothing” with regard to weight. Interval variables may even hold negative values (a “bank account balance” variable, for instance, can go below zero), whereas ratio variables cannot fall below their true zero.

**Research Designs**

Now that we’ve covered the important terminologies and concepts of statistics, we’ll dive into research designs and methodologies employed by statisticians. This section will provide a sneak peek of how statistics is actually leveraged in the real world.

Suppose you’re an online tutor, and your daily business includes sending promotional content for your online courses to your mailing list. You suspect your audience is not opening your mail, let alone reading it, and that the subject line may be the cause. Thus, you’re looking for a new strategy to reduce your audience’s churn rate. To understand your users’ behavior and responses, you could use an **experimental design**. You can prepare two different subject lines for your promotional mail and check which one works better by dividing your mailing list into two groups using random sampling: send a mail with subject A to group A and subject B to group B. Any differences between the groups caused by random assignment are due to pure chance. Suppose you find after the experiment that subject B garnered more customer responses. You can now use subject B for the emails to your entire mailing list to drive your business to success. With an experimental design, researchers can isolate specific **independent variables** that may cause variation in **dependent variables**. Since the customer response depends on the mail subject, the response is the dependent variable, whereas the mail subject is the independent variable.
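The A/B experiment above can be simulated in a few lines. The mailing-list size and the 2% vs. 5% open rates are assumptions made up for the sketch; in a real experiment the responses would come from your email platform rather than a random number generator:

```python
import random

random.seed(1)

# Hypothetical mailing list of 10,000 subscribers, randomly split in two.
mailing_list = list(range(10_000))
random.shuffle(mailing_list)  # random assignment to groups
group_a, group_b = mailing_list[:5000], mailing_list[5000:]

# Simulated responses: assume subject A converts at 2% and subject B at 5%.
opens_a = sum(random.random() < 0.02 for _ in group_a)
opens_b = sum(random.random() < 0.05 for _ in group_b)

rate_a = opens_a / len(group_a)
rate_b = opens_b / len(group_b)

# Subject line = independent variable; open rate = dependent variable.
print(f"subject A open rate: {rate_a:.1%}")
print(f"subject B open rate: {rate_b:.1%}")
```

Because assignment to groups is random, a clearly higher open rate for subject B can be attributed to the subject line rather than to who happened to be in each group.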

There’s yet another type of research methodology in which participants are not divided into groups and researchers do not manipulate anything. They collect data on several variables and then determine how strongly the different variables are related to each other using statistical analyses. This is known as a **correlational research design**. For example, you may be interested in determining whether watching violence on television causes violent behavior in adolescents. Using a correlational research design, you might establish that there is indeed a positive relationship between those two variables. But correlation is not, and cannot be taken to imply, causation: it simply shows a relationship. It could be that watching violence on television causes violent behavior, but the opposite could be concluded as well. It could also be that a third variable, say, growing up in a violent neighborhood or home, causes both the television watching and the violent behavior. Correlational research designs are easier to conduct and allow researchers to examine many variables simultaneously. The primary drawback is that such research does not allow for the careful controls necessary for drawing conclusions about causal associations between variables.

**Summary**

We covered a lot of statistical terminology and research principles in this blog. We saw the differences between a population and a sample, parameters and statistics, and descriptive and inferential statistics. We understood the process of sampling and how a sample can be drawn from a population using three popular methods: random sampling, representative sampling, and convenience sampling.

Variables are of different types and can be measured using varying scales. They are broadly classified as quantitative or qualitative based on the scores that they contain. Finally, we saw through examples how statisticians and researchers use different statistical research designs to conduct real-world analyses. In my next blog post, I’ll dive into our first set of statistics: measures of central tendency. They’re among the most popular and widely used statistics and do a fairly good job of summarizing data. Thanks for reading. Hope it helps.

*Source: Artificial Intelligence on Medium*