
## A review of fundamental research principles, terminologies, and language of statistics.

In my previous blog post, I introduced the world of statistics in the simplest language possible. I explained what statistics is, why it matters, and where it is used, with examples. We also saw why anecdotal evidence is a poor basis for conclusions about a population. The previous post is not a prerequisite, but check it out if you haven’t already. In this post, I will unravel the research principles, language, and terminology of statistics. So, without further ado, let’s get started.

#### Population and Sample

You’re finally ready to get your hands dirty in statistics, so you decide to find out the average household income of middle-class families in your country. The data you collect would have to include each and every middle-class family in the country, with the middle-class category properly defined. It would contain families of different ethnicities, races, religions, creeds, and so on. Because this data contains every member of the group you’re interested in, it is considered a population. Thus, a population is an individual or group that represents all the members of a certain group of interest. The number of members in a population could be as small as 1 or run into the billions.

A sample is when you select a few of the middle-class families to infer (conclude) the average household income of the population based on the assumption that your sample will generalize well to the population. Thus, a sample is a subset of a larger population. Statistics is the study of sample data, its analysis, representation, and interpretation.

Why would you want to work with a sample instead of a population? Because it is often not possible to collect data on every member of a population. A sample, however, is much easier to reach, and studying it takes less time and money. We prefer drawing samples over analyzing an entire population because of these constraints on time and resources.

#### Parameters and Statistics

The difference between parameters and statistics might be new for many as it was for me. If you calculate the average household income of your population data, what you generate is known as a parameter. A parameter is a value generated from, or applied to, a population.

If you calculate the same value from your sample data, you get a statistic. Statistics are values derived from sample data. Thus, when we apply a function such as the mean, the result is a parameter if the data is a population and a statistic if the data is a sample.
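To make the distinction concrete, here is a minimal sketch in Python. The income figures are randomly generated stand-ins (not real data), and the population size of 10,000 and sample size of 200 are assumptions chosen purely for illustration.

```python
import random
import statistics

# Hypothetical population: yearly household incomes (made-up numbers).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(20_000, 80_000) for _ in range(10_000)]

# Parameter: a value computed from every member of the population.
population_mean = statistics.mean(population)

# Statistic: the same calculation applied to a sample drawn from it.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers will usually be close but not identical. That gap is exactly why we need to keep the two terms apart.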

#### Descriptive and Inferential Statistics

Suppose we collected sample data on the yearly income of data scientists in the USA. When we’re interested only in describing the characteristics of our data (sample or population), we resort to descriptive statistics. Descriptive statistics apply only to the members of the sample or population from which the data were collected. We can calculate the mean or median salary of a data scientist. We can measure how much the salaries vary and perhaps generate a distribution plot. For the most part, our concern is the data in hand.
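As a quick sketch of what describing the data in hand looks like, the snippet below uses Python's built-in `statistics` module. The five salaries are made-up values for illustration, not real survey results.

```python
import statistics

# Hypothetical yearly salaries in USD (made-up values).
salaries = [95_000, 105_000, 110_000, 120_000, 150_000]

mean_salary = statistics.mean(salaries)      # arithmetic average
median_salary = statistics.median(salaries)  # middle value once sorted
spread = statistics.stdev(salaries)          # sample standard deviation

print(f"mean:   {mean_salary}")
print(f"median: {median_salary}")
print(f"stdev:  {spread:.2f}")
```

Notice that the single large salary pulls the mean above the median. Descriptive statistics summarize these five values only; they say nothing yet about data scientists in general.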

Can we generalize the result of our sample to the larger population? Yes! That’s where inferential statistics comes into the picture. Inferential statistics refers to the use of sample data to reach some conclusion (i.e., make some inference) about the characteristics of the larger population that the sample is supposed to represent. To make the leap from sample data to inferences about a population, one must be very clear about whether the sample accurately represents the population. Thus, an important first step is to clearly define the population that the sample is alleged to represent.

#### Sampling Methods

Sampling is the process of drawing out a sample from its population. We have a number of ways at our disposal to select samples. I’ll intuitively explain three of the most popular methods.

Scenario #1: I want to conduct a survey to determine how satisfied the students of a particular university are. I walk into the university and find an overwhelming number of students all around the campus. Together with the absent students and those enrolled in distance learning, they make up my population. There’s no way I can reach all of them, so I get the complete enrollment list and let a random draw pick 120 students to survey. This is an example of random sampling. In this process, every member of a population has an equal chance of being selected into the sample. (Simply surveying whoever happens to be in the cafeteria would not qualify, because students who never eat there would have no chance of being picked.) A major benefit of random sampling is that any differences between the sample and the population will be due to chance rather than to systematic bias in the selection process. A larger random sample tends to represent the population better.
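Here is a minimal sketch of simple random sampling in Python. It assumes we have a complete roster of the population to draw from; the 5,000 student IDs are hypothetical.

```python
import random

# Hypothetical roster: one ID per enrolled student (made-up size).
roster = list(range(1, 5001))

rng = random.Random(0)  # fixed seed so the draw is repeatable

# Draw 120 students without replacement; every ID on the roster
# has the same chance of ending up in the sample.
sample = rng.sample(roster, 120)

print(sample[:10])
```

The key requirement is the roster itself: the draw is only "random" with respect to whoever is on the list.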

Scenario #2: Suppose 35% of the middle-class families in India (my population) earn their yearly income through a business. I would then try to make 35% of the families in my sample business owners as well. Similarly, if 10% of the families in the population have more than 6 members, 10% of the families in my sample should have more than 6 members. This is known as representative sampling: I purposely select cases so that my sample matches the larger population on specific characteristics. It is costly and time-consuming, but it increases my chances of being able to generalize the results from my sample to the population.
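One possible way to sketch this idea in code is to split the population into groups on the characteristic of interest and draw from each group in proportion to its size. The helper `representative_sample` and the 1,000 made-up families below are my own illustration, not a standard library function or real data.

```python
import random

# Hypothetical population of 1,000 families; 35% run a business.
population = [{"id": i, "business": i < 350} for i in range(1000)]

def representative_sample(families, size, key, rng):
    """Draw a sample whose share of each `key` group matches the population."""
    groups = {}
    for family in families:
        groups.setdefault(family[key], []).append(family)
    sample = []
    for members in groups.values():
        # Proportional quota; rounding can shift the total by a unit
        # or two when the shares do not divide evenly.
        quota = round(size * len(members) / len(families))
        sample.extend(rng.sample(members, quota))
    return sample

rng = random.Random(1)
sample = representative_sample(population, 100, "business", rng)
share = sum(family["business"] for family in sample) / len(sample)
print(f"business owners in sample: {share:.0%}")
```

Matching on two characteristics at once (say, business ownership and family size) would mean grouping on both, which is where the extra cost and effort come from.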

Scenario #3: I want to do a statistical study on the fitness level of 10th-grade students, so I select a sample of 200 students from the high school nearest to my residence. This method of selecting samples is called convenience sampling. In this method, the researcher generally selects participants on the basis of proximity, ease of access, and willingness to participate. It is not at all a bad method, provided my sample does not differ from my population of interest in ways that influence the outcome of the study. It is definitely more convenient and less time-consuming.

#### Types of Variables

Given data on the 10th-grade students of a country, a student’s age, height, weight, gender, attitude about school, etc. are all known as variables. Anything that can be codified and takes more than a single score (value) is a variable. A constant, in contrast, takes only a single score. For example, “age” will be treated as a constant if it is 15 years for all the students in our sample data. A variable can be quantitative (continuous) or qualitative (categorical). For example, “height”, “weight” and “age” are quantitative variables, whereas “gender” and “attitude about school” are qualitative variables. A quantitative variable indicates some sort of amount, so you can perform mathematical operations on it, such as taking the mean. A qualitative variable, on the other hand, does not indicate more or less of a certain quality. For example, the “gender” variable in our data contains two scores: “male” and “female”. One score is not more or less than the other, so arithmetic such as averaging is not meaningful for it; you can only count how often each score occurs. The scores represent only a qualitative difference.

A dichotomous variable is a qualitative variable with only two different scores (e.g., “gender” variable).
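The difference shows up in what you can sensibly compute. In the sketch below (three made-up student records), the quantitative variable is averaged, while the qualitative one is only counted.

```python
from collections import Counter
import statistics

# Hypothetical student records (made-up values).
students = [
    {"height_cm": 160, "gender": "female"},
    {"height_cm": 172, "gender": "male"},
    {"height_cm": 166, "gender": "female"},
]

# Quantitative variable: arithmetic such as the mean is meaningful.
mean_height = statistics.mean(s["height_cm"] for s in students)

# Qualitative variable: we count how often each score occurs instead.
gender_counts = Counter(s["gender"] for s in students)

# A dichotomous variable has exactly two distinct scores.
is_dichotomous = len(gender_counts) == 2

print(mean_height, dict(gender_counts), is_dichotomous)
```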

#### Scales of Measurement

There are four different scales of measurement for variables in statistics. A nominally scaled variable is one in which the labels used to identify the different levels of the variable carry no weight or numeric value. From our previous example of sample data, “gender” is a nominal variable. Even if its scores, “male” and “female”, are encoded as 0 and 1 for analysis in statistical software, a value of 1 does not indicate a higher score than a value of 0. They are simply labels assigned to each group.

Suppose I collected fan ratings of Avengers: Endgame on a scale of 1 to 10, where 1 represents completely dissatisfied and 10 represents completely satisfied. A score of 9 tells me that a fan enjoyed the movie more than someone who gave it a 3. The scores do have weight, in that they can be ranked. They just don’t tell me the measurable difference in satisfaction between, say, a rating of 9 and a rating of 10. Such variables are known as ordinal variables. This type of variable fails to answer the question ‘how much greater or less is one score than another (in terms of a measurable quantity)?’.

The third and fourth scales of measurement are the interval and ratio scales. They contain information about both relative value and distance. Take “height”, for example. If one member of my sample is 170 cm tall, another is 173 cm tall, and a third is 166 cm tall, I know who is tallest and how much taller or shorter each member of my sample is in relation to the others. Whenever a variable is measured on a scale of equal intervals, it falls into one of these two groups. The difference between interval and ratio scales lies in how they treat a zero value. On an interval scale, zero does not mean “nothing”. For example, the “year” variable may contain a year zero, which still has a meaning with regard to time. The same goes for temperature in degrees Celsius: zero degrees is not “nothing” with regard to temperature. On a ratio scale, zero does mean “nothing”. For example, for the “weight” variable in kilos, zero kilos corresponds to “nothing” with regard to weight. Ratio variables may even hold negative values, as with a “bank account balance” variable.
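A practical consequence of the zero point is that differences are meaningful on both scales, but ratios ("twice as much") only make sense on a ratio scale. The sketch below illustrates this with temperature. The Kelvin comparison is my own addition: Kelvin has a true zero, so it serves as the ratio-scale counterpart to Celsius.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0  # two made-up temperatures in degrees Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree on both scales: equal intervals carry real meaning.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios disagree: 20 °C is not "twice as hot" as 10 °C.
ratio_c = b_c / a_c  # 2.0, an artifact of where Celsius puts its zero
ratio_k = b_k / a_k  # about 1.035, the physically meaningful ratio

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```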

#### Research Designs

Now that we’ve covered the important terminologies and concepts of statistics, we’ll dive into research designs and methodologies employed by statisticians. This section will provide a sneak peek of how statistics is actually leveraged in the real world.