
# Estimating population statistics with Point Estimation

*This article is an extract from the book* **Principles of Data Science**, *written by Sinan Ozdemir. The book is a great way to get into the field of data science. It takes a unique approach that bridges the gap between mathematics and computer science, taking you through the entire data science pipeline.*

In this extract, we’ll learn how to estimate population means, variances and other statistics using the Point Estimation method. For the code samples, we’ve used Python 2.7.

A point estimate is an estimate of a population parameter based on sample data. To obtain these estimates, we simply apply the function that we wish to measure for our population to a sample of the data.

For example, suppose there is a company of 9,000 employees and we are interested in ascertaining the average length of breaks taken by employees in a single day. As we probably cannot ask every single person, we will take a sample of the 9,000 people and take a mean of the sample. This sample mean will be our point estimate.

The following code is broken into three parts:

1. We will use a probability distribution known as the Poisson distribution to randomly generate 9,000 answers to the question: for how many minutes in a day do you usually take breaks? This will represent our “population”.
2. We will take a sample of 100 employees (using NumPy’s `np.random.choice` function) and find a point estimate of the mean (called a sample mean).
3. We will compare our sample mean (the mean of the sample of 100 employees) to our population mean.

Let’s take a look at the following code:

```
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(1234)
long_breaks = stats.poisson.rvs(loc=10, mu=60, size=3000)
# represents 3000 people who take about a 60 minute break
```

The `long_breaks` variable represents 3000 answers to the question: *on average, how many minutes of breaks do you take?*, and these answers will be on the longer side. Let’s see a visualization of this distribution, shown as follows:

`pd.Series(long_breaks).hist()`

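We see that our nominal average of 60 minutes is to the left of the distribution. This is because `loc=10` shifts every value up by 10 minutes, so the distribution is actually centered near 70 minutes. Also, because we only simulated 3000 people, our tallest bins hold around **700-800** people.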

Now, let’s model 6000 people who take, on average, about 15 minutes’ worth of breaks. Let’s again use the Poisson distribution to simulate 6000 people, as shown:

```
short_breaks = stats.poisson.rvs(loc=10, mu=15, size=6000)
# represents 6000 people who take about a 15 minute break
pd.Series(short_breaks).hist()
```

Okay, so we have a distribution for the people who take longer breaks and a distribution for the people who take shorter breaks. Again, note how our nominal average break length of 15 minutes falls to the left-hand side of the distribution: `loc=10` shifts every value up by 10 minutes, so the distribution is actually centered near 25 minutes. Note also that the tallest bar is about 1600 people.

```
breaks = np.concatenate((long_breaks, short_breaks))
# put the two arrays together to get our "population" of 9000 people
```

The breaks variable is the amalgamation of all the 9000 employees, both long and short break takers. Let’s see the entire distribution of people in a single visualization:

`pd.Series(breaks).hist()`

We see how we have two humps. On the left, we have our larger hump of people who take about a 15 minute break, and on the right, we have a smaller hump of people who take longer breaks. Later on, we will investigate this graph further.

We can find the total average break length by running the following code:

```
breaks.mean()
# 39.99 minutes is our parameter
```

Our average company break length is about 40 minutes. Remember that our population is the company’s entire workforce of 9,000 people, and *our parameter is 40 minutes*. In the real world, our goal would be to estimate this population parameter, because we would rarely have the resources to survey every single employee about their average break length. Instead, we will use a point estimate.
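The 40-minute figure can be checked by hand. A Poisson distribution with mean `mu` that is shifted by `loc` has mean `mu + loc`, so the two groups are centered at 70 and 25 minutes, and the population mean is their weighted average. The sketch below is plain arithmetic on the theoretical means, which is why it gives exactly 40 rather than the simulated 39.99. The variable names are ours, chosen for illustration.

```python
# Theoretical group means: Poisson mean (mu) plus the shift (loc)
long_mean = 60 + 10   # 70 minutes
short_mean = 15 + 10  # 25 minutes

# Group sizes used in the simulation
n_long, n_short = 3000, 6000

# Weighted average of the two group means
expected_population_mean = (
    (n_long * long_mean + n_short * short_mean) / float(n_long + n_short)
)
print(expected_population_mean)  # 40.0
```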

So, to make our point, we want to simulate a world where we ask 100 random people about the length of their breaks. To do this, let’s take a random sample of 100 employees out of the 9,000 employees we simulated, as shown:

```
sample_breaks = np.random.choice(a=breaks, size=100)
# taking a sample of 100 employees
# (np.random.choice samples with replacement by default)
```
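One detail worth knowing: `np.random.choice` draws with replacement unless told otherwise, so the same employee could in principle be picked twice. A real survey would normally not ask the same person twice, which corresponds to passing `replace=False`. The sketch below illustrates the difference on a hypothetical array of employee IDs (the `employee_ids` array is our own stand-in, not part of the original example).

```python
import numpy as np

np.random.seed(1234)

# Hypothetical stand-in: one ID per employee in the company
employee_ids = np.arange(9000)

# Default behaviour: sampling with replacement, duplicates are possible
with_replacement = np.random.choice(a=employee_ids, size=100)

# replace=False guarantees 100 distinct employees
without_replacement = np.random.choice(a=employee_ids, size=100, replace=False)

print(len(np.unique(with_replacement)))     # at most 100
print(len(np.unique(without_replacement)))  # always 100
```

With a sample this small relative to the population, the two approaches usually give very similar estimates.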

Now, let’s take the mean of the sample and subtract it from the population mean and see how far off we were:

```
breaks.mean() - sample_breaks.mean()
# difference between means is 4.09 minutes, not bad!
```

This is interesting: with only about 1% of our population (100 out of 9,000), our point estimate landed within about 4 minutes of the population parameter of 40 minutes. Not bad for such a small sample!
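A single sample gives a single point estimate, and a different random sample would give a slightly different one. As a rough sketch of how much the estimate moves around, the code below rebuilds a similar population and repeats the 100-person sample many times. It is an illustration under our own assumptions: it uses NumPy’s `RandomState.poisson` plus a manual shift of 10 instead of `stats.poisson.rvs`, and the choice of 1,000 repetitions is arbitrary, so the numbers will not match the article’s exactly.

```python
import numpy as np

rng = np.random.RandomState(1234)

# Rebuild a comparable population: Poisson draws shifted up by 10 minutes
long_breaks = rng.poisson(lam=60, size=3000) + 10
short_breaks = rng.poisson(lam=15, size=6000) + 10
breaks = np.concatenate((long_breaks, short_breaks))

population_mean = breaks.mean()

# Repeat the 100-person survey many times and record how far off each one is
errors = np.array([
    abs(population_mean - rng.choice(breaks, size=100).mean())
    for _ in range(1000)
])

print("median absolute error: {:.2f}".format(np.median(errors)))
print("largest absolute error: {:.2f}".format(errors.max()))
```

Looking at the spread of these errors gives a feel for how much trust to place in any one point estimate.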

Here, we calculated a point estimate for the mean, but we can also do this for proportion parameters. By proportion, I am referring to a ratio of two quantitative values.

Let’s suppose that in a company of 10,000 people, our employees are 20% white, 10% black, 10% Hispanic, 30% Asian, and 30% identify as other. We will take a sample of 1,000 employees and see if their race proportions are similar.

```
employee_races = (["white"]*2000) + (["black"]*1000) +\
(["hispanic"]*1000) + (["asian"]*3000) +\
(["other"]*3000)
```

employee_races represents our employee population. For example, in our company of 10,000 people, 2,000 people are white (20%) and 3,000 people are Asian (30%).

Let’s take a random sample of 1,000 people, as shown:

```
import random

demo_sample = random.sample(employee_races, 1000) # Sample 1000 values
for race in set(demo_sample):
    print( race + " proportion estimate:" )
    print( demo_sample.count(race)/1000. )
```

The output obtained would be as follows:

```
hispanic proportion estimate:
0.103
white proportion estimate:
0.192
other proportion estimate:
0.288
black proportion estimate:
0.1
asian proportion estimate:
0.317
```

We can see that the race proportion estimates are very close to the underlying population’s proportions. For example, we got 10.3% for Hispanic in our sample and the population proportion for Hispanic was 10%.
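The same recipe, applying the statistic of interest to a sample, works for other statistics as well, such as the variance or the median. The sketch below is one way this could look for the break-length data. It makes a few assumptions of its own: the population is rebuilt with NumPy’s `RandomState.poisson` and a manual shift of 10, and the sample variance uses `ddof=1` (the usual unbiased estimator), which the original example does not discuss.

```python
import numpy as np

rng = np.random.RandomState(1234)

# Rebuild a comparable population of 9,000 break lengths
long_breaks = rng.poisson(lam=60, size=3000) + 10
short_breaks = rng.poisson(lam=15, size=6000) + 10
breaks = np.concatenate((long_breaks, short_breaks))

# Survey 100 employees
sample_breaks = rng.choice(breaks, size=100)

# Variance: population parameter vs. point estimate from the sample
population_variance = breaks.var()           # divides by N
sample_variance = sample_breaks.var(ddof=1)  # divides by n - 1
print("population variance: {:.2f}".format(population_variance))
print("sample variance estimate: {:.2f}".format(sample_variance))

# Median: population parameter vs. point estimate from the sample
population_median = np.median(breaks)
sample_median = np.median(sample_breaks)
print("population median: {:.1f}".format(population_median))
print("sample median estimate: {:.1f}".format(sample_median))
```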

To summarize, you should now be familiar with using point estimation to estimate population statistics such as means and proportions, and with implementing it in Python.

*If you found our post useful, you can check out* **Principles of Data Science** *for more interesting data science tips and techniques.*