Outliers are data points that lie abnormally far from the rest of the values in a data set. Discover several methods that you, as a statistician or data analyst, might use to help determine whether a certain value is an outlier.
![[Featured Image] A smiling data scientist analyzes data featuring outliers at their desk and has graphs displayed on their computer monitors.](https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/https://images.ctfassets.net/wp1lcwdav1p1/4Y7SQ7LAmQcoP16uW3YFlo/c8feb1d385318b6ef637bb2a230eb6ba/GettyImages-1751833779.jpg?w=1500&h=680&q=60&fit=fill&f=faces&fm=jpg&fl=progressive&auto=format%2Ccompress&dpr=1&w=1000)
Outliers differ sharply from other data points. They can distort analysis or uncover rare, valuable insights when handled thoughtfully.
A Z-score measures how many standard deviations a data point lies from the mean. For normally distributed data, you can typically consider any value with a Z-score above three or below negative three an outlier.
Effective outlier detection requires choosing the right method, which depends on your data’s shape, source, and purpose.
You can detect and manage outliers to improve data accuracy, uncover rare insights, and reduce bias in your analysis.
Explore what outliers are, the role they play in data analytics, methods that you can use to define outliers, and how to deal with outliers once you identify them.
Outliers are data points that lie outside the majority of the data in a particular data set. These values might be much higher or lower than other points and may impact the results of the data analysis in ways that misrepresent the data sample. By learning how to identify and handle outliers, data analysts can increase the likelihood that their results are valid and reliable.
Outliers play an important role in data analytics, and that role varies with where the outlier comes from and how it affects the analysis. For example, in some fields, outliers may provide insight into rare occurrences, indicating the need for further analysis. In the health care industry, an outlier data point may represent someone with an abnormal set of symptoms or recovery pattern. This could indicate that you should explore further, such as looking at patients with similar characteristics to see potential outcomes.
In other cases, outliers may represent sources of error. Measurement inaccuracies, typos, or other factors may introduce noise into the data set that does not represent the actual data. The presence of outliers in data sets may also signal low data quality and introduce bias into your analysis. If there were systematic errors during data collection, you would have to make an informed decision on how best to proceed.
You can find outliers in data through several detection methods. Which ones you choose depends on your role and the purpose of outlier detection. Some of the methods you can choose include:
When you sort your data into ascending or descending order, it may become apparent that certain data points are much higher or lower than others. For example, consider the data set:
1, 1, 3, 4, 5, 5, 102
You would likely determine that 102 is an outlier. You would then examine that data point more closely to identify its source.
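As a minimal sketch of the sorting approach, the Python snippet below sorts the example values and reports the largest gap between neighboring values. The helper name `largest_gap` is hypothetical, and a large gap is only a prompt to look closer, not proof of an outlier.

```python
def largest_gap(values):
    """Sort the values and return them with the widest gap between neighbors."""
    ordered = sorted(values)
    # Each entry is (gap size, lower neighbor, upper neighbor).
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    return ordered, max(gaps)

# The example data set from the text, deliberately shuffled.
data = [5, 1, 102, 4, 1, 3, 5]
ordered, (gap, low, high) = largest_gap(data)

print(ordered)
print(f"Largest gap: {gap} (between {low} and {high})")
```

Here the widest gap sits between 5 and 102, which points you to 102 as the value to examine.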
Another way to determine whether you have outliers in your data set is to visualize the data. You can do this by graphing your data set. You can choose any graphical representation that suits you, but scatter plots and histograms are two common choices to identify outliers.
Histograms display data in “bins” that represent segments of the data. Each bin represents how many data points are at a specific value or fall within a range of values. This can show you when a data point is far out of range. For example, if you have tall bins between the values of 10 and 30 and then a short bin at a value of 200, you might look more closely at the 200 value.
Scatter plots plot values on a standard graph with an x-axis and a y-axis. Most of the points typically group into a cluster, which makes outliers stand out. If one point lies far from the rest of the cluster, it may be an outlier.
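In practice you would likely draw a histogram with a plotting library. As a rough sketch of the binning idea only, the Python snippet below counts values per bin and prints a text histogram. The data values and the helper name `bin_counts` are hypothetical, chosen to mirror the example of tall bins between 10 and 30 and a lone value at 200.

```python
from collections import Counter

def bin_counts(values, width):
    """Count how many values fall in each bin [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical integer data: most values between 10 and 30, one value at 200.
data = [12, 15, 18, 21, 22, 24, 25, 27, 28, 29, 200]
histogram = bin_counts(data, width=10)

# Empty bins are skipped, so the jump from the 20s to 200 is easy to spot.
for start, count in histogram.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

The short, isolated bin starting at 200 is the one you would inspect more closely.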
Assessing the interquartile range (IQR) of a data set is another way to detect outliers. You calculate the IQR by subtracting the first quartile (Q1) value from the third quartile (Q3) value. You can visualize this through boxplots, which you draw by creating a box along a y-axis. The bottom of the box is the value of the first quartile, and the top of the box is the value of the third quartile of the data.
In the data set, 25 percent of the values fall below the first quartile (Q1), and 75 percent fall below the third quartile (Q3), so the IQR spans the middle 50 percent of the data. Outliers are often defined as values that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).
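The IQR rule can be sketched in a few lines of Python using the standard library's `statistics.quantiles`. The helper name `iqr_outliers` is hypothetical, and quartile conventions differ between tools, so the exact fences may vary slightly depending on the method you use. This sketch relies on the function's default "exclusive" method.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], plus the two fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

data = [1, 1, 3, 4, 5, 5, 102]
outliers, (lower, upper) = iqr_outliers(data)

print(f"Fences: {lower} to {upper}")
print(f"Outliers: {outliers}")
```

For this data set, Q1 is 1 and Q3 is 5, so the IQR is 4 and the fences are -5 and 11. Only 102 falls outside them.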
For data that follows a normal distribution, Z-scores can be one way to find how far away a data point is from the mean of the data set. A normal distribution indicates that the data follows a bell-shaped curve. The Z-score is the number of standard deviations (a measure of spread) a point lies away from the mean. In most cases, a Z-score above three or below negative three indicates an outlier. Before choosing this method as your form of outlier detection, it’s important that you test to ensure that your data follows a normal distribution. When your data follows a normal distribution, about 68 percent of the data points will lie within 1 standard deviation of the mean, about 95 percent will lie within 2 standard deviations, and about 99.7 percent will lie within 3.
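As an illustration of the Z-score approach, the sketch below computes Z-scores with Python's `statistics` module and flags values more than three standard deviations from the mean. The exam scores and the helper name `z_scores` are hypothetical, and the sketch assumes the sample standard deviation is the appropriate one for your data.

```python
import statistics

def z_scores(values):
    """Return each value's distance from the mean in sample standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

# Hypothetical exam scores: twenty values near 50 and one value of 100.
scores = [48, 49, 50, 51, 52] * 4 + [100]

# Flag any score whose Z-score is above 3 or below -3.
flagged = [v for v, z in zip(scores, z_scores(scores)) if abs(z) > 3]
print(flagged)
```

The sketch uses a larger data set on purpose. With the sample standard deviation, no value in a set of n points can have a Z-score beyond (n - 1) / sqrt(n), so a seven-value set such as 1, 1, 3, 4, 5, 5, 102 can never reach a Z-score of three, even though 102 is clearly unusual.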
A student who scores 100 percent on an exam, while the majority of their peers receive scores between 50 percent and 70 percent, is a basic example of an outlier. This 100 percent result stands well beyond the typical range and may reflect exceptional ability, a scoring error, or a different test version. Similar to the wide performance gaps seen in national assessments, such outliers can reveal rare insights or inconsistencies, helping analysts refine their methods and improve the accuracy of their interpretations.
After you identify outliers in your data set, the next step will be to determine how best to deal with these outliers. To do this, you can consider several options:
Remove or correct outliers: If you find that the outliers are from measurement errors, you may benefit from removing them from the data set or correcting them if possible. However, you should do this carefully to prevent bias or sample misrepresentation.
Apply data transformations: Logarithmic, square root, or inverse transformations can help reduce outliers' influence on the analysis. Transformations such as these often stabilize the variance of the data and make the data more suitable for certain statistical tests.
Use robust statistical methods: Using methods for your analysis that are less sensitive to outliers, like choosing the median of your data set instead of the mean, can lead to more reliable results without the need to remove outliers.
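To make these options concrete, here is a small Python sketch that reuses the earlier data set (1, 1, 3, 4, 5, 5, 102). It compares the mean and median with and without the value 102, then applies a base-10 logarithm. Treating 102 as the value to remove is an assumption made only for illustration, and the logarithm requires every value to be positive.

```python
import math
import statistics

data = [1, 1, 3, 4, 5, 5, 102]

# Robust statistics: the median barely moves, while the mean is pulled upward.
mean_with = statistics.mean(data)
median_with = statistics.median(data)

# Removal: drop the suspected outlier (assumed here to be a measurement error).
trimmed = [v for v in data if v != 102]
mean_without = statistics.mean(trimmed)
median_without = statistics.median(trimmed)

# Transformation: a log scale compresses the distance to the extreme value.
logged = [math.log10(v) for v in data]

print(f"Mean with / without 102:   {mean_with:.2f} / {mean_without:.2f}")
print(f"Median with / without 102: {median_with} / {median_without}")
print(f"Largest value on log scale: {max(logged):.2f}")
```

The median changes only from 4 to 3.5 when 102 is removed, while the mean drops from about 17.29 to about 3.17, which shows why the median is considered less sensitive to outliers.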
Editorial Team