Working with Duplicate Observations using `tag_duplicates`
tag_duplicates.Rmd
This tutorial provides an overview and detailed usage examples for
the tag_duplicates
function in the mStats
package. It is designed to identify and tag duplicate observations based
on specified variables. It mimics the functionality of
Stata
’s duplicates command in R. It provides a report of
duplicates and creates a tibble with three columns: .n_
,
.N_
, and .dup_
.
library(mStats)
#>
#> Attaching package: 'mStats'
#> The following objects are masked from 'package:base':
#>
#> append, cut
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Example with a custom dataset
data <- data.frame(
x = c(1, 1, 2, 2, 3, 4, 4, 5),
y = letters[1:8]
)
# Identify and tag duplicates based on the x variable
data %>% mutate(tag_duplicates(x))
#> $ Report of duplicates
#> in terms of x
#> copies observations surplus
#> 1 2 0
#> 2 6 3
#> x y .n_ .N_ .dup_
#> 1 1 a 1 2 FALSE
#> 2 1 b 2 2 TRUE
#> 3 2 c 1 2 FALSE
#> 4 2 d 2 2 TRUE
#> 5 3 e 1 1 FALSE
#> 6 4 f 1 2 FALSE
#> 7 4 g 2 2 TRUE
#> 8 5 h 1 1 FALSE
# Identify and tag duplicates based on multiple variables
data %>% mutate(tag_duplicates(x, y))
#> $ Report of duplicates
#> in terms of x
#> copies observations surplus
#> 1 8 0
#> x y .n_ .N_ .dup_
#> 1 1 a 1 1 FALSE
#> 2 1 b 1 1 FALSE
#> 3 2 c 1 1 FALSE
#> 4 2 d 1 1 FALSE
#> 5 3 e 1 1 FALSE
#> 6 4 f 1 1 FALSE
#> 7 4 g 1 1 FALSE
#> 8 5 h 1 1 FALSE
# Identify and tag duplicates based on all variables
data %>% mutate(tag_duplicates(everything()))
#> $ Report of duplicates
#> in terms of all variables
#> copies observations surplus
#> 1 8 0
#> x y .n_ .N_ .dup_
#> 1 1 a 1 1 FALSE
#> 2 1 b 1 1 FALSE
#> 3 2 c 1 1 FALSE
#> 4 2 d 1 1 FALSE
#> 5 3 e 1 1 FALSE
#> 6 4 f 1 1 FALSE
#> 7 4 g 1 1 FALSE
#> 8 5 h 1 1 FALSE
In the examples above, we use the tag_duplicates
function to identify and tag duplicates in the data dataframe. We can
specify one or multiple variables to check for duplicates. By using
everything(), we can check for duplicates based on all variables in the
dataframe.
The function provides a concise report, showing the number of
duplicates and surplus observations for each group of variables. The
.dup_
column indicates whether an observation is a
duplicate or not.
Report of duplicates
The report of duplciates
provides information about
duplicate observations in a dataset and how they are interpreted. It
consists of three components: copies
,
observations
, and surplus.
-
Copies
: The copies column indicates the number of times a set of observations is duplicated. In other words, it shows the number of identical rows present in the dataset. -
Observations
: The observations column represents the total count of unique observations in the dataset, including both original and duplicated rows. -
Surplus
: The surplus column denotes the number of additional observations beyond the first occurrence of each duplicated set. It indicates the number of duplicate rows excluding the original row.
To interpret the duplicates report in the context of the
tag_duplicates
R function, consider the following:
-
Copies
: If the copies value is 1, it means there are no duplicates for that specific set of variables. If it is greater than 1, it indicates the number of duplicate sets present. -
Observations
: The observations count includes all unique rows in the dataset, including both original and duplicate rows. This count helps identify the total number of distinct observations. -
Surplus
: The surplus value represents the number of duplicate rows beyond the first occurrence of each set. These surplus rows need to be further examined or potentially removed.
By analyzing the duplicates report, you can identify which variables or combinations of variables are causing duplicates in the dataset. This information can guide further investigation or data cleaning steps to handle duplicate observations appropriately.
Appended columns: .n_
, .N_
, and
.dup_
When using the tag_duplicates function in R, the function returns
three columns: .n_
, .N_
, and
.dup_
. Here’s an explanation of how to interpret these
columns:
.n_
The .n_
column represents the count of each observation
in the dataset, including duplicates. It indicates the number of times
each row appears in the dataset, regardless of whether it is a duplicate
or unique. For unique observations, the value in this column will be 1.
If an observation is duplicated multiple times, the value will be
greater than 1. Interpretation of .n_
:
If .n_
equals 1, it means the observation is unique and
does not have any duplicates in the dataset. If .n_
is
greater than 1, it indicates that the observation is duplicated and
appears multiple times in the dataset.
.N_
The .N_
column represents the count of unique
observations. It provides the number of unique occurrences for each
observation in the dataset, considering both duplicates and unique rows.
Each row in the dataset is assigned the count of its unique occurrence.
Interpretation of .N_
:
The .N_
column helps identify the total number of unique
occurrences for each observation in the dataset, regardless of whether
it is a duplicate or unique. If .N_
equals 1, it means the
observation is unique and has only one occurrence in the dataset. If
.N_
is greater than 1, it indicates that the observation
has duplicates and appears multiple times in the dataset.
.dup_
The .dup_
column is a logical indicator that flags
whether an observation is a duplicate or not. It assigns a value of
TRUE
if the observation is a duplicate and FALSE if it is
unique. Interpretation of .dup_
:
If .dup_
is TRUE
, it means the observation
is a duplicate and appears multiple times in the dataset. If
.dup_
is FALSE
, it indicates that the
observation is unique and does not have any duplicates.
By examining these three columns together, you can determine which observations in your dataset are duplicates, how many times they occur, and whether an observation is unique or duplicated. This information can be valuable for further data analysis, quality control, or data cleaning processes.
UCLA STATA duplicates Tutorial in R
Read more about the tutorial here: https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/
hsb2 <- haven::read_dta("https://stats.idre.ucla.edu/stat/stata/notes/hsb2.dta")
hsb2_mod <- hsb2 |>
dplyr::select(id, female, ses, read, write, math) |>
dplyr::arrange(id)
hsb2_mod2 <- hsb2_mod |>
dplyr::filter(dplyr::row_number() <= 5) |>
dplyr::bind_rows(hsb2_mod) |>
dplyr::arrange(id) |>
dplyr::mutate(math = ifelse(dplyr::row_number() == 1, 84, math))
hsb2_mod2 |>
dplyr::mutate(tag_duplicates(everything()))
#> $ Report of duplicates
#> in terms of all variables
#> copies observations surplus
#> 1 197 0
#> 2 8 4
#> # A tibble: 205 × 9
#> id female ses read write math .n_ .N_ .dup_
#> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <int> <int> <lgl>
#> 1 1 1 [female] 1 [low] 34 44 84 1 1 FALSE
#> 2 1 1 [female] 1 [low] 34 44 40 1 1 FALSE
#> 3 2 1 [female] 2 [middle] 39 41 33 1 2 FALSE
#> 4 2 1 [female] 2 [middle] 39 41 33 2 2 TRUE
#> 5 3 0 [male] 1 [low] 63 65 48 1 2 FALSE
#> 6 3 0 [male] 1 [low] 63 65 48 2 2 TRUE
#> 7 4 1 [female] 1 [low] 44 50 41 1 2 FALSE
#> 8 4 1 [female] 1 [low] 44 50 41 2 2 TRUE
#> 9 5 0 [male] 1 [low] 47 40 43 1 2 FALSE
#> 10 5 0 [male] 1 [low] 47 40 43 2 2 TRUE
#> # ℹ 195 more rows
hsb2_mod2 |>
dplyr::mutate(tag_duplicates(id))
#> $ Report of duplicates
#> in terms of id
#> copies observations surplus
#> 1 195 0
#> 2 10 5
#> # A tibble: 205 × 9
#> id female ses read write math .n_ .N_ .dup_
#> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <int> <int> <lgl>
#> 1 1 1 [female] 1 [low] 34 44 84 1 2 FALSE
#> 2 1 1 [female] 1 [low] 34 44 40 2 2 TRUE
#> 3 2 1 [female] 2 [middle] 39 41 33 1 2 FALSE
#> 4 2 1 [female] 2 [middle] 39 41 33 2 2 TRUE
#> 5 3 0 [male] 1 [low] 63 65 48 1 2 FALSE
#> 6 3 0 [male] 1 [low] 63 65 48 2 2 TRUE
#> 7 4 1 [female] 1 [low] 44 50 41 1 2 FALSE
#> 8 4 1 [female] 1 [low] 44 50 41 2 2 TRUE
#> 9 5 0 [male] 1 [low] 47 40 43 1 2 FALSE
#> 10 5 0 [male] 1 [low] 47 40 43 2 2 TRUE
#> # ℹ 195 more rows
hsb2_mod2 |>
dplyr::mutate(tag_duplicates(id, .add_tags = TRUE)) |>
# filter duplicate observations
dplyr::filter(.dup_)
#> $ Report of duplicates
#> in terms of id
#> copies observations surplus
#> 1 195 0
#> 2 10 5
#> # A tibble: 5 × 9
#> id female ses read write math .n_ .N_ .dup_
#> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <int> <int> <lgl>
#> 1 1 1 [female] 1 [low] 34 44 40 2 2 TRUE
#> 2 2 1 [female] 2 [middle] 39 41 33 2 2 TRUE
#> 3 3 0 [male] 1 [low] 63 65 48 2 2 TRUE
#> 4 4 1 [female] 1 [low] 44 50 41 2 2 TRUE
#> 5 5 0 [male] 1 [low] 47 40 43 2 2 TRUE
Reference
StataCorp. (2021). Stata Base Reference Manual: Duplicates. Retrieved from Stata Press: https://www.stata.com/manuals/rbase/duplicates.pdf
You can read more about it from the Stata Base Reference Manual, which contains detailed information about the duplicates command, through the Stata Press website or by searching for “Stata duplicates command” online.