Working with Duplicate Observations using `tag

This tutorial provides an overview and detailed usage examples for the tag_duplicates function in the mStats package. It is designed to identify and tag duplicate observations based on specified variables. It mimics the functionality of Stata’s duplicates command in R. It provides a report of duplicates and creates a tibble with three columns: .n_, .N_, and .dup_.

library(mStats)
#> 
#> Attaching package: 'mStats'
#> The following objects are masked from 'package:base':
#> 
#>     append, cut
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Example with a custom dataset
data <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 4, 5),
  y = letters[1:8]
)

# Identify and tag duplicates based on the x variable
data %>% mutate(tag_duplicates(x))
#> $ Report of duplicates
#>   in terms of x
#>  copies observations surplus
#>       1            2       0
#>       2            6       3
#>   x y .n_ .N_ .dup_
#> 1 1 a   1   2 FALSE
#> 2 1 b   2   2  TRUE
#> 3 2 c   1   2 FALSE
#> 4 2 d   2   2  TRUE
#> 5 3 e   1   1 FALSE
#> 6 4 f   1   2 FALSE
#> 7 4 g   2   2  TRUE
#> 8 5 h   1   1 FALSE

# Identify and tag duplicates based on multiple variables
data %>% mutate(tag_duplicates(x, y))
#> $ Report of duplicates
#>   in terms of x
#>  copies observations surplus
#>       1            8       0
#>   x y .n_ .N_ .dup_
#> 1 1 a   1   1 FALSE
#> 2 1 b   1   1 FALSE
#> 3 2 c   1   1 FALSE
#> 4 2 d   1   1 FALSE
#> 5 3 e   1   1 FALSE
#> 6 4 f   1   1 FALSE
#> 7 4 g   1   1 FALSE
#> 8 5 h   1   1 FALSE

# Identify and tag duplicates based on all variables
data %>% mutate(tag_duplicates(everything()))
#> $ Report of duplicates
#>   in terms of all variables
#>  copies observations surplus
#>       1            8       0
#>   x y .n_ .N_ .dup_
#> 1 1 a   1   1 FALSE
#> 2 1 b   1   1 FALSE
#> 3 2 c   1   1 FALSE
#> 4 2 d   1   1 FALSE
#> 5 3 e   1   1 FALSE
#> 6 4 f   1   1 FALSE
#> 7 4 g   1   1 FALSE
#> 8 5 h   1   1 FALSE

In the examples above, we use the tag_duplicates function to identify and tag duplicates in the data dataframe. We can specify one or multiple variables to check for duplicates. By using everything(), we can check for duplicates based on all variables in the dataframe.

The function provides a concise report, showing the number of duplicates and surplus observations for each group of variables. The .dup_ column indicates whether an observation is a duplicate or not.

Report of duplicates

The report of duplciates provides information about duplicate observations in a dataset and how they are interpreted. It consists of three components: copies, observations, and surplus.

Copies: The copies column indicates the number of times a set of observations is duplicated. In other words, it shows the number of identical rows present in the dataset.
Observations: The observations column represents the total count of unique observations in the dataset, including both original and duplicated rows.
Surplus: The surplus column denotes the number of additional observations beyond the first occurrence of each duplicated set. It indicates the number of duplicate rows excluding the original row.

To interpret the duplicates report in the context of the tag_duplicates R function, consider the following:

Copies: If the copies value is 1, it means there are no duplicates for that specific set of variables. If it is greater than 1, it indicates the number of duplicate sets present.
Observations: The observations count includes all unique rows in the dataset, including both original and duplicate rows. This count helps identify the total number of distinct observations.
Surplus: The surplus value represents the number of duplicate rows beyond the first occurrence of each set. These surplus rows need to be further examined or potentially removed.

By analyzing the duplicates report, you can identify which variables or combinations of variables are causing duplicates in the dataset. This information can guide further investigation or data cleaning steps to handle duplicate observations appropriately.

Appended columns: `.n_`, `.N_`, and `.dup_`

When using the tag_duplicates function in R, the function returns three columns: .n_, .N_, and .dup_. Here’s an explanation of how to interpret these columns:

`.n_`

The .n_ column represents the count of each observation in the dataset, including duplicates. It indicates the number of times each row appears in the dataset, regardless of whether it is a duplicate or unique. For unique observations, the value in this column will be 1. If an observation is duplicated multiple times, the value will be greater than 1. Interpretation of .n_:

If .n_ equals 1, it means the observation is unique and does not have any duplicates in the dataset. If .n_ is greater than 1, it indicates that the observation is duplicated and appears multiple times in the dataset.

`.N_`

The .N_ column represents the count of unique observations. It provides the number of unique occurrences for each observation in the dataset, considering both duplicates and unique rows. Each row in the dataset is assigned the count of its unique occurrence. Interpretation of .N_:

The .N_ column helps identify the total number of unique occurrences for each observation in the dataset, regardless of whether it is a duplicate or unique. If .N_ equals 1, it means the observation is unique and has only one occurrence in the dataset. If .N_ is greater than 1, it indicates that the observation has duplicates and appears multiple times in the dataset.

`.dup_`

The .dup_ column is a logical indicator that flags whether an observation is a duplicate or not. It assigns a value of TRUE if the observation is a duplicate and FALSE if it is unique. Interpretation of .dup_:

If .dup_ is TRUE, it means the observation is a duplicate and appears multiple times in the dataset. If .dup_ is FALSE, it indicates that the observation is unique and does not have any duplicates.

By examining these three columns together, you can determine which observations in your dataset are duplicates, how many times they occur, and whether an observation is unique or duplicated. This information can be valuable for further data analysis, quality control, or data cleaning processes.

UCLA STATA duplicates Tutorial in R

Read more about the tutorial here: https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/

hsb2 <- haven::read_dta("https://stats.idre.ucla.edu/stat/stata/notes/hsb2.dta")

hsb2_mod <- hsb2 |> 
    dplyr::select(id, female, ses, read, write, math) |> 
    dplyr::arrange(id)

hsb2_mod2 <- hsb2_mod |> 
    dplyr::filter(dplyr::row_number() <= 5) |> 
    dplyr::bind_rows(hsb2_mod) |> 
    dplyr::arrange(id) |> 
    dplyr::mutate(math = ifelse(dplyr::row_number() == 1, 84, math))

hsb2_mod2 |> 
    dplyr::mutate(tag_duplicates(everything()))
#> $ Report of duplicates
#>   in terms of all variables
#>  copies observations surplus
#>       1          197       0
#>       2            8       4
#> # A tibble: 205 × 9
#>       id female     ses         read write  math   .n_   .N_ .dup_
#>    <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
#>  1     1 1 [female] 1 [low]       34    44    84     1     1 FALSE
#>  2     1 1 [female] 1 [low]       34    44    40     1     1 FALSE
#>  3     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
#>  4     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
#>  5     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
#>  6     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
#>  7     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
#>  8     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
#>  9     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
#> 10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
#> # ℹ 195 more rows

hsb2_mod2 |> 
    dplyr::mutate(tag_duplicates(id))
#> $ Report of duplicates
#>   in terms of id
#>  copies observations surplus
#>       1          195       0
#>       2           10       5
#> # A tibble: 205 × 9
#>       id female     ses         read write  math   .n_   .N_ .dup_
#>    <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
#>  1     1 1 [female] 1 [low]       34    44    84     1     2 FALSE
#>  2     1 1 [female] 1 [low]       34    44    40     2     2 TRUE 
#>  3     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
#>  4     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
#>  5     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
#>  6     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
#>  7     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
#>  8     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
#>  9     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
#> 10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
#> # ℹ 195 more rows

hsb2_mod2 |> 
    dplyr::mutate(tag_duplicates(id, .add_tags = TRUE)) |> 
    # filter duplicate observations 
    dplyr::filter(.dup_)
#> $ Report of duplicates
#>   in terms of id
#>  copies observations surplus
#>       1          195       0
#>       2           10       5
#> # A tibble: 5 × 9
#>      id female     ses         read write  math   .n_   .N_ .dup_
#>   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
#> 1     1 1 [female] 1 [low]       34    44    40     2     2 TRUE 
#> 2     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
#> 3     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
#> 4     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
#> 5     5 0 [male]   1 [low]       47    40    43     2     2 TRUE

`Reference`

StataCorp. (2021). Stata Base Reference Manual: Duplicates. Retrieved from Stata Press: https://www.stata.com/manuals/rbase/duplicates.pdf

You can read more about it from the Stata Base Reference Manual, which contains detailed information about the duplicates command, through the Stata Press website or by searching for “Stata duplicates command” online.

Working with Duplicate Observations using `tag_duplicates`