[Stable]

Usage

tag_duplicates(..., .add_tags = TRUE)

Arguments

...

Columns to use for identifying duplicates.

.add_tags

Logical. If TRUE (the default), the three indicator columns .n_, .N_, and .dup_ are returned.

Value

A tibble with three columns: .n_, .N_, and .dup_.

  • .n_ is a running counter within each group defined by the specified variables; it gives the position of the current observation within its group.

  • .N_ is the total number of observations in that group.

  • .dup_ is a logical column indicating whether the observation is a duplicate (TRUE) or not (FALSE). In the examples, the first observation in each group is FALSE and every later copy is TRUE.
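The semantics of the three columns can be approximated with plain dplyr verbs. The following is a minimal sketch of the behaviour described above, not the package's actual implementation; the data frame `df` is hypothetical, and the rule `.dup_ = .n_ > 1` is an assumption inferred from the example output.

```r
library(dplyr)

# Hypothetical data, not part of the package
df <- data.frame(x = c(1, 1, 2, 3, 3, 3))

tagged <- df %>%
  group_by(x) %>%
  mutate(
    .n_   = row_number(),  # running counter within each group
    .N_   = n(),           # total number of observations in the group
    .dup_ = .n_ > 1        # assumption: every copy after the first is a duplicate
  ) %>%
  ungroup()

tagged
```

Under this sketch, the first row of each group keeps `.dup_ = FALSE`, so filtering on `!.dup_` would leave one row per distinct value of `x`.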

Details

This function identifies and tags duplicate observations based on specified variables.

It mimics Stata's duplicates command: it counts the duplicates and prints a report of duplicates in terms of the specified variables. Counting and grouping rely on the n_ and N_ functions.
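The copies/observations/surplus layout of the report can be reproduced by hand, which may help when reading it. The sketch below uses only dplyr and is an illustration of how such a table could be derived, not the function's internal code; the names `df`, `groups`, and `report` are hypothetical.

```r
library(dplyr)

# Hypothetical data with the same x values as the first example
df <- data.frame(x = c(1, 1, 2, 2, 3, 4, 4, 5))

report <- df %>%
  count(x, name = "copies") %>%        # how many times each value of x occurs
  count(copies, name = "groups") %>%   # how many distinct values share that count
  mutate(
    observations = copies * groups,    # rows belonging to groups of this size
    surplus      = observations - groups  # rows beyond the first in each group
  ) %>%
  select(copies, observations, surplus)

report
```

Here surplus is the number of rows that would have to be dropped to leave one observation per group.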

See also

Other Data Management: append(), codebook(), count_functions, cut()

Examples


library(dplyr)

# Example with a custom dataset
data <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 4, 5),
  y = letters[1:8]
)

# Identify and tag duplicates based on the "x" variable
data %>% mutate(tag_duplicates(x))
#> $ Report of duplicates
#>   in terms of x
#>  copies observations surplus
#>       1            2       0
#>       2            6       3
#>   x y .n_ .N_ .dup_
#> 1 1 a   1   2 FALSE
#> 2 1 b   2   2  TRUE
#> 3 2 c   1   2 FALSE
#> 4 2 d   2   2  TRUE
#> 5 3 e   1   1 FALSE
#> 6 4 f   1   2 FALSE
#> 7 4 g   2   2  TRUE
#> 8 5 h   1   1 FALSE

# Identify and tag duplicates based on multiple variables
data %>% mutate(tag_duplicates(x, y))
#> $ Report of duplicates
#>   in terms of x
#>  copies observations surplus
#>       1            8       0
#>   x y .n_ .N_ .dup_
#> 1 1 a   1   1 FALSE
#> 2 1 b   1   1 FALSE
#> 3 2 c   1   1 FALSE
#> 4 2 d   1   1 FALSE
#> 5 3 e   1   1 FALSE
#> 6 4 f   1   1 FALSE
#> 7 4 g   1   1 FALSE
#> 8 5 h   1   1 FALSE

# Identify and tag duplicates based on all variables
data %>% mutate(tag_duplicates(everything()))
#> $ Report of duplicates
#>   in terms of all variables
#>  copies observations surplus
#>       1            8       0
#>   x y .n_ .N_ .dup_
#> 1 1 a   1   1 FALSE
#> 2 1 b   1   1 FALSE
#> 3 2 c   1   1 FALSE
#> 4 2 d   1   1 FALSE
#> 5 3 e   1   1 FALSE
#> 6 4 f   1   1 FALSE
#> 7 4 g   1   1 FALSE
#> 8 5 h   1   1 FALSE

if (FALSE) {
## Stata example (not run: downloads a dataset from stata-press.com)
dupxmpl <- haven::read_dta("https://www.stata-press.com/data/r18/dupxmpl.dta")
dupxmpl |> mutate(tag_duplicates(everything()))
}