Tag Duplicates
tag_duplicates.Rd
Arguments
- ...
Columns to use for identifying duplicates.
- .add_tags
Logical; whether to return three indicator columns: .n_, .N_, and .dup_.
Value
A tibble with three columns: .n_, .N_, and .dup_.
.n_ represents the running counter within each group of variables, indicating the number of the current observation.
.N_ represents the total number of observations within each group of variables.
.dup_ is a logical column indicating whether the observation is a duplicate (TRUE) or not (FALSE).
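The three columns can be approximated with plain dplyr verbs. The following is a minimal sketch, not the package's implementation: it assumes that .dup_ flags every copy after the first within a group, which is consistent with the example output on this page. The object name tagged is illustrative only.

```r
library(dplyr)

data <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 4, 5),
  y = letters[1:8]
)

# Assumption: a row counts as a duplicate when it is not the first
# observation of its group.
tagged <- data %>%
  group_by(x) %>%
  mutate(
    .n_   = row_number(),  # running counter within the group
    .N_   = n(),           # total number of observations in the group
    .dup_ = .n_ > 1        # TRUE for every copy after the first
  ) %>%
  ungroup()

tagged
```

Grouping by more columns (for example group_by(x, y)) makes every row unique in this dataset, so .N_ is 1 and .dup_ is FALSE throughout.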
Details
This function identifies and tags duplicate observations based on the specified variables, mimicking the functionality of Stata's duplicates command in R.
It counts the duplicates and prints a report of duplicates based on the specified variables. The function uses the n_ and N_ functions for counting and grouping the observations.
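The report of duplicates (copies, observations, surplus) can be reproduced by hand to check what the numbers mean. This is a hedged sketch using dplyr only; it does not call n_ or N_, and the intermediate column name groups and the object name report are made up for illustration. Here "surplus" is taken to mean observations beyond the first copy in each group, as in Stata's duplicates report.

```r
library(dplyr)

data <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 4, 5),
  y = letters[1:8]
)

report <- data %>%
  count(x, name = "copies") %>%        # how many copies each value of x has
  count(copies, name = "groups") %>%   # how many values of x share that copy count
  mutate(
    observations = copies * groups,    # rows covered by this copy count
    surplus      = observations - groups  # rows beyond the first copy
  ) %>%
  select(copies, observations, surplus)

report
```

For this dataset the result has two rows, matching the report printed in the first example: two observations are unique, and six observations belong to groups of two, three of which are surplus.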
See also
Other Data Management: append(), codebook(), count_functions, cut()
Examples
library(dplyr)
# Example with a custom dataset
data <- data.frame(
x = c(1, 1, 2, 2, 3, 4, 4, 5),
y = letters[1:8]
)
# Identify and tag duplicates based on the "x" variable
data %>% mutate(tag_duplicates(x))
#> $ Report of duplicates
#> in terms of x
#> copies observations surplus
#> 1 2 0
#> 2 6 3
#> x y .n_ .N_ .dup_
#> 1 1 a 1 2 FALSE
#> 2 1 b 2 2 TRUE
#> 3 2 c 1 2 FALSE
#> 4 2 d 2 2 TRUE
#> 5 3 e 1 1 FALSE
#> 6 4 f 1 2 FALSE
#> 7 4 g 2 2 TRUE
#> 8 5 h 1 1 FALSE
# Identify and tag duplicates based on multiple variables
data %>% mutate(tag_duplicates(x, y))
#> $ Report of duplicates
#> in terms of x
#> copies observations surplus
#> 1 8 0
#> x y .n_ .N_ .dup_
#> 1 1 a 1 1 FALSE
#> 2 1 b 1 1 FALSE
#> 3 2 c 1 1 FALSE
#> 4 2 d 1 1 FALSE
#> 5 3 e 1 1 FALSE
#> 6 4 f 1 1 FALSE
#> 7 4 g 1 1 FALSE
#> 8 5 h 1 1 FALSE
# Identify and tag duplicates based on all variables
data %>% mutate(tag_duplicates(everything()))
#> $ Report of duplicates
#> in terms of all variables
#> copies observations surplus
#> 1 8 0
#> x y .n_ .N_ .dup_
#> 1 1 a 1 1 FALSE
#> 2 1 b 1 1 FALSE
#> 3 2 c 1 1 FALSE
#> 4 2 d 1 1 FALSE
#> 5 3 e 1 1 FALSE
#> 6 4 f 1 1 FALSE
#> 7 4 g 1 1 FALSE
#> 8 5 h 1 1 FALSE
if (FALSE) {
## Stata example
dupxmpl <- haven::read_dta("https://www.stata-press.com/data/r18/dupxmpl.dta")
dupxmpl |> mutate(tag_duplicates(everything()))
}