List, tag, report duplicates in R like STATA

R
Data Wrangling
mStats
Author

Myo Minn Oo

Published

June 15, 2023

Modified

February 21, 2026

1 Replicating examples from UCLA’s STATA tutorial in R

Citation: HOW CAN I DETECT DUPLICATE OBSERVATIONS? | STATA FAQ. UCLA: Statistical Consulting Group. from https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/ (accessed June 15, 2023).

The UCLA tutorial uses the High School and Beyond dataset. The steps below load that dataset in R and introduce duplicates into it.

1.1 Install mStats

# requires the devtools package: install.packages("devtools")
devtools::install_github("myominnoo/mStats")

1.2 Downloading the example data

Start with the High School and Beyond dataset, which initially has no duplicate observations.

library(tidyverse)
hsb2 <-
  # load the dataset
  haven::read_dta("https://stats.idre.ucla.edu/stat/stata/notes/hsb2.dta") |>
  # select variables of interest
  dplyr::select(id, female, ses, read, write, math) |>
  # sort by id
  dplyr::arrange(id)
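
The claim that the data start without duplicates can be checked with base R. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) in place of hsb2 so that it runs without a download.

```r
# Toy stand-in for hsb2 (hypothetical values, not the real data)
toy <- data.frame(
  id    = c(1, 2, 3),
  read  = c(34, 39, 63),
  write = c(44, 41, 65)
)

# anyDuplicated() returns 0 when nothing repeats an earlier entry,
# otherwise the index of the first repeated entry
dup_rows <- anyDuplicated(toy)     # judged on every column
dup_ids  <- anyDuplicated(toy$id)  # judged on id only

dup_rows
dup_ids
```

Both calls return 0 here. The same two calls can be pointed at `hsb2` and `hsb2$id` once the dataset is loaded.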

1.3 Creating duplicates

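Copy the first five observations and stack them on top of the original dataset, which creates five duplicated ids. Then change the math score in the first row to 84, so that one of the five copies no longer matches its original on every variable.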

hsb2_mod <-
  hsb2 |>
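  # take the first five observations
  dplyr::slice(1:5) |>
  # stack them on top of the full dataset to create duplicates
  dplyr::bind_rows(hsb2) |>
  # change math in row 1 so that this copy differs from its original
  dplyr::mutate(math = ifelse(dplyr::row_number() == 1, 84, math))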
# display the first few rows
hsb2_mod
# A tibble: 205 × 6
      id female     ses         read write  math
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl>
 1     1 1 [female] 1 [low]       34    44    84
 2     2 1 [female] 2 [middle]    39    41    33
 3     3 0 [male]   1 [low]       63    65    48
 4     4 1 [female] 1 [low]       44    50    41
 5     5 0 [male]   1 [low]       47    40    43
 6     1 1 [female] 1 [low]       34    44    40
 7     2 1 [female] 2 [middle]    39    41    33
 8     3 0 [male]   1 [low]       63    65    48
 9     4 1 [female] 1 [low]       44    50    41
10     5 0 [male]   1 [low]       47    40    43
# ℹ 195 more rows
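
To see why changing a single value matters, here is a minimal base R sketch of the same construction on a made-up four-row data frame (`toy` and its values are hypothetical). It copies the first two rows instead of five, then counts repeats by id and by all columns.

```r
# Toy data: four ids with one score each (hypothetical values)
toy <- data.frame(id = 1:4, math = c(40, 33, 48, 41))

# Stack copies of the first two rows on top, then alter math in row 1
toy_mod <- rbind(toy[1:2, ], toy)
toy_mod$math[1] <- 84

# duplicated() flags every entry that repeats an earlier one
n_dup_id  <- sum(duplicated(toy_mod$id))  # repeats judged on id only
n_dup_all <- sum(duplicated(toy_mod))     # repeats judged on every column

n_dup_id   # 2
n_dup_all  # 1
```

Both copied ids count as repeats, but only one copied row is still identical to its original on every column.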

1.4 Checking duplicates

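After this step the dataset has 205 rows. Judged by id alone, 195 ids appear once and 5 ids appear twice, which gives 5 surplus rows. Judged on all variables, only 4 of those pairs are exact duplicates, because math was changed in the first row. The tag_duplicates() function from the mStats package reports these counts and tags each row.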

hsb2_mod |>
  # report duplicates across all variables and tag each row using an mStats function
  dplyr::mutate(mStats::tag_duplicates(dplyr::everything()))
$ Report of duplicates
  in terms of dplyr::everything()
 copies observations surplus
      1          197       0
      2            8       4
# A tibble: 205 × 9
      id female     ses         read write  math   .n_   .N_ .dup_
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
 1     1 1 [female] 1 [low]       34    44    84     1     1 FALSE
 2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
 3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
 4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
 5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
 6     1 1 [female] 1 [low]       34    44    40     1     1 FALSE
 7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
 8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
 9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
# ℹ 195 more rows
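
Judging from the printed output, `.n_` appears to number each row within its group of identical rows, `.N_` appears to hold the size of that group, and `.dup_` appears to flag every copy after the first. That reading is an inference from the output above, not taken from the mStats documentation. A rough dplyr equivalent on made-up data, with hypothetical column names `n_idx`, `n_total` and `is_dup`, might look like this:

```r
library(dplyr)

# Toy data: the (2, 33) row appears twice; id 1 repeats with different math
toy <- tibble(
  id   = c(1, 2, 1, 2, 3),
  math = c(84, 33, 40, 33, 48)
)

tagged <- toy |>
  # group on every column, so only fully identical rows share a group
  group_by(across(everything())) |>
  mutate(
    n_idx   = row_number(),  # position of the row within its group
    n_total = n(),           # number of rows in the group
    is_dup  = n_idx > 1      # TRUE for every copy after the first
  ) |>
  ungroup()

tagged
```

Only the second `(2, 33)` row is flagged. The two rows with id 1 are not, because their math values differ.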

Now check duplicates by id alone. All five copied ids are tagged this time, including id 1, whose two rows differ in math.

hsb2_mod |>
  # check duplicates by id
  dplyr::mutate(mStats::tag_duplicates(id))
$ Report of duplicates
  in terms of id
 copies observations surplus
      1          195       0
      2           10       5
# A tibble: 205 × 9
      id female     ses         read write  math   .n_   .N_ .dup_
   <dbl> <dbl+lbl>  <dbl+lbl>  <dbl> <dbl> <dbl> <int> <int> <lgl>
 1     1 1 [female] 1 [low]       34    44    84     1     2 FALSE
 2     2 1 [female] 2 [middle]    39    41    33     1     2 FALSE
 3     3 0 [male]   1 [low]       63    65    48     1     2 FALSE
 4     4 1 [female] 1 [low]       44    50    41     1     2 FALSE
 5     5 0 [male]   1 [low]       47    40    43     1     2 FALSE
 6     1 1 [female] 1 [low]       34    44    40     2     2 TRUE 
 7     2 1 [female] 2 [middle]    39    41    33     2     2 TRUE 
 8     3 0 [male]   1 [low]       63    65    48     2     2 TRUE 
 9     4 1 [female] 1 [low]       44    50    41     2     2 TRUE 
10     5 0 [male]   1 [low]       47    40    43     2     2 TRUE 
# ℹ 195 more rows
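
For readers who want to see where the report numbers come from, the copies, observations and surplus columns can be rebuilt with plain dplyr, and the duplicated rows can be listed with a grouped filter. This is a sketch on made-up data and is not how mStats computes its report. The names `toy`, `report`, `listing` and `n_ids` are hypothetical.

```r
library(dplyr)

# Toy data: ids 1 and 2 each appear twice, id 3 once (hypothetical values)
toy <- tibble(
  id   = c(1, 2, 1, 2, 3),
  math = c(84, 33, 40, 33, 48)
)

# Report: rows covered by ids that occur once, twice, and so on
report <- toy |>
  count(id, name = "copies") |>         # copies per id
  count(copies, name = "n_ids") |>      # number of ids with that many copies
  mutate(
    observations = copies * n_ids,      # rows belonging to those ids
    surplus      = observations - n_ids # rows beyond the first copy of each id
  )

# Listing: every row whose id occurs more than once
listing <- toy |>
  group_by(id) |>
  filter(n() > 1) |>
  ungroup() |>
  arrange(id)

report
listing
```

In this toy case the report has 1 observation with a single copy, and 4 observations with two copies, of which 2 are surplus. That is the same arithmetic as the 10 observations and 5 surplus rows reported for hsb2_mod by id.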

Photo credit: Photo by Dids from Pexels

Citation

BibTeX citation:
@online{minn_oo2023,
  author = {Minn Oo, Myo},
  title = {Duplicates in {R}},
  date = {2023-06-15},
  url = {https://myominnoo.github.io/blog/2023-06-15-duplicates-R/},
  langid = {en}
}
For attribution, please cite this work as:
Minn Oo, Myo. 2023. “Duplicates in R.” June 15, 2023. https://myominnoo.github.io/blog/2023-06-15-duplicates-R/.