Duplicates in R
List, tag, report duplicates in R like STATA
1 Replicating examples from UCLA’s STATA tutorial in R
Citation: HOW CAN I DETECT DUPLICATE OBSERVATIONS? | STATA FAQ. UCLA: Statistical Consulting Group. from https://stats.oarc.ucla.edu/stata/faq/how-can-i-detect-duplicate-observations-3/ (accessed June 15, 2023).
The tutorial on the website used the High School and Beyond dataset. Here are the steps taken to introduce duplicates to the dataset.
1.1 Install mStats
1.2 Downloading the example data
Start with the High School and Beyond dataset, which initially has no duplicate observations.
1.3 Creating duplicates
Append copies of five observations to the dataset. Then change one value in one of the duplicated pairs, so that this pair no longer matches on every variable.
# A tibble: 205 × 6
id female ses read write math
<dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl>
1 1 1 [female] 1 [low] 34 44 84
2 2 1 [female] 2 [middle] 39 41 33
3 3 0 [male] 1 [low] 63 65 48
4 4 1 [female] 1 [low] 44 50 41
5 5 0 [male] 1 [low] 47 40 43
6 1 1 [female] 1 [low] 34 44 40
7 2 1 [female] 2 [middle] 39 41 33
8 3 0 [male] 1 [low] 63 65 48
9 4 1 [female] 1 [low] 44 50 41
10 5 0 [male] 1 [low] 47 40 43
# ℹ 195 more rows
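The step above can be sketched in base R on a small stand-in data frame. This is not the author's original code: `hsb_toy` and `hsb_dup` are hypothetical names, the values are copied from the first five ids in the output above, and the value labels on `female` and `ses` are left out. The changed value is placed in row 1 (math = 84) to match that output, where the copy in row 6 has math = 40.

```r
# Toy stand-in for five observations of the High School and Beyond data
# (hypothetical object; values taken from the printed output above)
hsb_toy <- data.frame(
  id     = 1:5,
  female = c(1, 1, 0, 1, 0),
  ses    = c(1, 2, 1, 1, 1),
  read   = c(34, 39, 63, 44, 47),
  write  = c(44, 41, 65, 50, 40),
  math   = c(40, 33, 48, 41, 43)
)

# Append a copy of every row, so each id now appears twice
hsb_dup <- rbind(hsb_toy, hsb_toy)
rownames(hsb_dup) <- NULL

# Change one value in one member of a duplicated pair
hsb_dup$math[1] <- 84

print(hsb_dup)
```

After this step, ids 2 to 5 each have two identical rows, while the two rows for id 1 differ in `math`.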
1.4 Checking duplicates
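After adding the copies, the dataset has 205 observations: the 200 original rows plus 5 added copies. Because one value was changed, only 4 of the copies are exact duplicates across all variables, which is what the report below shows (197 observations with one copy, 8 observations with two copies, 4 surplus). We can tag and report them with the tag_duplicates() function from the mStats package.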
$ Report of duplicates
in terms of dplyr::everything()
copies observations surplus
1 197 0
2 8 4
# A tibble: 205 × 9
id female ses read write math .n_ .N_ .dup_
<dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <int> <int> <lgl>
1 1 1 [female] 1 [low] 34 44 84 1 1 FALSE
2 2 1 [female] 2 [middle] 39 41 33 1 2 FALSE
3 3 0 [male] 1 [low] 63 65 48 1 2 FALSE
4 4 1 [female] 1 [low] 44 50 41 1 2 FALSE
5 5 0 [male] 1 [low] 47 40 43 1 2 FALSE
6 1 1 [female] 1 [low] 34 44 40 1 1 FALSE
7 2 1 [female] 2 [middle] 39 41 33 2 2 TRUE
8 3 0 [male] 1 [low] 63 65 48 2 2 TRUE
9 4 1 [female] 1 [low] 44 50 41 2 2 TRUE
10 5 0 [male] 1 [low] 47 40 43 2 2 TRUE
# ℹ 195 more rows
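For readers without mStats, the same kind of tagging can be approximated in base R. The sketch below is an assumption about what the three added columns mean, inferred from the output above, and is not the mStats implementation: `n_within`, `n_total` and `is_dup` are hypothetical names standing in for `.n_`, `.N_` and `.dup_`, and `hsb_dup` is a ten-row toy version of the data.

```r
# Toy data: five observations plus their copies, with math changed in row 1
hsb_dup <- data.frame(
  id     = c(1:5, 1:5),
  female = rep(c(1, 1, 0, 1, 0), 2),
  ses    = rep(c(1, 2, 1, 1, 1), 2),
  read   = rep(c(34, 39, 63, 44, 47), 2),
  write  = rep(c(44, 41, 65, 50, 40), 2),
  math   = c(84, 33, 48, 41, 43, 40, 33, 48, 41, 43)
)

# Build one key per row from every column
key <- do.call(paste, c(hsb_dup, sep = "|"))

tagged <- hsb_dup
# Running count of the row within its group of identical rows
tagged$n_within <- ave(seq_along(key), key, FUN = seq_along)
# Total number of identical rows in the group
tagged$n_total  <- ave(seq_along(key), key, FUN = length)
# TRUE for every occurrence after the first
tagged$is_dup   <- duplicated(key)

# Summary in the spirit of the report above
report <- data.frame(
  copies       = as.integer(names(table(tagged$n_total))),
  observations = as.vector(table(tagged$n_total)),
  surplus      = as.vector(tapply(tagged$is_dup, tagged$n_total, sum))
)

print(tagged)
print(report)
```

On the toy data, the two rows for id 1 are not flagged because they differ in `math`, so the report counts 2 observations with one copy and 8 observations with two copies, 4 of which are surplus.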
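Now let’s check for duplicates by id alone. All five added copies are flagged this time, including the one whose math value differs from its original.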
$ Report of duplicates
in terms of id
copies observations surplus
1 195 0
2 10 5
# A tibble: 205 × 9
id female ses read write math .n_ .N_ .dup_
<dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <int> <int> <lgl>
1 1 1 [female] 1 [low] 34 44 84 1 2 FALSE
2 2 1 [female] 2 [middle] 39 41 33 1 2 FALSE
3 3 0 [male] 1 [low] 63 65 48 1 2 FALSE
4 4 1 [female] 1 [low] 44 50 41 1 2 FALSE
5 5 0 [male] 1 [low] 47 40 43 1 2 FALSE
6 1 1 [female] 1 [low] 34 44 40 2 2 TRUE
7 2 1 [female] 2 [middle] 39 41 33 2 2 TRUE
8 3 0 [male] 1 [low] 63 65 48 2 2 TRUE
9 4 1 [female] 1 [low] 44 50 41 2 2 TRUE
10 5 0 [male] 1 [low] 47 40 43 2 2 TRUE
# ℹ 195 more rows
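Checking by a single variable can be sketched the same way in base R, grouping on `id` only. As before, this is an illustration on a hypothetical ten-row toy data frame (`hsb_dup`), with `n_within`, `n_total` and `is_dup` as assumed stand-ins for `.n_`, `.N_` and `.dup_`. It does not use mStats.

```r
# Toy data: five observations plus their copies, with math changed in row 1
hsb_dup <- data.frame(
  id     = c(1:5, 1:5),
  female = rep(c(1, 1, 0, 1, 0), 2),
  ses    = rep(c(1, 2, 1, 1, 1), 2),
  read   = rep(c(34, 39, 63, 44, 47), 2),
  write  = rep(c(44, 41, 65, 50, 40), 2),
  math   = c(84, 33, 48, 41, 43, 40, 33, 48, 41, 43)
)

by_id <- hsb_dup
# Running count of the row within its id
by_id$n_within <- ave(seq_along(by_id$id), by_id$id, FUN = seq_along)
# Number of rows sharing the id
by_id$n_total  <- ave(seq_along(by_id$id), by_id$id, FUN = length)
# TRUE for the second and later rows of each id
by_id$is_dup   <- duplicated(by_id$id)

# List every row whose id occurs more than once, ordered by id
repeated <- by_id[by_id$n_total > 1, ]
print(repeated[order(repeated$id, repeated$n_within), ])

# Number of surplus rows by id
n_surplus <- sum(by_id$is_dup)
print(n_surplus)
```

Grouping on `id` ignores the changed `math` value, so the second row for id 1 is flagged along with the other four copies.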
Photo credit: Photo by Dids from Pexels
Citation
@online{minn_oo2023,
author = {Minn Oo, Myo},
title = {Duplicates in {R}},
date = {2023-06-15},
url = {https://myominnoo.github.io/blog/2023-06-15-duplicates-R/},
langid = {en}
}