Numeric vectors with tags

library(shrthnd)

The development of the {shrthnd} package is heavily influenced by experience of working with statistical datasets published by governments and international bodies, especially departments and agencies in the UK producing outputs as part of the UK statistical system.

While data is increasingly released in machine-readable formats or through APIs, a large number of data products continue to be released in spreadsheets, and historical data from these institutions is often only available in that form. Beyond layout issues, such as the use of header and footer rows to communicate related information, it is not uncommon to encounter columns in these spreadsheets that contain a mix of numeric and non-numeric content. This non-numeric content may be the only content of a cell (to explain why there is no numeric value) or may sit alongside a numeric value (to qualify, caveat or otherwise explain something about the value).

The most common approach in data processing when encountering these sorts of issues is simply to scrub the vector of the non-numeric components and coerce it into a numeric vector. However, these tags often convey useful information which you may wish to retain.

Introducing shrthnd_num()

The shrthnd_num() data type builds on vctrs::new_rcrd() to split numeric and non-numeric components while keeping them attached to each other. In practice a shrthnd_num() can be thought of as a numeric() and character() vector that have been coupled together. Specifically it has a num component representing the numeric value and a tag component representing the non-numeric shorthand, symbol or marker.

Let us create the vector x with seven values:

x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")
x
#> [1] "12"        "34.567"    "[c]"       "NA"        "56.78[e]"  "78.9"     
#> [7] "90.123[e]"

The first, second and sixth values in this vector are purely numeric (12, 34.567 and 78.9). The third value is a shorthand symbol ("[c]") denoting that the value has been suppressed because it is confidential. The fourth value is a missing value ("NA"). The fifth and seventh values contain both numeric information (56.78 and 90.123 respectively) and shorthand ("[e]") denoting that these values are estimated. Depending on what processing we wish to do with this vector in the future, it might be useful to know that a value has been suppressed or estimated.

Using as.numeric() on this vector will result in every value that contains any non-numeric element being converted to a missing value, causing us to lose all the information in the third, fifth and seventh values of the vector.

as.numeric(x)
#> Warning: NAs introduced by coercion
#> [1] 12.000 34.567     NA     NA     NA 78.900     NA

We could scrub the non-numeric elements of the vector, but we still lose the information provided by the shorthand.

as.numeric(gsub("[^0-9.]", "", c(x)))
#> [1] 12.000 34.567     NA     NA 56.780 78.900 90.123

The shrthnd_num() function, however, allows us to retain both sets of information, and we can easily coerce a shrthnd_num() vector into a regular base R numeric() vector. We can also easily access the shorthand or symbol tags with the shrthnd_tags() function.

sh_x <- shrthnd_num(x)

sh_x
#> <shrthnd_num[7]>
#> [1] 12.00     34.57        NA [c]    NA     56.78 [e] 78.90     90.12 [e]

as.numeric(sh_x)
#> [1] 12.000 34.567     NA     NA 56.780 78.900 90.123

shrthnd_tags(sh_x)
#> [1] NA    NA    "[c]" NA    "[e]" NA    "[e]"
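For intuition, the two components can be approximated by hand in base R. This is only a rough sketch and not how {shrthnd} is implemented: the regular expressions and the names is_missing, num_part and tag_part are assumptions made for illustration.

```r
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")

# Strings treated as missing rather than as shorthand (here "" and "NA")
is_missing <- x %in% c("", "NA")

# Numeric part: keep only digits and the decimal point, then convert
num_part <- as.numeric(gsub("[^0-9.]", "", x))

# Tag part: remove digits, decimal points and whitespace;
# blank results mean "no tag" and become NA
tag_part <- gsub("[0-9.[:space:]]", "", x)
tag_part[is_missing | tag_part == ""] <- NA_character_

# The two components side by side
data.frame(num = num_part, tag = tag_part)
```

A naive pattern like this would also drop minus signs and mishandle numbers in parentheses, which is part of the case for a dedicated constructor.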

The shrthnd_list() function provides a summary of the tags contained in a shrthnd_num() vector, their frequency and positions in the vector.

shrthnd_list(sh_x)
#> <shrthnd_list[2]>
#> [c] (1 location): 3 
#> [e] (2 locations): 5, 7
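The same kind of summary can be sketched in base R from a plain character vector of tags. The vector tags below is a hand-written stand-in for shrthnd_tags(sh_x), and the approach with split() is an illustration, not necessarily how shrthnd_list() works internally.

```r
# Stand-in for shrthnd_tags(sh_x)
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

# Group the positions by tag; split() drops the NA (untagged) entries
positions <- split(seq_along(tags), tags)

# Number of locations per tag
counts <- lengths(positions)

positions
counts
```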

We saw above how as.numeric() converts a shrthnd_num() to a numeric vector. Similarly, as.character() converts a shrthnd_num() to a character vector as if it were a numeric vector, so the tags are dropped. To produce a character vector that combines the numeric and non-numeric components, use as_shrthnd() instead.

as.character(sh_x)
#> [1] "12"     "34.567" NA       NA       "56.78"  "78.9"   "90.123"

as_shrthnd(sh_x)
#> [1] "12.00"     "34.57"     "NA [c]"    "NA"        "56.78 [e]" "78.90"    
#> [7] "90.12 [e]"

Making a shrthnd_num

You can make a shrthnd_num() vector in two ways: using shrthnd_num() to convert a character vector containing numeric and non-numeric components, or make_shrthnd_num() to merge a vector of numbers and a vector of character strings.

Conversion to a shrthnd_num

You convert a character vector containing shorthand using shrthnd_num(). Besides the character vector itself, you can supply further arguments to control the behaviour of the conversion and the resulting display of the vector.

shrthnd_num(
  x,
  shorthand = NULL,
  na_values = c("", "NA"),
  digits = 2L,
  paren_nums = c("negative", "strip"),
  dec = ".",
  bigmark = ","
)
#> <shrthnd_num[7]>
#> [1] 12.00     34.57        NA [c]    NA     56.78 [e] 78.90     90.12 [e]

The shorthand argument allows you to pass a character vector of shorthand, symbols and markers that you want to validate against, i.e. you can cause the conversion to throw an error if it detects shorthand that is not in this vector.

The na_values argument is used to determine values that should be ignored when identifying shorthand tags and converted to missing values when extracting the numeric component.

The digits, dec and bigmark arguments are passed on to formatC() to format the numeric component when the vector is formatted or printed.
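As a rough illustration of what these formatting arguments control, the base R call below formats a single number with formatC(). The mapping of digits, dec and bigmark onto formatC()'s digits, decimal.mark and big.mark arguments, and the use of format = "f", are assumptions about how the package calls formatC(), not something stated above.

```r
# Assumed mapping: digits -> digits, dec -> decimal.mark, bigmark -> big.mark
formatted <- formatC(1234.5678, digits = 2L, format = "f",
                     decimal.mark = ".", big.mark = ",")
formatted
#> [1] "1,234.57"

# Swapping the marks gives a continental European style of display
formatC(1234.5678, digits = 2L, format = "f",
        decimal.mark = ",", big.mark = ".")
```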

The paren_nums argument determines how to handle numbers in parentheses, i.e. whether to treat a number in parentheses as a negative number (as is common in accounting formats, and the default setting) or whether to just strip the parentheses from the number before its conversion.

The coercion to a numeric() vector is handled by utils::type.convert().
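The two paren_nums options and the final conversion step can be illustrated with a small base R sketch. The regular expressions and the variable names are assumptions made for illustration, not the package's internals; only the use of utils::type.convert() comes from the description above.

```r
raw <- c("(123.4)", "56", "(7)")

# Detect values wrapped in parentheses, then remove the parentheses
in_paren <- grepl("^\\(.*\\)$", raw)
bare <- gsub("[()]", "", raw)

# "negative": a parenthesised number becomes a negative number
as_negative <- utils::type.convert(
  ifelse(in_paren, paste0("-", bare), bare),
  as.is = TRUE
)

# "strip": the parentheses are dropped and the sign is left alone
as_stripped <- utils::type.convert(bare, as.is = TRUE)

as_negative
#> [1] -123.4   56.0   -7.0
as_stripped
#> [1] 123.4  56.0   7.0
```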

Making a shrthnd_num from scratch

You can use make_shrthnd_num() to create a shrthnd_num() from a numeric() vector and a character() vector of the same length.

make_shrthnd_num(c(1:3, NA, 4:5, NA), c("", "", "", "[c]", "", "[e]", NA))
#> <shrthnd_num[7]>
#> [1]  1      2      3     NA [c]  4      5 [e] NA

Coercion, maths and statistics

Generally a shrthnd_num() should behave like a numeric() vector. For example, using is.na() will return TRUE where the numeric value is missing and FALSE where it is not. Similarly, if you use c() to combine a shrthnd_num() with another vector, the shrthnd_num() is first coerced to a numeric vector so that R can proceed from there.

is.na(sh_x)
#> [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE

c(sh_x, 1)
#> [1] 12.000 34.567     NA     NA 56.780 78.900 90.123  1.000

c(sh_x, "c")
#> [1] "12"     "34.567" NA       NA       "56.78"  "78.9"   "90.123" "c"

However, in keeping with base R practice around complex numeric objects such as Date(), difftime() and POSIXct(), using is.numeric() on a shrthnd_num() vector will return FALSE. Use is_shrthnd_num() to test if a vector is a shrthnd_num() vector. {shrthnd} also includes an is_numeric() function that allows you to test for vectors that are either standard numeric vectors or a coercible shrthnd_num() vector, for example if you want to apply a function across a range of columns in a dplyr::mutate() call.

is.numeric(sh_x)
#> [1] FALSE

is_shrthnd_num(sh_x)
#> [1] TRUE

is_numeric(x)
#> [1] FALSE

Through vctrs::vec_arith() and vctrs::vec_math() there is generalised support for arithmetic and mathematical operations on a shrthnd_num() vector. Bespoke methods have also been added for some functions which are not directly supported, such as median() and quantile(), so that they can easily work with the numeric components of the shrthnd_num() vector.

x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")
sh_x <- shrthnd_num(x, c("[c]", "[e]"))

sh_x * 2
#> [1]  24.000  69.134      NA      NA 113.560 157.800 180.246

2 + sh_x
#> [1] 14.000 36.567     NA     NA 58.780 80.900 92.123

sum(sh_x, na.rm = TRUE)
#> [1] 272.37

range(sh_x, na.rm = TRUE)
#> [1] 12.000 90.123

mean(sh_x, na.rm = TRUE)
#> [1] 54.474
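The median() and quantile() methods mentioned above are not shown in the examples. Assuming they operate on the numeric component in the same way as sum() and mean(), their results should match the base functions applied to the equivalent plain numeric vector. The vector num below is a hand-written stand-in for as.numeric(sh_x).

```r
# Stand-in for as.numeric(sh_x)
num <- c(12, 34.567, NA, NA, 56.78, 78.9, 90.123)

# Median of the five non-missing values
med <- median(num, na.rm = TRUE)
med
#> [1] 56.78

# Lower and upper quartiles (default type 7): 34.567 and 78.9
qs <- quantile(num, probs = c(0.25, 0.75), na.rm = TRUE)
qs
```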

Working with shrthnd tags

The shrthnd_tags() function allows us to access the tag components of a shrthnd_num(). The related function shrthnd_unique_tags() returns the unique tags, and is simply a convenience function in place of unique(shrthnd_tags(x)).

shrthnd_tags(sh_x)
#> [1] NA    NA    "[c]" NA    "[e]" NA    "[e]"

shrthnd_unique_tags(sh_x)
#> [1] "[c]" "[e]"

The base R functions for value matching work with the numeric component of a shrthnd_num() vector. Separate tag locator functions are provided to support matching on the tag components of a shrthnd_num() vector.

tag_match() returns an integer vector showing the first location of the tag provided while tag_in() will return TRUE or FALSE depending on whether the tag is in the vector’s shorthand.

tag_match(sh_x, "[e]")
#> [1] 5

tag_in(sh_x, "[e]")
#> [1] TRUE
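Judging by their names and output, these two functions behave like base R's match() and %in% applied to the tags rather than to the numeric values. That reading is an assumption; the sketch below only shows the base R analogue on a hand-written tags vector standing in for shrthnd_tags(sh_x).

```r
# Stand-in for shrthnd_tags(sh_x)
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

# First position at which the tag occurs
first_pos <- match("[e]", tags)
first_pos
#> [1] 5

# Whether the tag occurs anywhere in the vector
is_present <- "[e]" %in% tags
is_present
#> [1] TRUE
```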

To locate where a specific tag is used in a vector, use where_tag(), which is equivalent to computing tags == tag. To identify whether a value has any tag, whatever that tag is, use any_tag(), which is equivalent to !is.na(tags).

where_tag(sh_x, "[e]")
#> [1]    NA    NA FALSE    NA  TRUE    NA  TRUE

any_tag(sh_x)
#> [1] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
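The stated equivalences can be checked in base R on a plain character vector. Here tags is a hand-written stand-in for shrthnd_tags(sh_x); note how the comparison propagates NA for untagged values, which explains the NA entries in the where_tag() output.

```r
# Stand-in for shrthnd_tags(sh_x)
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

# Equivalent of where_tag(sh_x, "[e]"): NA where there is no tag at all
is_e <- tags == "[e]"
is_e
#> [1]    NA    NA FALSE    NA  TRUE    NA  TRUE

# Equivalent of any_tag(sh_x): TRUE wherever some tag is present
has_tag <- !is.na(tags)
has_tag
#> [1] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
```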

Using is.na() on a shrthnd_num() will assess whether the numeric component is missing. To identify whether tags are missing, use is_na_tag(), which is equivalent to is.na(tags). To identify whether both the numeric and tag components are missing, use is_na_both().

is_na_tag(sh_x)
#> [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE

is_na_both(sh_x)
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
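In base R terms, the check for both components being missing amounts to combining the two missingness tests with a logical AND. The vectors num and tags below are hand-written stand-ins for as.numeric(sh_x) and shrthnd_tags(sh_x); the combination shown is an illustration that reproduces the output above, not the package's source.

```r
# Stand-ins for the two components of sh_x
num  <- c(12, 34.567, NA, NA, 56.78, 78.9, 90.123)
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

# Missing in both components: only the fourth value qualifies,
# because the third value is missing numerically but carries "[c]"
both_missing <- is.na(num) & is.na(tags)
both_missing
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
```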

Finally, you can locate the positions of a specific tag, of tagged values or of untagged values using a set of locate_*() functions. These are convenience functions that wrap the corresponding logical-vector functions in which().

locate_tag(sh_x, "[e]")
#> [1] 5 7

locate_any_tag(sh_x)
#> [1] 3 5 7

locate_no_tag(sh_x)
#> [1] 1 2 4 6
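The which() wrapping described above can be reproduced in base R on a plain character vector. Here tags is a hand-written stand-in for shrthnd_tags(sh_x); which() silently drops NA, so untagged values do not appear when locating a specific tag.

```r
# Stand-in for shrthnd_tags(sh_x)
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

# Positions of a specific tag
pos_e <- which(tags == "[e]")
pos_e
#> [1] 5 7

# Positions of any tagged value
pos_any <- which(!is.na(tags))
pos_any
#> [1] 3 5 7

# Positions of untagged values
pos_none <- which(is.na(tags))
pos_none
#> [1] 1 2 4 6
```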