The development of the {shrthnd} package is heavily influenced by experience of working with statistical datasets published by governments and international bodies, especially departments and agencies in the UK producing outputs as part of the UK statistical system.
While data is increasingly released in machine-readable formats or through APIs, a large number of data products continue to be released as spreadsheets, and historical data from these institutions is often only available in that form. Beyond layout issues, such as the use of header and footer rows to communicate related information, it is not uncommon to encounter columns in these spreadsheets that contain a mix of numeric and non-numeric content. This non-numeric content may be the only content of a cell (to explain why there is no numeric value) or may sit alongside a numeric value (to qualify, caveat or otherwise explain something about the value).
The most common approach in data processing when encountering these sorts of issues is simply to scrub the vector of its non-numeric components and coerce it into a numeric vector. However, these non-numeric tags often convey useful information which you may wish to retain.
shrthnd_num()
The shrthnd_num() data type builds on vctrs::new_rcrd() to split numeric and non-numeric components while keeping them attached to each other. In practice a shrthnd_num() can be thought of as a numeric() and character() vector that have been coupled together. Specifically, it has a num component representing the numeric value and a tag component representing the non-numeric shorthand, symbol or marker.
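To make the idea of two coupled vectors concrete, here is a minimal base R sketch. It assumes that a tag is anything that is not a digit, a decimal point or a minus sign. The helper name split_num_tag() is hypothetical and this is not how {shrthnd} is implemented; it only illustrates a numeric part and a tag part that stay aligned by position.

```r
# Hypothetical helper (not part of {shrthnd}): split each string into a
# numeric part and a tag part, kept side by side in a data frame.
split_num_tag <- function(x) {
  num <- as.numeric(gsub("[^0-9.-]", "", x))  # keep digits, "." and "-"
  tag <- gsub("[0-9.-]", "", x)               # keep everything else
  tag[!nzchar(tag)] <- NA_character_          # no tag becomes a missing tag
  data.frame(num = num, tag = tag, stringsAsFactors = FALSE)
}

parts <- split_num_tag(c("1.5", "2[e]", "[c]"))
parts$num  # the numeric component, NA where there was no number
parts$tag  # the shorthand component, NA where there was no tag
```

A record built with vctrs::new_rcrd() plays a similar role to the data frame here, but presents the paired fields as a single vector.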
Let us create the vector x with seven values:
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")
x
#> [1] "12" "34.567" "[c]" "NA" "56.78[e]" "78.9"
#> [7] "90.123[e]"
The first, second and sixth values in this vector are purely numeric (12, 34.567 and 78.9). The third value is a shorthand symbol ("[c]") denoting that the value has been suppressed because it is confidential. The fourth value is a missing value ("NA"). The fifth and seventh values contain both numeric information (56.78 and 90.123 respectively) and shorthand ("[e]") denoting that these values are estimated. Depending on what processing we wish to do with this vector in the future, it might be useful to know that a value has been suppressed or estimated.
Using as.numeric() on this vector converts every value containing a non-numeric element to a missing value, so we lose all the information in the third, fifth and seventh values of the vector.
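A short base R illustration of this loss, using the same seven values. The object name naive is only for illustration, and the coercion warning that as.numeric() raises is suppressed to keep the output short.

```r
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")

# Any string that is not a clean number becomes NA, including the two
# estimated values that do carry a number.
naive <- suppressWarnings(as.numeric(x))
naive
#> [1] 12.000 34.567     NA     NA     NA 78.900     NA
```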
We could scrub the non-numeric elements of the vector, but we still lose the information provided by the shorthand.
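One way to do such scrubbing in base R is sketched below. It assumes every tag is a run of lower-case letters in square brackets, which holds for this example but may not hold for real data. The object name scrubbed is only for illustration.

```r
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")

# Remove bracketed tags, then convert what is left to numeric.
scrubbed <- as.numeric(gsub("\\[[a-z]+\\]", "", x))
scrubbed
#> [1] 12.000 34.567     NA     NA 56.780 78.900 90.123
```

The numbers survive, but nothing in scrubbed records which values were estimated or why the third value is missing.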
The shrthnd_num() function, however, allows us to retain both sets of information, and we can easily coerce a shrthnd_num() vector into a regular base R numeric() vector. We can also easily access the shorthand or symbol tags with the shrthnd_tags() function.
sh_x <- shrthnd_num(x)
sh_x
#> <shrthnd_num[7]>
#> [1] 12.00 34.57 NA [c] NA 56.78 [e] 78.90 90.12 [e]
as.numeric(sh_x)
#> [1] 12.000 34.567 NA NA 56.780 78.900 90.123
shrthnd_tags(sh_x)
#> [1] NA NA "[c]" NA "[e]" NA "[e]"
The shrthnd_list() function provides a summary of the tags contained in a shrthnd_num() vector, their frequency and positions in the vector.
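The kind of information such a summary contains can be sketched in base R from the tag vector returned by shrthnd_tags() above. This is not the output format of shrthnd_list(), only an illustration of frequencies and positions; the object names are hypothetical.

```r
# The tags extracted earlier, one per element of the original vector
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

# How often each tag occurs (missing tags are ignored by default)
tag_counts <- table(tags)

# Where each tag occurs in the vector
tag_positions <- split(seq_along(tags), tags)

tag_counts
tag_positions
```

Here "[c]" occurs once, at position 3, and "[e]" occurs twice, at positions 5 and 7.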
We saw above how as.numeric() converts a shrthnd_num() to a numeric vector. Similarly, as.character() converts a shrthnd_num() to a character vector as if it were a numeric vector, that is, using only the numeric component. To instead produce a character vector that combines the numeric and non-numeric components, use as_shrthnd().
You can make a shrthnd_num() vector in two ways: using shrthnd_num() to convert a character vector containing numeric and non-numeric components, or make_shrthnd_num() to merge a vector of numbers and a vector of character strings.
You convert a character vector containing shorthand using shrthnd_num(). In addition to the character vector you can also supply additional arguments to control the behaviour of the conversion and the resulting display of the vector.
shrthnd_num(
  x,
  shorthand = NULL,
  na_values = c("", "NA"),
  digits = 2L,
  paren_nums = c("negative", "strip"),
  dec = ".",
  bigmark = ","
)
#> <shrthnd_num[7]>
#> [1] 12.00 34.57 NA [c] NA 56.78 [e] 78.90 90.12 [e]
The shorthand argument allows you to pass a character vector of shorthand, symbols and markers that you want to validate against, i.e. you can cause the conversion to throw an error if it detects shorthand that is not in this vector.
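The validation idea can be sketched in base R as follows. The helper check_tags() and its error message are hypothetical; the wording of the error raised by shrthnd_num() itself may differ.

```r
# Hypothetical helper (not part of {shrthnd}): stop if any tag is not in
# the allowed set; missing tags are ignored.
check_tags <- function(tags, shorthand) {
  unknown <- setdiff(tags[!is.na(tags)], shorthand)
  if (length(unknown) > 0) {
    stop("unrecognised shorthand: ", paste(unknown, collapse = ", "))
  }
  invisible(tags)
}

tags <- c(NA, "[c]", "[e]", "[x]")

# Passes silently when every tag is allowed
check_tags(tags, c("[c]", "[e]", "[x]"))

# Fails when "[x]" is not in the allowed set; capture the message
result <- tryCatch(
  check_tags(tags, c("[c]", "[e]")),
  error = function(e) conditionMessage(e)
)
result
```

Validating against a known set is a cheap way to catch typos or newly introduced markers when a publisher updates a dataset.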
The na_values argument is used to determine values that should be ignored when identifying shorthand tags and converted to missing values when extracting the numeric component.
The digits, dec and bigmark arguments are passed on to formatC() to format the numeric component when the vector is formatted and printed.
The paren_nums argument determines how to handle numbers in parentheses, i.e. whether to treat a number in parentheses as a negative number (as is commonly used in accounting formats, and the default setting) or whether to just strip the parentheses from the number before its conversion.
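The two behaviours can be sketched in base R. The helper parse_paren() is hypothetical and only mirrors the two option names described above; it is not the package's implementation.

```r
# Hypothetical helper (not part of {shrthnd}): convert strings to numbers,
# treating "(n)" either as -n or as n.
parse_paren <- function(x, paren_nums = c("negative", "strip")) {
  paren_nums <- match.arg(paren_nums)
  in_parens <- grepl("^\\(.*\\)$", x)   # whole value wrapped in parentheses
  stripped <- gsub("[()]", "", x)       # drop the parentheses themselves
  if (paren_nums == "negative") {
    stripped[in_parens] <- paste0("-", stripped[in_parens])
  }
  as.numeric(stripped)
}

acc <- c("100", "(25.5)", "3")
neg <- parse_paren(acc, "negative")  # 100, -25.5, 3
pos <- parse_paren(acc, "strip")     # 100, 25.5, 3
```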
The coercion to a numeric() vector is handled by utils::type.convert().
Generally a shrthnd_num() should behave like a numeric() vector. For example, is.na() will return TRUE where the numeric value is missing and FALSE where it is not. Similarly, if you use c() to combine a shrthnd_num() with another vector, the shrthnd_num() is first coerced to numeric so that R can proceed from there.
is.na(sh_x)
#> [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE
c(sh_x, 1)
#> [1] 12.000 34.567 NA NA 56.780 78.900 90.123 1.000
c(sh_x, "c")
#> [1] "12" "34.567" NA NA "56.78" "78.9" "90.123" "c"
However, in keeping with base R practice around complex numeric objects such as Date(), difftime() and POSIXct(), using is.numeric() on a shrthnd_num() vector will return FALSE. Use is_shrthnd_num() to test if a vector is a shrthnd_num() vector. {shrthnd} also includes an is_numeric() function that allows you to test for vectors that are either standard numeric vectors or a coercible shrthnd_num() vector, for example if you want to apply a function across a range of columns in a dplyr::mutate() call.
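The base R precedent mentioned here is easy to check. The snippet below uses only base R and shows that date, date-time and time-difference objects are stored as numbers but are deliberately not reported as numeric.

```r
# Classed objects built on numbers are not reported as numeric
is.numeric(Sys.Date())                       # FALSE
is.numeric(Sys.time())                       # FALSE
is.numeric(as.difftime(1, units = "hours"))  # FALSE

# The underlying storage is numeric once the class is removed
is.numeric(unclass(Sys.Date()))              # TRUE
```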
Through vctrs::vec_arith() and vctrs::vec_math() there is generalised support for arithmetic and mathematical operations on a shrthnd_num() vector. Bespoke methods have also been added for some functions which are not directly supported, such as median() and quantile(), so that they can easily work with the numeric components of the shrthnd_num() vector.
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")
sh_x <- shrthnd_num(x, c("[c]", "[e]"))
sh_x * 2
#> [1] 24.000 69.134 NA NA 113.560 157.800 180.246
2 + sh_x
#> [1] 14.000 36.567 NA NA 58.780 80.900 92.123
sum(sh_x, na.rm = TRUE)
#> [1] 272.37
range(sh_x, na.rm = TRUE)
#> [1] 12.000 90.123
mean(sh_x, na.rm = TRUE)
#> [1] 54.474