The aim of {tidyods}
is to extract information about
cells in OpenDocument Spreadsheet (ODS) files into a “tidy” format. In
addition to the value contained by a cell it extracts a range of
addition metadata about the cells.
This vignette explains how to use
tidyods::read_ods_cells()
and the types of information
about cells it extracts. The Spreadsheet
Munging Strategies book by Duncan Garmonsway explores in more
detail how you can use datasets of “tidy cells” to work with non-tabular
data stored in (Excel) spreadsheets.
You read in data from an ODS file with the
read_ods_cells()
function. The resulting tibble provides 28
columns of information about cells, with each row representing an
individual cell in the input spreadsheet.
example_file <- system.file("extdata", "basic_example.ods", package = "tidyods")
types_cells <- read_ods_cells(example_file, 2)
dplyr::glimpse(types_cells)
#> Rows: 160
#> Columns: 28
#> $ sheet <chr> "types", "types", "types", "types", "types", "t…
#> $ address <chr> "A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1",…
#> $ row <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,…
#> $ col <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6…
#> $ cell_type <chr> "cell", "cell", "cell", "cell", "cell", "cell",…
#> $ is_empty <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ value_type <chr> "string", "string", "string", "string", "string…
#> $ cell_content <chr> "String", "Boolean", "Currency", "Date", "Time"…
#> $ base_value <chr> "String", "Boolean", "Currency", "Date", "Time"…
#> $ numeric_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ currency_symbol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ boolean_value <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, TRU…
#> $ date_value <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ time_value <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_formula <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ formula <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_error <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ error_type <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_annotation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ annotation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ is_merged <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ merge_colspan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_rowspan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_shape <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ cell_style <chr> NA, "Default", "Default", "Default", "Default",…
#> $ row_style <chr> "ro1", "ro1", "ro1", "ro1", "ro1", "ro1", "ro1"…
#> $ col_style <chr> "co6", "co7", "co8", "co9", "co10", "co11", "co…
#> $ col_default_cell_style <chr> "Default", "ce56", "ce4", "ce11", "ce21", "ce31…
There are four arguments to read_ods_cells()
:
path
, the path to the ODS filesheet
, the sheet you wish to extractquick
, whether to extract a subset of cell
informationquiet
, whether to show console messages or notThe sheet
argument can either be the index number of a
sheet (ranging from 1 to the total number of sheets in the spreadsheet
document) or the name of a sheet (e.g. "types"
instead of
2
in the example above).
The quick
argument extracts only a small number of
columns and is useful if you are largely concerned with cell values; see
the quick extraction section for more
details.
The quiet
argument turns off console messages, ODS files
are written in XML which can take significant time to process and so
progress messages are provided to the console for interactive users. It
is advised to set this to false if you are doing batch operations.
There are 28 types of information returned by
read_ods_cells()
, these can be grouped into three
categories: location information, value information, and other
metadata.
The first four columns of information relate to a cell’s location in the spreadsheet.
sheet
, the name of the sheet the cell belongs toaddress
, the address of the cell within a sheet using
“A1” notationrow
, the row number of the cellcol
, the column number of the cellThe address
column is derived from the row
and col
using the {cellranger}
package. It
provides the address in the “A1” notation commonly used in spreadsheet
applications where the letter corresponds to a column and the number
corresponds to a row. The “A1” notation is used by ODS files for
calculation formulas.
There are 10 columns that provide information about a cell’s value.
cell_type
, the type of cellis_empty
, whether the cell has content or not
(i.e. cell_type == "empty"
)value_type
, the value type as set in the ODS filecell_content
, the cell’s value as formatted for users
of spreadsheet applicationsbase_value
, the “raw” value of the cellnumeric_value
, for numeric value types the raw value as
a numeric()
vectorcurrency_symbol
, for currency value types the currency
symbol associated with the numeric valueboolean_value
, for boolean value types the raw value as
a logical()
vectordate_value
, for date value types (dates and date-times)
the “raw” value as a character()
vectortime_value
, for time value types, the “raw” value as a
character()
vectorCells have one of four types, it is empty
if it has no
content and/or no value type, it is a merge-lead
if it is
the top-left cell of a merge group while other cells in the merge group
are classified as merge-hidden
, all other cells are simply
a cell
.
The value_type
is derived from the ODS
specification which defines seven value types:
boolean
, equivalent to R’s logical()
data
typecurrency
, a numeric value, optionally with a currency
symboldate
, a date or date-time stored in ISO 8601 format,
the ODS specification does not differentiate between a date
(e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47)float
, a numeric value, the ODS format does not
distinguish between different numeric typespercentage
, a numeric value, but formatted for display
as a percentagestring
, equivalent to R’s character()
time
, a duration, stored in ISO 8601 formatThe cell_content
column provides the value of a cell as
displayed and formatted to users of a spreadsheet application.
Non-string values are stored separately in the underlying ODS XML file,
and these are extracted as separate columns (numeric_value
,
boolean_value
, date_value
and
time_value
) by read_ods_cells()
, for numeric
and boolean values these are also coerced into a numeric()
and logical()
vector. The base_value
column
provides a combination of the cell_content
for cells that
are strings and for cells that are not string value types the value from
the relevant type specific column.
For currency value types optionally a currency symbol (e.g. GBP, EUR,
USD, JPY) can be set, this is provided in the
currency_symbol
column.
types_cells |>
dplyr::filter(row == 3) |>
dplyr::select(value_type:currency_symbol)
#> # A tibble: 10 × 5
#> value_type cell_content base_value numeric_value currency_symbol
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 string Cell with comment Cell with comment NA <NA>
#> 2 boolean FALSE false NA <NA>
#> 3 currency £1.20 1.2 1.2 GBP
#> 4 date 06/15/22 2022-06-15 NA <NA>
#> 5 time 13:24 PT13H24M56S NA <NA>
#> 6 date 15/06/22 13:24 2022-06-15T13:24:… NA <NA>
#> 7 float 12034.57 12034.56789 12035. <NA>
#> 8 percentage 55% 0.5467 0.547 <NA>
#> 9 float 6579.3 6579.298265463 6579. <NA>
#> 10 string #DIV/0! #DIV/0! NA <NA>
types_cells |>
dplyr::filter(row == 3) |>
dplyr::select(value_type, base_value, boolean_value:time_value)
#> # A tibble: 10 × 5
#> value_type base_value boolean_value date_value time_value
#> <chr> <chr> <lgl> <chr> <chr>
#> 1 string Cell with comment NA <NA> <NA>
#> 2 boolean false FALSE <NA> <NA>
#> 3 currency 1.2 NA <NA> <NA>
#> 4 date 2022-06-15 NA 2022-06-15 <NA>
#> 5 time PT13H24M56S NA <NA> PT13H24M56S
#> 6 date 2022-06-15T13:24:56 NA 2022-06-15T13:24:56 <NA>
#> 7 float 12034.56789 NA <NA> <NA>
#> 8 percentage 0.5467 NA <NA> <NA>
#> 9 float 6579.298265463 NA <NA> <NA>
#> 10 string #DIV/0! NA <NA> <NA>
The value type examples section below provides more details on the different value types and how they are stored and processed.
Finally, read_ods_cells()
provides 12 further columns
with additional metadata about each cell.
has_formula
, whether the cell value is calculated by a
formulaformula
, the formula used to calculate the cell
valuehas_error
, whether the cell value or formula results in
an errorerror_type
, the type of error the cell is
experiencinghas_annotation
, whether the cell has a
comment/annotationannotation
, the content of the cell’s
comment/annotationis_merged
, whether the cell is part of a merge
groupmerge_colspan
, for the lead cell in a merge group the
number of columns spanned by the merge groupmerge_rowspan
, for the lead cell in a merge group the
number of rows spanned by the merge groupmerge_shape
, for the lead cell in a merge group the
shape of the merge groupcell_style
, the style reference for the cellrow_style
, the style reference for the rowcol_style
, the style reference for the columncol_deftaul_cell_style
, the default style reference for
cells in the columnThe has_formula
column indicates if a cell value is
calculated by a formula, which is itself provided in the
formula
column.
types_cells |>
dplyr::filter(col == 9 & has_formula) |>
dplyr::select(value_type, cell_content, numeric_value, has_formula, formula) |>
head()
#> # A tibble: 6 × 5
#> value_type cell_content numeric_value has_formula formula
#> <chr> <chr> <dbl> <lgl> <chr>
#> 1 float 6579.3 6579. TRUE of:=[.G2]*[.H2]
#> 2 float 6579.3 6579. TRUE of:=[.G3]*[.H3]
#> 3 float 6579.3 6579. TRUE of:=[.G4]*[.H4]
#> 4 float 6579.3 6579. TRUE of:=[.G5]*[.H5]
#> 5 float 6579.3 6579. TRUE of:=[.G6]*[.H6]
#> 6 float 6579.3 6579. TRUE of:=[.G7]*[.H7]
For more on the use of formulas in ODS files you can consult the OpenFormula specification.
Formulas in spreadsheets can result in calculation errors. The
has_error
column indicates if there the cell results in an
error. The error_type
column is similar to the
=ERROR.TYPE()
function in Microsoft Excel, Google Sheets
and LibreOffice, these provide a number to denote the type of error.
error_type
will have a value from 1 to 7 or 501 to
540:
1
for #NULL! errors in Excel/Google Sheets or Err:511
in LibreOffice, typically this is some sort of error in the formula
specification such as a missing operator (Excel/Google Sheets) or a
missing variable (LibreOffice)2
for #DIV/0! (division by zero) errors3
for #VALUE! (value type) errors, e.g. when a formula
uses a string rather than a number4
for #REF! (invalid cell reference) errors5
for #NAME? errors, e.g. formula specifies a function
that doesn’t exist6
for #NUM! (calculation overflow) errors, e.g. the
result of 9999^9999
7
for #N/A (logic) errors, e.g. unable to match
values501
to 540
a LibreOffice errorLibreOffice assigns all errors an error code from 501 to 540, where
not displayed as an error like #DIV/0!
these are shown in
cell_content
as "Err:5##"
. See LibreOffice’s
documentation on error
codes for more details on these error codes.
Spreadsheet applications allow you to add notes and comments to cells
to provide addition contextual information. The
has_annotation
column indicates whether the cell has an
annotation associated with it. The annotation
column
provides the text content of the annotation.
types_cells |>
dplyr::filter(has_annotation) |>
dplyr::select(has_annotation, annotation)
#> # A tibble: 2 × 2
#> has_annotation annotation
#> <lgl> <chr>
#> 1 TRUE "Test comment"
#> 2 TRUE "Comment 2\nMore comments\nComments with space\nBold comment"
Note: different spreadsheet applications handle
annotations in different ways, read_ods_cells()
does not
take account of these differences and provides only a basic
representation of the text contained in the annotation. LibreOffice only
supports a basic “comment” annotation which is called a “note” in
Microsoft Excel and Google Sheets.
In addition to a “note” Microsoft Excel and Google Sheets proprietary formats allow for threaded comments that allow users to leave comments and reply to these comments, when saving into ODS these features are saved as annotations like notes. However the coercion of these threads into a note and inclusion of comment/reply author details varies.
Spreadsheets allow you to merge cells, largely this is for
presentational purposes. Merged cells have a cell_type
of
either merge-lead
for the top-left cell of a merge group or
merge-hidden
for a cell hidden by a merge group. The
is_merged
indicates whether the cell is either of these
types.
When merging cells in Microsoft Office and Google Sheets then only content in the top-left cell is retained, however in LibreOffice merging cells retains the content of the cells that are hidden by a merge operation.
types_cells |>
dplyr::filter(is_merged) |>
dplyr::select(address, cell_type, cell_content, is_merged)
#> # A tibble: 4 × 4
#> address cell_type cell_content is_merged
#> <chr> <chr> <chr> <lgl>
#> 1 A8 merge-lead Merged cell TRUE
#> 2 A9 merge-hidden <NA> TRUE
#> 3 A10 merge-lead Merged cell with hidden text TRUE
#> 4 A11 merge-hidden Hidden text TRUE
For merge-lead
cells you will also get the number of
columns and rows spanned by the merge group in
merge_colspan
and merge_rowspan
respectively.
This span information is used to define the merge_shape
column:
vertical
when
merge_colspan
is equal to 1 and merge_rowspan
is greater than 1horizontal
when merge_colspan
is greater than 1 and
merge_rowspan
is equal to 1rectangular
.A | B | C | D | |
---|---|---|---|---|
1 | A cell across two columns (a horizontal merge) | C1 | D1 | |
2 | A cell across two rows (a vertical merge) | B2 | C2 | D2 |
3 | B3 | C3 | D3 | |
4 | A4 | B4 | A cell across two columns and two rows (a rectangular merge) | |
5 | A5 | B5 |
merge_cells <- read_ods_cells(example_file, 3)
merge_cells |>
dplyr::select(row, col, cell_type) |>
dplyr::mutate(col = paste0("x", col)) |>
tidyr::pivot_wider(names_from = col, values_from = cell_type)
#> # A tibble: 5 × 5
#> row x1 x2 x3 x4
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 merge-lead merge-hidden <NA> <NA>
#> 2 2 merge-lead <NA> <NA> <NA>
#> 3 3 merge-hidden <NA> <NA> <NA>
#> 4 4 empty empty merge-lead merge-hidden
#> 5 5 empty empty merge-hidden merge-hidden
merge_cells |>
dplyr::filter(cell_type == "merge-lead") |>
dplyr::select(address, cell_content, merge_colspan, merge_rowspan, merge_shape)
#> # A tibble: 3 × 5
#> address cell_content merge_colspan merge_rowspan merge_shape
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 A1 "A cell across two columns (a… 2 1 horizontal
#> 2 A2 "A cell across two rows (a ve… 1 2 vertical
#> 3 C4 "A cell across two columns an… 2 2 rectangular
Style and formatting information is stored separately from cells themselves, instead cells, rows and columns are given reference identifiers to look up in the style information. There are four style identifiers for each cell:
cell_style
, style and formatting specifically applied
to that cellrow_style
, style and formatting applied to the rowcol_style
, style and formatting applied to columncol_default_cell_style
, the default style and
formatting for cells in the column, i.e. when a cell_style
is not setNote: While these style identifiers are extracted,
at present the {tidyods}
package does not currently support
the extraction of the style and formatting rules from the ODS XML.
At its core an ODS file is an XML file, extracting this information
can be processor heavy. If your primary interest is just in cell
location and cell values then setting the argument
quick = TRUE
can result in a faster extraction process as
it provides only six columns: sheet
, address
,
row
, col
, value_type
and
base_value
.
The quick extraction process varies in how the
base_value
column is constructed. When
quick = TRUE
only float and percentage values are taken in
their raw numeric form, for all other types the
cell_content
is used. That means that for currency, date
and time value types the value as formatted for spreadsheet application
users is returned rather than the “raw” value stored in the underlying
XML file.
read_ods_cells(example_file, 2, quick = TRUE) |>
dplyr::filter(row == 3)
#> # A tibble: 10 × 6
#> sheet address row col value_type base_value
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 types A3 3 1 string Cell with comment
#> 2 types B3 3 2 boolean FALSE
#> 3 types C3 3 3 currency £1.20
#> 4 types D3 3 4 date 06/15/22
#> 5 types E3 3 5 time 13:24
#> 6 types F3 3 6 date 15/06/22 13:24
#> 7 types G3 3 7 float 12034.56789
#> 8 types H3 3 8 percentage 0.5467
#> 9 types I3 3 9 float 6579.298265463
#> 10 types J3 3 10 string #DIV/0!
Boolean values (TRUE
/FALSE
) are equivalent
to R’s logical()
data type.
types_cells |>
dplyr::filter(value_type == "boolean") |>
dplyr::select(cell_content, base_value, boolean_value)
#> # A tibble: 14 × 3
#> cell_content base_value boolean_value
#> <chr> <chr> <lgl>
#> 1 TRUE true TRUE
#> 2 FALSE false FALSE
#> 3 TRUE true TRUE
#> 4 TRUE true TRUE
#> 5 FALSE false FALSE
#> 6 FALSE false FALSE
#> 7 TRUE true TRUE
#> 8 FALSE false FALSE
#> 9 FALSE false FALSE
#> 10 TRUE true TRUE
#> 11 FALSE false FALSE
#> 12 TRUE true TRUE
#> 13 TRUE true TRUE
#> 14 TRUE true TRUE
Note: ODS files produced by Google Sheets output
boolean values differently from other applications. Google Sheets stores
booleans as float
values of 0
for
FALSE
and 1
for TRUE
, and gives
the cells a formula of =FALSE()
and =TRUE()
instead. read_ods_cells()
will output the value type,
numeric value and formula written by Google Sheets, however it uses the
formula references to input the relevant
TRUE
/FALSE
value to the
boolean_value
column.
Currency values are numeric values with a
currency_symbol
, the specific formatting of the currency
symbol is handled by style formatting rules
types_cells |>
dplyr::filter(value_type == "currency") |>
dplyr::select(cell_content, base_value, numeric_value, currency_symbol)
#> # A tibble: 14 × 4
#> cell_content base_value numeric_value currency_symbol
#> <chr> <chr> <dbl> <chr>
#> 1 £1.20 1.2 1.2 GBP
#> 2 £1.20 1.2 1.2 GBP
#> 3 £1.20 1.2 1.2 GBP
#> 4 £1.20 1.2 1.2 GBP
#> 5 1.20 € 1.2 1.2 EUR
#> 6 1.20 € 1.2 1.2 EUR
#> 7 €1.20 1.2 1.2 EUR
#> 8 $1.20 1.2 1.2 CAD
#> 9 $1.20 1.2 1.2 USD
#> 10 1.20 USD 1.2 1.2 USD
#> 11 £1.20 1.2 1.2 GBP
#> 12 £1.20 1.2 1.2 GBP
#> 13 £1.20 1.2 1.2 GBP
#> 14 £1.20 1.2 1.2 GBP
The ODS specification does not differentiate between a date
(e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47). The
cell_content
column will show the date as formatted for
spreadsheet users, e.g. Wednesday, 15 June 2022
or
15/06/22
. The date_value
(and thus
base_value
) is stored in ISO 8601 format
(0000-00-00T00:00:00
).
types_cells |>
dplyr::filter(value_type == "date") |>
dplyr::select(cell_content, base_value, date_value)
#> # A tibble: 28 × 3
#> cell_content base_value date_value
#> <chr> <chr> <chr>
#> 1 15/06/22 2022-06-15 2022-06-15
#> 2 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 3 06/15/22 2022-06-15 2022-06-15
#> 4 15/06/22 13:24 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 5 Wed 15 Jun 22 2022-06-15 2022-06-15
#> 6 15-Jun-22 13:24 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 7 Wednesday 15 June 2022 2022-06-15 2022-06-15
#> 8 Wednesday 15 June 2022 01:24 PM 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 9 15-Jun-22 2022-06-15 2022-06-15
#> 10 15/06 13.24 2022-06-15T13:24:56 2022-06-15T13:24:56
#> # ℹ 18 more rows
As the date value type can contain both date and date-time values the
date_value
column is not coerced by
read_ods_cells()
into a base R data type. While you can use
base R’s as.Date()
to coerce to a date, if processing
date-times you need to replace the T
marker stored in
date_value
with a space in order for
as.POSIXct()
to coerce date-times correctly. Alternatively,
you can use lubridate::as_datetime()
without modifying the
date_value
.
types_cells |>
dplyr::filter(value_type == "date") |>
dplyr::select(cell_content, date_value) |>
dplyr::filter(!grepl("T", date_value)) |>
dplyr::mutate(date_value2 = as.Date(date_value))
#> # A tibble: 14 × 3
#> cell_content date_value date_value2
#> <chr> <chr> <date>
#> 1 15/06/22 2022-06-15 2022-06-15
#> 2 06/15/22 2022-06-15 2022-06-15
#> 3 Wed 15 Jun 22 2022-06-15 2022-06-15
#> 4 Wednesday 15 June 2022 2022-06-15 2022-06-15
#> 5 15-Jun-22 2022-06-15 2022-06-15
#> 6 15/06/2022 2022-06-15 2022-06-15
#> 7 15. Jun. 2022 2022-06-15 2022-06-15
#> 8 15 Jun 22 2022-06-15 2022-06-15
#> 9 mer. 15 juin 22 2022-06-15 2022-06-15
#> 10 2022-06-15 2022-06-15 2022-06-15
#> 11 15/06/22 2022-06-15 2022-06-15
#> 12 15/06/22 2022-06-15 2022-06-15
#> 13 15/06/22 2022-06-15 2022-06-15
#> 14 15/06/22 2022-06-15 2022-06-15
types_cells |>
dplyr::filter(value_type == "date") |>
dplyr::select(cell_content, date_value) |>
dplyr::filter(grepl("T", date_value)) |>
# dplyr::mutate(date_value2 = as.POSIXct(gsub("T", " ", y)))
dplyr::mutate(date_value2 = lubridate::as_datetime(date_value))
#> # A tibble: 14 × 3
#> cell_content date_value date_value2
#> <chr> <chr> <dttm>
#> 1 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 2 15/06/22 13:24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 3 15-Jun-22 13:24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 4 Wednesday 15 June 2022 01:24 PM 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 5 15/06 13.24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 6 2022-06-15 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 7 20220615-132456 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 8 15-Jun 1:24 PM 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 9 15 juin 2022 13:24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 10 06/15/22 13-24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 11 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 12 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 13 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 14 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
The W3C XML schema has more details on the specifics of ISO formatting of dates and date-times.
Note: ODS files produced by Microsoft Excel output
all dates as date-times with the time component of the date set to
00:00:00
, this is a further reason for
read_ods_cells()
not automatically coercing
date_value
to an R data type.
Numeric values, with the exception of currency and percentage types,
are defined as a float
value type.
types_cells |>
dplyr::filter(value_type == "float") |>
dplyr::select(cell_content, base_value, numeric_value)
#> # A tibble: 37 × 3
#> cell_content base_value numeric_value
#> <chr> <chr> <dbl>
#> 1 12035 12034.56789 12034.56789
#> 2 0.5467 0.5467 0.5467
#> 3 6579.3 6579.298265463 6579.298265
#> 4 12034.57 12034.56789 12034.56789
#> 5 6579.3 6579.298265463 6579.298265
#> 6 12034.5679 12034.56789 12034.56789
#> 7 6579.3 6579.298265463 6579.298265
#> 8 12034.568 12034.56789 12034.56789
#> 9 6579.3 6579.298265463 6579.298265
#> 10 12034.56789 12034.56789 12034.56789
#> # ℹ 27 more rows
Percentages are numeric values with formatting to include a percentage sign. As with the currency value type the specifics of formatting are handled by style formatting rules.
types_cells |>
dplyr::filter(value_type == "percentage") |>
dplyr::select(cell_content, base_value, numeric_value)
#> # A tibble: 5 × 3
#> cell_content base_value numeric_value
#> <chr> <chr> <dbl>
#> 1 55% 0.5467 0.5467
#> 2 54.7% 0.5467 0.5467
#> 3 54.67% 0.5467 0.5467
#> 4 54.670% 0.5467 0.5467
#> 5 54.6700% 0.5467 0.5467
String values are derived from the cell_content
, the
text representation of the cell as seen by a spreadsheet application
user.
types_cells |>
dplyr::filter(col == 1 & row > 1 & value_type == "string") |>
dplyr::select(cell_content, base_value)
#> # A tibble: 14 × 2
#> cell_content base_value
#> <chr> <chr>
#> 1 "Cell" "Cell"
#> 2 "Cell with comment" "Cell with comment"
#> 3 "Cell with\nMultiple lines" "Cell with\nMultiple lines"
#> 4 "Cell with repeated spaces" "Cell with repeated spaces"
#> 5 "Cells with spaces\nAnd multiple lines" "Cells with spaces\nAnd mu…
#> 6 "Cell with formatted comment" "Cell with formatted comment"
#> 7 "Merged cell" "Merged cell"
#> 8 "Merged cell with hidden text" "Merged cell with hidden text"
#> 9 "Hidden text" "Hidden text"
#> 10 "Cell with bold text" "Cell with bold text"
#> 11 "Repeated rows" "Repeated rows"
#> 12 "Repeated rows" "Repeated rows"
#> 13 "Repeated rows" "Repeated rows"
#> 14 "une cellule dans une feuille de calcul" "une cellule dans une feuille …
While the ODS specification allows for a separate
string-value
attribute in the XML table-cell tags, in
practice this is not implemented by the major spreadsheet
applications.
Time values are defined as a duration as specified by ISO 8601, in
the format PT00H00M00S
.
types_cells |>
dplyr::filter(value_type == "time") |>
dplyr::select(cell_content, base_value, time_value)
#> # A tibble: 14 × 3
#> cell_content base_value time_value
#> <chr> <chr> <chr>
#> 1 13:24:56 PT13H24M56S PT13H24M56S
#> 2 13:24 PT13H24M56S PT13H24M56S
#> 3 01:24 PM PT13H24M56S PT13H24M56S
#> 4 1:24 PM PT13H24M56S PT13H24M56S
#> 5 1.24 PM PT13H24M56S PT13H24M56S
#> 6 13h24 PT13H24M56S PT13H24M56S
#> 7 1324 PT13H24M56S PT13H24M56S
#> 8 13 24 PT13H24M56S PT13H24M56S
#> 9 13-24-56 PT13H24M56S PT13H24M56S
#> 10 13:24:56.00 PT13H24M56S PT13H24M56S
#> 11 13:24:56 PT13H24M56S PT13H24M56S
#> 12 13:24:56 PT13H24M56S PT13H24M56S
#> 13 13:24:56 PT13H24M56S PT13H24M56S
#> 14 13:24:56 PT13H24M56S PT13H24M56S
As with date values time values are not automatically coerced to an R
data type. To use base R’s as.difftime()
you will need to
remove or convert the letter characters, whereas
lubridate::as.duration()
can coerce without
modification.
types_cells |>
dplyr::filter(value_type == "time") |>
dplyr::select(cell_content, time_value) |>
# dplyr::mutate(
# time_value2 = as.difftime(gsub("[PTS]", "", gsub("[HM]", ":", time_value)))
# )
dplyr::mutate(time_value2 = lubridate::as.duration(time_value))
#> # A tibble: 14 × 3
#> cell_content time_value time_value2
#> <chr> <chr> <Duration>
#> 1 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 2 13:24 PT13H24M56S 48296s (~13.42 hours)
#> 3 01:24 PM PT13H24M56S 48296s (~13.42 hours)
#> 4 1:24 PM PT13H24M56S 48296s (~13.42 hours)
#> 5 1.24 PM PT13H24M56S 48296s (~13.42 hours)
#> 6 13h24 PT13H24M56S 48296s (~13.42 hours)
#> 7 1324 PT13H24M56S 48296s (~13.42 hours)
#> 8 13 24 PT13H24M56S 48296s (~13.42 hours)
#> 9 13-24-56 PT13H24M56S 48296s (~13.42 hours)
#> 10 13:24:56.00 PT13H24M56S 48296s (~13.42 hours)
#> 11 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 12 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 13 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 14 13:24:56 PT13H24M56S 48296s (~13.42 hours)
The W3C XML schema has more details on the specifics of ISO formatting of dates and date-times.