--- title: "Extracting cell information" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Extracting cell information} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(tidyods) ``` The aim of `{tidyods}` is to extract information about cells in OpenDocument Spreadsheet (ODS) files into a "tidy" format. In addition to the value contained by a cell it extracts a range of addition metadata about the cells. This vignette explains how to use `tidyods::read_ods_cells()` and the types of information about cells it extracts. The [*Spreadsheet Munging Strategies*](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) book by Duncan Garmonsway explores in more detail how you can use datasets of "tidy cells" to work with non-tabular data stored in (Excel) spreadsheets. ## Reading data from cells You read in data from an ODS file with the `read_ods_cells()` function. The resulting tibble provides 28 columns of information about cells, with each row representing an individual cell in the input spreadsheet. ```{r} example_file <- system.file("extdata", "basic_example.ods", package = "tidyods") types_cells <- read_ods_cells(example_file, 2) dplyr::glimpse(types_cells) ``` There are four arguments to `read_ods_cells()`: - `path`, the path to the ODS file - `sheet`, the sheet you wish to extract - `quick`, whether to extract a subset of cell information - `quiet`, whether to show console messages or not The `sheet` argument can either be the index number of a sheet (ranging from 1 to the total number of sheets in the spreadsheet document) or the name of a sheet (e.g. `"types"` instead of `2` in the example above). The `quick` argument extracts only a small number of columns and is useful if you are largely concerned with cell values; see the [quick extraction](#quick-extraction) section for more details. The `quiet` argument turns off console messages, ODS files are written in XML which can take significant time to process and so progress messages are provided to the console for interactive users. It is advised to set this to false if you are doing batch operations. ## Cell information There are 28 types of information returned by `read_ods_cells()`, these can be grouped into three categories: location information, value information, and other metadata. ### Location information The first four columns of information relate to a cell's location in the spreadsheet. - `sheet`, the name of the sheet the cell belongs to - `address`, the address of the cell within a sheet using "A1" notation - `row`, the row number of the cell - `col`, the column number of the cell The `address` column is derived from the `row` and `col` using the `{cellranger}` package. It provides the address in the "A1" notation commonly used in spreadsheet applications where the letter corresponds to a column and the number corresponds to a row. The "A1" notation is used by ODS files for calculation formulas. ### Value information There are 10 columns that provide information about a cell's value. - `cell_type`, the type of cell - `is_empty`, whether the cell has content or not (i.e. `cell_type == "empty"`) - `value_type`, the value type as set in the ODS file - `cell_content`, the cell's value as formatted for users of spreadsheet applications - `base_value`, the "raw" value of the cell - `numeric_value`, for numeric value types the raw value as a `numeric()` vector - `currency_symbol`, for currency value types the currency symbol associated with the numeric value - `boolean_value`, for boolean value types the raw value as a `logical()` vector - `date_value`, for date value types (dates and date-times) the "raw" value as a `character()` vector - `time_value`, for time value types, the "raw" value as a `character()` vector Cells have one of four types, it is `empty` if it has no content and/or no value type, it is a `merge-lead` if it is the top-left cell of a merge group while other cells in the merge group are classified as `merge-hidden`, all other cells are simply a `cell`. The `value_type` is derived from the [ODS specification](https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#__RefHeading__1417680_253892949) which defines seven value types: - `boolean`, equivalent to R's `logical()` data type - `currency`, a numeric value, optionally with a currency symbol - `date`, a date or date-time stored in ISO 8601 format, the ODS specification does not differentiate between a date (e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47) - `float`, a numeric value, the ODS format does not distinguish between different numeric types - `percentage`, a numeric value, but formatted for display as a percentage - `string`, equivalent to R's `character()` - `time`, a duration, stored in ISO 8601 format The `cell_content` column provides the value of a cell as displayed and formatted to users of a spreadsheet application. Non-string values are stored separately in the underlying ODS XML file, and these are extracted as separate columns (`numeric_value`, `boolean_value`, `date_value` and `time_value`) by `read_ods_cells()`, for numeric and boolean values these are also coerced into a `numeric()` and `logical()` vector. The `base_value` column provides a combination of the `cell_content` for cells that are strings and for cells that are not string value types the value from the relevant type specific column. For currency value types optionally a currency symbol (e.g. GBP, EUR, USD, JPY) can be set, this is provided in the `currency_symbol` column. ```{r} types_cells |> dplyr::filter(row == 3) |> dplyr::select(value_type:currency_symbol) types_cells |> dplyr::filter(row == 3) |> dplyr::select(value_type, base_value, boolean_value:time_value) ``` The [value type examples](#value-types-examples) section below provides more details on the different value types and how they are stored and processed. ### Other metadata Finally, `read_ods_cells()` provides 12 further columns with additional metadata about each cell. - `has_formula`, whether the cell value is calculated by a formula - `formula`, the formula used to calculate the cell value - `has_error`, whether the cell value or formula results in an error - `error_type`, the type of error the cell is experiencing - `has_annotation`, whether the cell has a comment/annotation - `annotation`, the content of the cell's comment/annotation - `is_merged`, whether the cell is part of a merge group - `merge_colspan`, for the lead cell in a merge group the number of columns spanned by the merge group - `merge_rowspan`, for the lead cell in a merge group the number of rows spanned by the merge group - `merge_shape`, for the lead cell in a merge group the shape of the merge group - `cell_style`, the style reference for the cell - `row_style`, the style reference for the row - `col_style`, the style reference for the column - `col_deftaul_cell_style`, the default style reference for cells in the column #### Formulas The `has_formula` column indicates if a cell value is calculated by a formula, which is itself provided in the `formula` column. ```{r} types_cells |> dplyr::filter(col == 9 & has_formula) |> dplyr::select(value_type, cell_content, numeric_value, has_formula, formula) |> head() ``` For more on the use of formulas in ODS files you can consult the [OpenFormula specification](https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part4-formula/OpenDocument-v1.3-os-part4-formula.html). #### Errors Formulas in spreadsheets can result in calculation errors. The `has_error` column indicates if there the cell results in an error. The `error_type` column is similar to the `=ERROR.TYPE()` function in Microsoft Excel, Google Sheets and LibreOffice, these provide a number to denote the type of error. `error_type` will have a value from 1 to 7 or 501 to 540: - `1` for #NULL! errors in Excel/Google Sheets or Err:511 in LibreOffice, typically this is some sort of error in the formula specification such as a missing operator (Excel/Google Sheets) or a missing variable (LibreOffice) - `2` for #DIV/0! (division by zero) errors - `3` for #VALUE! (value type) errors, e.g. when a formula uses a string rather than a number - `4` for #REF! (invalid cell reference) errors - `5` for #NAME? errors, e.g. formula specifies a function that doesn't exist - `6` for #NUM! (calculation overflow) errors, e.g. the result of `9999^9999` - `7` for #N/A (logic) errors, e.g. unable to match values - `501` to `540` a LibreOffice error LibreOffice assigns all errors an error code from 501 to 540, where not displayed as an error like `#DIV/0!` these are shown in `cell_content` as `"Err:5##"`. See LibreOffice's documentation on [error codes](https://help.libreoffice.org/7.3/en-GB/text/scalc/05/02140000.html) for more details on these error codes. #### Annotations Spreadsheet applications allow you to add notes and comments to cells to provide addition contextual information. The `has_annotation` column indicates whether the cell has an annotation associated with it. The `annotation` column provides the text content of the annotation. ```{r} types_cells |> dplyr::filter(has_annotation) |> dplyr::select(has_annotation, annotation) ``` **Note:** different spreadsheet applications handle annotations in different ways, `read_ods_cells()` does not take account of these differences and provides only a basic representation of the text contained in the annotation. LibreOffice only supports a basic "comment" annotation which is called a "note" in Microsoft Excel and Google Sheets. In addition to a "note" Microsoft Excel and Google Sheets proprietary formats allow for threaded comments that allow users to leave comments and reply to these comments, when saving into ODS these features are saved as annotations like notes. However the coercion of these threads into a note and inclusion of comment/reply author details varies. #### Merged cells Spreadsheets allow you to merge cells, largely this is for presentational purposes. Merged cells have a `cell_type` of either `merge-lead` for the top-left cell of a merge group or `merge-hidden` for a cell hidden by a merge group. The `is_merged` indicates whether the cell is either of these types. When merging cells in Microsoft Office and Google Sheets then only content in the top-left cell is retained, however in LibreOffice merging cells retains the content of the cells that are hidden by a merge operation. ```{r} types_cells |> dplyr::filter(is_merged) |> dplyr::select(address, cell_type, cell_content, is_merged) ``` For `merge-lead` cells you will also get the number of columns and rows spanned by the merge group in `merge_colspan` and `merge_rowspan` respectively. This span information is used to define the `merge_shape` column: - the shape of a merge group is defined as `vertical` when `merge_colspan` is equal to 1 and `merge_rowspan` is greater than 1 - the shape of a merge group is defined as `horizontal` when `merge_colspan` is greater than 1 and `merge_rowspan` is equal to 1 - otherwise the shape of the merge group is `rectangular`.
A B C D
1 A cell across two columns (a horizontal merge) C1 D1
2 A cell across two rows (a vertical merge) B2 C2 D2
3 B3 C3 D3
4 A4 B4 A cell across two columns and two rows (a rectangular merge)
5 A5 B5
```{r} merge_cells <- read_ods_cells(example_file, 3) merge_cells |> dplyr::select(row, col, cell_type) |> dplyr::mutate(col = paste0("x", col)) |> tidyr::pivot_wider(names_from = col, values_from = cell_type) merge_cells |> dplyr::filter(cell_type == "merge-lead") |> dplyr::select(address, cell_content, merge_colspan, merge_rowspan, merge_shape) ``` ### Style information Style and formatting information is stored separately from cells themselves, instead cells, rows and columns are given reference identifiers to look up in the style information. There are four style identifiers for each cell: - `cell_style`, style and formatting specifically applied to that cell - `row_style`, style and formatting applied to the row - `col_style`, style and formatting applied to column - `col_default_cell_style`, the default style and formatting for cells in the column, i.e. when a `cell_style` is not set **Note:** While these style identifiers are extracted, at present the `{tidyods}` package does not currently support the extraction of the style and formatting rules from the ODS XML. ## Quick extraction At its core an ODS file is an XML file, extracting this information can be processor heavy. If your primary interest is just in cell location and cell values then setting the argument `quick = TRUE` can result in a faster extraction process as it provides only six columns: `sheet`, `address`, `row`, `col`, `value_type` and `base_value`. The quick extraction process varies in how the `base_value` column is constructed. When `quick = TRUE` only float and percentage values are taken in their raw numeric form, for all other types the `cell_content` is used. That means that for currency, date and time value types the value as formatted for spreadsheet application users is returned rather than the "raw" value stored in the underlying XML file. ```{r} read_ods_cells(example_file, 2, quick = TRUE) |> dplyr::filter(row == 3) ``` ## Value types examples ### Boolean Boolean values (`TRUE`/`FALSE`) are equivalent to R's `logical()` data type. ```{r} types_cells |> dplyr::filter(value_type == "boolean") |> dplyr::select(cell_content, base_value, boolean_value) ``` **Note:** ODS files produced by Google Sheets output boolean values differently from other applications. Google Sheets stores booleans as `float` values of `0` for `FALSE` and `1` for `TRUE`, and gives the cells a formula of `=FALSE()` and `=TRUE()` instead. `read_ods_cells()` will output the value type, numeric value and formula written by Google Sheets, however it uses the formula references to input the relevant `TRUE`/`FALSE` value to the `boolean_value` column. ### Currency Currency values are numeric values with a `currency_symbol`, the specific formatting of the currency symbol is handled by style formatting rules ```{r} types_cells |> dplyr::filter(value_type == "currency") |> dplyr::select(cell_content, base_value, numeric_value, currency_symbol) ``` ### Date The ODS specification does not differentiate between a date (e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47). The `cell_content` column will show the date as formatted for spreadsheet users, e.g. `Wednesday, 15 June 2022` or `15/06/22`. The `date_value` (and thus `base_value`) is stored in ISO 8601 format (`0000-00-00T00:00:00`). ```{r} types_cells |> dplyr::filter(value_type == "date") |> dplyr::select(cell_content, base_value, date_value) ``` As the date value type can contain both date and date-time values the `date_value` column is not coerced by `read_ods_cells()` into a base R data type. While you can use base R's `as.Date()` to coerce to a date, if processing date-times you need to replace the `T` marker stored in `date_value` with a space in order for `as.POSIXct()` to coerce date-times correctly. Alternatively, you can use `lubridate::as_datetime()` without modifying the `date_value`. ```{r} types_cells |> dplyr::filter(value_type == "date") |> dplyr::select(cell_content, date_value) |> dplyr::filter(!grepl("T", date_value)) |> dplyr::mutate(date_value2 = as.Date(date_value)) types_cells |> dplyr::filter(value_type == "date") |> dplyr::select(cell_content, date_value) |> dplyr::filter(grepl("T", date_value)) |> # dplyr::mutate(date_value2 = as.POSIXct(gsub("T", " ", y))) dplyr::mutate(date_value2 = lubridate::as_datetime(date_value)) ``` The [W3C XML schema](https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#isoformats) has more details on the specifics of ISO formatting of dates and date-times. **Note:** ODS files produced by Microsoft Excel output all dates as date-times with the time component of the date set to `00:00:00`, this is a further reason for `read_ods_cells()` not automatically coercing `date_value` to an R data type. ### Float Numeric values, with the exception of currency and percentage types, are defined as a `float` value type. ```{r} #| echo: FALSE old <- options( pillar.sigfig = 10 ) ``` ```{r} types_cells |> dplyr::filter(value_type == "float") |> dplyr::select(cell_content, base_value, numeric_value) ``` ### Percentage Percentages are numeric values with formatting to include a percentage sign. As with the currency value type the specifics of formatting are handled by style formatting rules. ```{r} types_cells |> dplyr::filter(value_type == "percentage") |> dplyr::select(cell_content, base_value, numeric_value) ``` ```{r} #| echo: FALSE options(old) ``` ### String String values are derived from the `cell_content`, the text representation of the cell as seen by a spreadsheet application user. ```{r} types_cells |> dplyr::filter(col == 1 & row > 1 & value_type == "string") |> dplyr::select(cell_content, base_value) ``` While the ODS specification allows for a separate `string-value` attribute in the XML table-cell tags, in practice this is not implemented by the major spreadsheet applications. ### Time Time values are defined as a duration as specified by ISO 8601, in the format `PT00H00M00S`. ```{r} types_cells |> dplyr::filter(value_type == "time") |> dplyr::select(cell_content, base_value, time_value) ``` As with date values time values are not automatically coerced to an R data type. To use base R's `as.difftime()` you will need to remove or convert the letter characters, whereas `lubridate::as.duration()` can coerce without modification. ```{r} types_cells |> dplyr::filter(value_type == "time") |> dplyr::select(cell_content, time_value) |> # dplyr::mutate( # time_value2 = as.difftime(gsub("[PTS]", "", gsub("[HM]", ":", time_value))) # ) dplyr::mutate(time_value2 = lubridate::as.duration(time_value)) ``` The [W3C XML schema](https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#isoformats) has more details on the specifics of ISO formatting of dates and date-times.