A thin wrapper around data.table::fread whose results are memoised and cached for 24 hours.
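The memoise-and-expire behaviour described above can be sketched roughly as follows. This is an illustration only, not the package's actual source: read_csv_cached is a hypothetical name, and the sketch assumes the memoise and cachem packages are installed alongside data.table.

```r
# Hypothetical sketch: a memoised wrapper around data.table::fread whose
# cached results expire after 24 hours. Not the package's own implementation.
read_csv_cached <- memoise::memoise(
  function(...) data.table::fread(...),
  cache = cachem::cache_memory(max_age = 24 * 60 * 60)  # max_age is in seconds
)

# A small made-up CSV passed as text so no network access is needed.
csv_text <- "id,name\n1,a\n2,b\n"

first  <- read_csv_cached(text = csv_text)  # parsed by fread
second <- read_csv_cached(text = csv_text)  # same arguments, served from the cache
```

Because the cache is keyed on the arguments, a repeated call with the same URL and options within the expiry window would not download the file again.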
Arguments
- ...
Arguments passed on to data.table::fread
input
A single character string. The value is inspected and deferred to either file= (if no \n present), text= (if at least one \n is present) or cmd= (if no \n is present, at least one space is present, and it isn't a file name). Exactly one of input=, file=, text=, or cmd= should be used in the same call.
file
File name in working directory, path to file (passed through path.expand for convenience), or a URL starting http://, file://, etc. Compressed files with extension .gz and .bz2 are supported if the R.utils package is installed.
text
The input data itself as a character vector of one or more lines, for example as returned by readLines().
cmd
A shell command that pre-processes the file; e.g. fread(cmd=paste("grep",word,"filename")). See Details.
sep
The separator between columns. Defaults to the character in the set [,\t |;:] that separates the sample of rows into the largest number of lines with the same number of fields. Use NULL or "" to specify no separator; i.e. each line a single character column like base::readLines does.
sep2
The separator within columns. A list column will be returned where each cell is a vector of values. This is much faster using less working memory than strsplit afterwards or similar techniques. For each column sep2 can be different and is the first character in the same set above [,\t |;], other than sep, that exists inside each field outside quoted regions in the sample. NB: sep2 is not yet implemented.
nrows
The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by fread almost instantly using the large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.
header
Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name.
na.strings
A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character, is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA, (for example, when na.strings="NA").
stringsAsFactors
Convert all or some character columns to factors? Acceptable inputs are TRUE, FALSE, or a decimal value between 0.0 and 1.0. For stringsAsFactors = FALSE, all string columns are stored as character vs. all stored as factor when TRUE. When stringsAsFactors = p for 0 <= p <= 1, string columns col are stored as factor if uniqueN(col)/nrow < p.
verbose
Be chatty and report timings?
skip
If 0 (default), start on the first line and from there find the first row with a consistent number of columns. This automatically avoids irregular header information before the column names row. skip>0 means ignore the first skip rows manually. skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
select
A vector of column names or numbers to keep, drop the rest. select may specify types too in the same way as colClasses; i.e., a vector of colname=type pairs, or a list of type=col(s) pairs. In all forms of select, the order that the columns are specified determines the order of the columns in the result.
drop
Vector of column names or numbers to drop, keep the rest.
colClasses
As in utils::read.csv; i.e., an unnamed vector of types corresponding to the columns in the file, or a named vector specifying types for a subset of the columns by name. The default, NULL, means types are inferred from the data in the file. Further, data.table supports a named list of vectors of column names or numbers where the list names are the class names; see examples. The list form makes it easier to set a batch of columns to be a particular class. When column numbers are used in the list form, they refer to the column number in the file, not the column number after select or drop has been applied. If type coercion results in an error, introduces NAs, or would result in loss of accuracy, the coercion attempt is aborted for that column with a warning and the column's type is left unchanged. If you really desire data loss (e.g. reading 3.14 as integer) you have to truncate such columns afterwards yourself explicitly so that this is clear to future readers of your code.
integer64
"integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as utils::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".
dec
The decimal separator as in utils::read.csv. When "auto" (the default), an attempt is made to decide whether "." or "," is more suitable for this input. See Details.
col.names
A vector of optional names for the variables (columns). The default is to use the header column if present or detected, or if not "V" followed by the column number. This is applied after check.names and before key and index.
check.names
Default is FALSE. If TRUE then the names of the variables in the data.table are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
encoding
Default is "unknown". Other possible options are "UTF-8" and "Latin-1". Note: it is not used to re-encode the input, rather enables handling of encoded strings in their native encoding.
quote
By default ("\""), if a field starts with a double quote, fread handles embedded quotes robustly as explained under Details. If it fails, then another attempt is made to read the field as is, i.e., as if quotes are disabled. By setting quote="", the field is always read as if quotes are disabled. It is not expected to ever need to pass anything other than "" to quote; i.e., to turn it off.
strip.white
Logical, default TRUE, in which case leading and trailing whitespace is stripped from unquoted "character" fields. "numeric" fields are always stripped of leading and trailing whitespace.
fill
Logical or integer (default is FALSE). If TRUE then in case the rows have unequal length, the number of columns is estimated and blank fields are implicitly filled. If an integer is provided it is used as an upper bound for the number of columns. If fill=Inf then the whole file is read for detecting the number of columns.
blank.lines.skip
Logical, default is FALSE. If TRUE blank lines in the input are ignored.
key
Character vector of one or more column names which is passed to setkey. Only valid when argument data.table=TRUE. Where applicable, this should refer to column names given in col.names.
index
Character vector or list of character vectors of one or more column names which is passed to setindexv. As with key, comma-separated notation like index="x,y,z" is accepted for convenience. Only valid when argument data.table=TRUE. Where applicable, this should refer to column names given in col.names.
showProgress
TRUE displays progress on the console if the ETA is greater than 3 seconds. It is produced in fread's C code where the very nice (but R level) txtProgressBar and tkProgressBar are not easily available.
data.table
TRUE returns a data.table. FALSE returns a data.frame. The default for this argument can be changed with options(datatable.fread.datatable=FALSE).
nThread
The number of threads to use. Experiment to see what works best for your data on your hardware.
logical01
If TRUE a column containing only 0s and 1s will be read as logical, otherwise as integer.
keepLeadingZeros
If TRUE a column containing numeric data with leading zeros will be read as character, otherwise leading zeros will be removed and converted to numeric.
yaml
If TRUE, fread will attempt to parse (using yaml.load) the top of the input as YAML, and further to glean parameters relevant to improving the performance of fread on the data itself. The entire YAML section is returned as parsed into a list in the yaml_metadata attribute. See Details.
autostart
Deprecated and ignored with warning. Please use skip instead.
tmpdir
Directory to use as the tmpdir argument for any tempfile calls, e.g. when the input is a URL or a shell command. The default is tempdir() which can be controlled by setting TMPDIR before starting the R session; see base::tempdir.
tz
Relevant to datetime values which have no Z or UTC-offset at the end, i.e. unmarked datetime, as written by utils::write.csv. The default tz="UTC" reads unmarked datetime as UTC POSIXct efficiently. tz="" reads unmarked datetime as type character (slowly) so that as.POSIXct can interpret (slowly) the character datetimes in local timezone; e.g. by using "POSIXct" in colClasses=. Note that fwrite() by default writes datetime in UTC including the final Z and therefore fwrite's output will be read by fread consistently and quickly without needing to use tz= or colClasses=. If the TZ environment variable is set to "UTC" (or "" on non-Windows where unset vs "" is significant) then the R session's timezone is already UTC and tz="" will result in unmarked datetimes being read as UTC POSIXct. For more information, please see the news items from v1.13.0 and v1.14.0.
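Since every argument above is forwarded to data.table::fread, a few of them can be tried out directly with fread. The sketch below is only an illustration: it reads a small made-up CSV passed via text= so that it runs without network access, and the column names and values are invented for the example.

```r
library(data.table)

# A small in-memory CSV (made-up data) so the example needs no download.
csv_text <- "id,name,score\n1,alice,3.5\n2,bob,4.0\n3,carol,2.5\n"

# select: keep only these columns, in the order given.
picked <- fread(text = csv_text, select = c("score", "id"))

# colClasses: a named vector sets the type for a subset of columns.
typed <- fread(text = csv_text, colClasses = c(id = "character"))

# nrows = 0: return only the column names and typed empty columns.
layout <- fread(text = csv_text, nrows = 0)

names(picked)    # column order follows select
class(typed$id)  # type requested through colClasses
dim(layout)      # zero rows, all columns
```

The same named arguments should be usable in a call to this wrapper, with a URL or file path in place of text=.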
Value
A data frame as created by data.table::fread().
Examples
# \donttest{
try({ # prevents cran errors
csv_from_url("https://github.com/nflverse/nflverse-data/releases/download/test/combines.csv")
})
#> season draft_year draft_team draft_round draft_ovr pfr_id
#> <int> <int> <char> <int> <int> <char>
#> 1: 2000 2000 New York Jets 1 13 AbraJo00
#> 2: 2000 2000 Seattle Seahawks 1 19 AlexSh00
#> 3: 2000 2000 Kansas City Chiefs 6 188 AlfoDa20
#> 4: 2000 NA NA NA
#> 5: 2000 2000 Carolina Panthers 1 23 AndeRa21
#> ---
#> 7676: 2022 2022 Green Bay Packers 1 28 WyatDe00
#> 7677: 2022 NA NA NA WydeJa00
#> 7678: 2022 2022 Cleveland Browns 4 124 YorkCa00
#> 7679: 2022 NA NA NA
#> 7680: 2022 2022 New England Patriots 4 137 ZappBa00
#> cfb_id player_name pos school ht wt
#> <char> <char> <char> <char> <char> <int>
#> 1: John Abraham OLB South Carolina 6-4 252
#> 2: shaun-alexander-1 Shaun Alexander RB Alabama 6-0 218
#> 3: Darnell Alford OT Boston Col. 6-4 334
#> 4: Kyle Allamon TE Texas Tech 6-2 253
#> 5: Rashard Anderson CB Jackson State 6-2 206
#> ---
#> 7676: devonte-wyatt-1 Devonte Wyatt DT Georgia 6-3 304
#> 7677: jalen-wydermyer-1 Jalen Wydermyer TE Texas A&M 6-4 255
#> 7678: cade-york-1 Cade York K LSU 6-1 206
#> 7679: Nick Zakelj OT Fordham 6-6 316
#> 7680: bailey-zappe-1 Bailey Zappe QB Western Kentucky 6-1 215
#> forty bench vertical broad_jump cone shuttle
#> <num> <int> <num> <int> <num> <num>
#> 1: 4.55 NA NA NA NA NA
#> 2: 4.58 NA NA NA NA NA
#> 3: 5.56 23 25.0 94 8.48 4.98
#> 4: 4.97 NA 29.0 104 7.29 4.49
#> 5: 4.55 NA 34.0 123 7.18 4.15
#> ---
#> 7676: 4.77 NA 29.0 111 NA NA
#> 7677: NA NA NA NA NA NA
#> 7678: NA 12 NA NA NA NA
#> 7679: 5.13 27 28.5 110 7.75 4.71
#> 7680: 4.88 NA 30.0 109 7.19 4.40
# }