
This is a thin wrapper around data.table::fread, memoised and cached for twenty-four hours.
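
The caching behaviour can be pictured with a small base R sketch. This is not the package's implementation: `memoise_for()` and `slow_read()` are hypothetical names used only for illustration, and the sketch assumes an in-memory cache keyed on the call's arguments.

```r
# Hypothetical helper: wrap `f` so results are reused for `max_age` seconds.
memoise_for <- function(f, max_age = 24 * 60 * 60) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    # Key the cache on the deparsed arguments of the call.
    key <- paste(deparse(list(...)), collapse = "")
    hit <- cache[[key]]
    if (!is.null(hit)) {
      age <- as.numeric(difftime(Sys.time(), hit$time, units = "secs"))
      if (age < max_age) return(hit$value)
    }
    # Cache miss or stale entry: call the wrapped function and store the result.
    value <- f(...)
    cache[[key]] <- list(value = value, time = Sys.time())
    value
  }
}

# Hypothetical stand-in for a slow download; counts how often it really runs.
calls <- 0
slow_read <- function(x) {
  calls <<- calls + 1
  toupper(x)
}

cached_read <- memoise_for(slow_read)
first <- cached_read("games.csv")
second <- cached_read("games.csv") # served from the cache
calls
```

In practice this means a repeated call with the same URL within the cache window should not trigger a second download.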

Usage

csv_from_url(...)

Arguments

...

Arguments passed on to data.table::fread

input

A single character string. The value is inspected and deferred to either file= (if no \n present), text= (if at least one \n is present) or cmd= (if no \n is present, at least one space is present, and it isn't a file name). Exactly one of input=, file=, text=, or cmd= should be used in the same call.

file

File name in working directory, path to file (passed through path.expand for convenience), or a URL starting http://, file://, etc. Compressed files with extension .gz and .bz2 are supported if the R.utils package is installed.

text

The input data itself as a character vector of one or more lines, for example as returned by readLines().
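
As a rough illustration of how input= and text= relate, the sketch below calls data.table::fread directly on made-up data, on the assumption that csv_from_url forwards these arguments unchanged.

```r
library(data.table)

# Made-up sample data, one element per line.
lines <- c("team,score", "MIN,17", "ATL,14")

# text= takes the lines directly, e.g. as returned by readLines().
from_text <- fread(text = lines)

# input= containing at least one "\n" is handled like text=.
from_input <- fread(input = paste(lines, collapse = "\n"))

names(from_text)
nrow(from_text)
```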

cmd

A shell command that pre-processes the file; e.g. fread(cmd=paste("grep",word,"filename")). See Details.

sep

The separator between columns. Defaults to the character in the set [,\t |;:] that separates the sample of rows into the most number of lines with the same number of fields. Use NULL or "" to specify no separator; i.e. each line a single character column like base::readLines does.

sep2

The separator within columns. A list column will be returned where each cell is a vector of values. This is much faster using less working memory than strsplit afterwards or similar techniques. For each column sep2 can be different and is the first character in the same set above [,\t |;], other than sep, that exists inside each field outside quoted regions in the sample. NB: sep2 is not yet implemented.

nrows

The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by fread almost instantly using the large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.
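
A dry run with nrows=0 might look like the following sketch. It uses data.table::fread on made-up inline data rather than a real file or URL.

```r
library(data.table)

# Made-up sample standing in for a large file.
csv <- c(
  "game_id,season,spread_line",
  "1999_01_MIN_ATL,1999,-4.0",
  "1999_01_KC_CHI,1999,-3.0"
)

# nrows = 0 returns only the column names and empty, typed columns.
layout <- fread(text = csv, nrows = 0)

names(layout)
nrow(layout)
sapply(layout, class)
```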

header

Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name.

na.strings

A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".

stringsAsFactors

Convert all character columns to factors?

verbose

Be chatty and report timings?

skip

If 0 (default) start on the first line and from there finds the first row with a consistent number of columns. This automatically avoids irregular header information before the column names row. skip>0 means ignore the first skip rows manually. skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
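
The string form of skip can be sketched as below, again with data.table::fread and made-up lines that imitate a file with notes above the column names row.

```r
library(data.table)

# Made-up input: two lines of notes, then the real table.
raw <- c(
  "Exported 2023-01-08",
  "source: example",
  "team,score",
  "NYG,16",
  "PHI,22"
)

# Start reading at the first line that contains "team,score".
dt <- fread(text = raw, skip = "team,score")

names(dt)
nrow(dt)
```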

select

A vector of column names or numbers to keep, drop the rest. select may specify types too in the same way as colClasses; i.e., a vector of colname=type pairs, or a list of type=col(s) pairs. In all forms of select, the order that the columns are specified determines the order of the columns in the result.
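
A short sketch of select on made-up data, showing that the result follows the order given in select and that the colname=type form sets types at the same time. It assumes a data.table version recent enough to support types in select.

```r
library(data.table)

csv <- c("a,b,c", "1,2,3", "4,5,6")

# Keep columns c and a, in that order; b is dropped.
reordered <- fread(text = csv, select = c("c", "a"))

# The same selection with types given as colname = type pairs.
typed <- fread(text = csv, select = c(c = "character", a = "integer"))

names(reordered)
sapply(typed, class)
```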

drop

Vector of column names or numbers to drop, keep the rest.

colClasses

As in utils::read.csv; i.e., an unnamed vector of types corresponding to the columns in the file, or a named vector specifying types for a subset of the columns by name. The default, NULL means types are inferred from the data in the file. Further, data.table supports a named list of vectors of column names or numbers where the list names are the class names; see examples. The list form makes it easier to set a batch of columns to be a particular class. When column numbers are used in the list form, they refer to the column number in the file not the column number after select or drop has been applied. If type coercion results in an error, introduces NAs, or would result in loss of accuracy, the coercion attempt is aborted for that column with warning and the column's type is left unchanged. If you really desire data loss (e.g. reading 3.14 as integer) you have to truncate such columns afterwards yourself explicitly so that this is clear to future readers of your code.
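
The named-vector and list forms of colClasses can be compared with a small sketch on made-up data, where a code column would otherwise lose its leading zeros.

```r
library(data.table)

csv <- c("id,code,x", "1,007,3.5", "2,042,4.0")

# Named vector: set the type of a subset of columns by name.
by_name <- fread(text = csv, colClasses = c(code = "character"))

# List form: the list names are classes, the values are the columns.
by_list <- fread(text = csv, colClasses = list(character = c("id", "code")))

by_name$code
sapply(by_list, class)
```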

integer64

"integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as utils::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".

dec

The decimal separator as in utils::read.csv. If not "." (default) then usually ",". See details.

col.names

A vector of optional names for the variables (columns). The default is to use the header column if present or detected, or if not "V" followed by the column number. This is applied after check.names and before key and index.

check.names

default is FALSE. If TRUE then the names of the variables in the data.table are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.

encoding

default is "unknown". Other possible options are "UTF-8" and "Latin-1". Note: it is not used to re-encode the input, rather enables handling of encoded strings in their native encoding.

quote

By default ("\""), if a field starts with a double quote, fread handles embedded quotes robustly as explained under Details. If it fails, then another attempt is made to read the field as is, i.e., as if quotes are disabled. By setting quote="", the field is always read as if quotes are disabled. It is not expected to ever need to pass anything other than "" to quote; i.e., to turn it off.

strip.white

default is TRUE. Strips leading and trailing whitespaces of unquoted fields. If FALSE, only header trailing spaces are removed.

fill

logical (default is FALSE). If TRUE then in case the rows have unequal length, blank fields are implicitly filled.

blank.lines.skip

logical, default is FALSE. If TRUE blank lines in the input are ignored.

key

Character vector of one or more column names which is passed to setkey. It may be a single comma separated string such as key="x,y,z", or a vector of names such as key=c("x","y","z"). Only valid when argument data.table=TRUE. Where applicable, this should refer to column names given in col.names.
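
A minimal sketch of key on made-up data: the key is set while reading, so the result comes back sorted by that column.

```r
library(data.table)

# Rows are deliberately out of order in the input.
dt <- fread(text = c("id,val", "b,2", "a,1"), key = "id")

key(dt)
dt$id
```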

index

Character vector or list of character vectors of one or more column names which is passed to setindexv. As with key, comma-separated notation like index="x,y,z" is accepted for convenience. Only valid when argument data.table=TRUE. Where applicable, this should refer to column names given in col.names.

showProgress

TRUE displays progress on the console if the ETA is greater than 3 seconds. It is produced in fread's C code where the very nice (but R level) txtProgressBar and tkProgressBar are not easily available.

data.table

TRUE returns a data.table. FALSE returns a data.frame. The default for this argument can be changed with options(datatable.fread.datatable=FALSE).
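
The effect of the data.table argument on the return type can be checked with a sketch like this one, using made-up data.

```r
library(data.table)

csv <- c("a,b", "1,2")

as_dt <- fread(text = csv, data.table = TRUE)
as_df <- fread(text = csv, data.table = FALSE)

class(as_dt)
class(as_df)
```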

nThread

The number of threads to use. Experiment to see what works best for your data on your hardware.

logical01

If TRUE a column containing only 0s and 1s will be read as logical, otherwise as integer.

keepLeadingZeros

If TRUE a column containing numeric data with leading zeros will be read as character, otherwise leading zeros will be removed and converted to numeric.
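
The following sketch contrasts logical01 and keepLeadingZeros switched off and on, using made-up data. Both arguments are passed explicitly so that session options do not affect the outcome.

```r
library(data.table)

csv <- c("flag,zip", "0,01234", "1,00501")

# Both switches off: 0/1 is read as integer, leading zeros are dropped.
plain <- fread(text = csv, logical01 = FALSE, keepLeadingZeros = FALSE)

# Both switches on: 0/1 becomes logical, zip stays character.
tuned <- fread(text = csv, logical01 = TRUE, keepLeadingZeros = TRUE)

sapply(plain, class)
sapply(tuned, class)
```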

yaml

If TRUE, fread will attempt to parse (using yaml.load) the top of the input as YAML, and further to glean parameters relevant to improving the performance of fread on the data itself. The entire YAML section is returned as parsed into a list in the yaml_metadata attribute. See Details.

autostart

Deprecated and ignored with warning. Please use skip instead.

tmpdir

Directory to use as the tmpdir argument for any tempfile calls, e.g. when the input is a URL or a shell command. The default is tempdir() which can be controlled by setting TMPDIR before starting the R session; see base::tempdir.

tz

Relevant to datetime values which have no Z or UTC-offset at the end, i.e. unmarked datetime, as written by utils::write.csv. The default tz="UTC" reads unmarked datetime as UTC POSIXct efficiently. tz="" reads unmarked datetime as type character (slowly) so that as.POSIXct can interpret (slowly) the character datetimes in local timezone; e.g. by using "POSIXct" in colClasses=. Note that fwrite() by default writes datetime in UTC including the final Z and therefore fwrite's output will be read by fread consistently and quickly without needing to use tz= or colClasses=. If the TZ environment variable is set to "UTC" (or "" on non-Windows where unset vs "" is significant) then the R session's timezone is already UTC and tz="" will result in unmarked datetimes being read as UTC POSIXct. For more information, please see the news items from v1.13.0 and v1.14.0.
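
Reading unmarked datetimes as UTC might look like the sketch below. The data are made up, sep is set explicitly because the datetime values contain spaces, and the sketch assumes a data.table version that has the tz argument (v1.13.0 or later, per the news items mentioned above).

```r
library(data.table)

csv <- c(
  "game_id,kickoff",
  "2022_18_NYG_PHI,2023-01-08 13:00:00",
  "2022_18_DAL_WAS,2023-01-08 16:25:00"
)

# tz = "UTC" reads the unmarked datetimes as UTC POSIXct.
dt <- fread(text = csv, sep = ",", tz = "UTC")

class(dt$kickoff)
format(dt$kickoff, "%H:%M", tz = "UTC")
```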

Value

A data frame as created by data.table::fread().

Examples

# \donttest{
try({ # prevents cran errors
  csv_from_url("https://github.com/nflverse/nfldata/raw/master/data/games.csv")
})
#>               game_id season game_type week    gameday weekday gametime
#>    1: 1999_01_MIN_ATL   1999       REG    1 1999-09-12  Sunday         
#>    2:  1999_01_KC_CHI   1999       REG    1 1999-09-12  Sunday         
#>    3: 1999_01_PIT_CLE   1999       REG    1 1999-09-12  Sunday         
#>    4:  1999_01_OAK_GB   1999       REG    1 1999-09-12  Sunday         
#>    5: 1999_01_BUF_IND   1999       REG    1 1999-09-12  Sunday         
#>   ---                                                                  
#> 6405: 2022_18_NYG_PHI   2022       REG   18 2023-01-08  Sunday    13:00
#> 6406: 2022_18_CLE_PIT   2022       REG   18 2023-01-08  Sunday    13:00
#> 6407:  2022_18_LA_SEA   2022       REG   18 2023-01-08  Sunday    13:00
#> 6408:  2022_18_ARI_SF   2022       REG   18 2023-01-08  Sunday    13:00
#> 6409: 2022_18_DAL_WAS   2022       REG   18 2023-01-08  Sunday    13:00
#>       away_team away_score home_team home_score location result total overtime
#>    1:       MIN         17       ATL         14     Home     -3    31        0
#>    2:        KC         17       CHI         20     Home      3    37        0
#>    3:       PIT         43       CLE          0     Home    -43    43        0
#>    4:       OAK         24        GB         28     Home      4    52        0
#>    5:       BUF         14       IND         31     Home     17    45        0
#>   ---                                                                         
#> 6405:       NYG         NA       PHI         NA     Home     NA    NA       NA
#> 6406:       CLE         NA       PIT         NA     Home     NA    NA       NA
#> 6407:        LA         NA       SEA         NA     Home     NA    NA       NA
#> 6408:       ARI         NA        SF         NA     Home     NA    NA       NA
#> 6409:       DAL         NA       WAS         NA     Home     NA    NA       NA
#>       old_game_id gsis nfl_detail_id          pfr pff      espn away_rest
#>    1:  1999091210  598               199909120atl  NA 190912001         7
#>    2:  1999091206  597               199909120chi  NA 190912003         7
#>    3:  1999091213  604               199909120cle  NA 190912005         7
#>    4:  1999091208  602               199909120gnb  NA 190912009         7
#>    5:  1999091202  591               199909120clt  NA 190912011         7
#>   ---                                                                    
#> 6405:  2023010809   NA               202301080phi  NA        NA         7
#> 6406:  2023010810   NA               202301080pit  NA        NA         7
#> 6407:  2023010814   NA               202301080sea  NA        NA         7
#> 6408:  2023010815   NA               202301080sfo  NA        NA         7
#> 6409:  2023010811   NA               202301080was  NA        NA        10
#>       home_rest away_moneyline home_moneyline spread_line away_spread_odds
#>    1:         7             NA             NA        -4.0               NA
#>    2:         7             NA             NA        -3.0               NA
#>    3:         7             NA             NA        -6.0               NA
#>    4:         7             NA             NA         9.0               NA
#>    5:         7             NA             NA        -3.0               NA
#>   ---                                                                     
#> 6405:         7            185           -225         5.0             -110
#> 6406:         7             NA             NA          NA               NA
#> 6407:         7           -250            200        -5.5             -110
#> 6408:         7            135           -155         3.0              100
#> 6409:         7           -120            100        -1.0             -110
#>       home_spread_odds total_line under_odds over_odds div_game     roof
#>    1:               NA       49.0         NA        NA        0     dome
#>    2:               NA       38.0         NA        NA        0 outdoors
#>    3:               NA       37.0         NA        NA        1 outdoors
#>    4:               NA       43.0         NA        NA        0 outdoors
#>    5:               NA       45.5         NA        NA        1     dome
#>   ---                                                                   
#> 6405:             -110         NA         NA        NA        1 outdoors
#> 6406:               NA         NA         NA        NA        1 outdoors
#> 6407:             -110         NA         NA        NA        1 outdoors
#> 6408:             -120         NA         NA        NA        1 outdoors
#> 6409:             -110         NA         NA        NA        1 outdoors
#>         surface temp wind away_qb_id home_qb_id       away_qb_name
#>    1: astroturf   NA   NA 00-0003761 00-0002876 Randall Cunningham
#>    2:     grass   80   12 00-0006300 00-0010560        Elvis Grbac
#>    3:     grass   78   12 00-0015700 00-0004230    Kordell Stewart
#>    4:     grass   67   10 00-0005741 00-0005106        Rich Gannon
#>    5: astroturf   NA   NA 00-0005363 00-0010346        Doug Flutie
#>   ---                                                             
#> 6405:     grass   NA   NA                                         
#> 6406:     grass   NA   NA                                         
#> 6407: fieldturf   NA   NA                                         
#> 6408:     grass   NA   NA                                         
#> 6409:     grass   NA   NA                                         
#>         home_qb_name         away_coach    home_coach       referee stadium_id
#>    1: Chris Chandler       Dennis Green    Dan Reeves  Gerry Austin      ATL00
#>    2: Shane Matthews Gunther Cunningham   Dick Jauron  Phil Luckett      CHI98
#>    3:      Ty Detmer        Bill Cowher  Chris Palmer   Bob McElwee      CLE00
#>    4:    Brett Favre         Jon Gruden    Ray Rhodes Tony Corrente      GNB00
#>    5: Peyton Manning      Wade Phillips      Jim Mora      Ron Blum      IND99
#>   ---                                                                         
#> 6405:                      Brian Daboll Nick Sirianni                    PHI00
#> 6406:                   Kevin Stefanski   Mike Tomlin                    PIT00
#> 6407:                        Sean McVay  Pete Carroll                    SEA00
#> 6408:                   Kliff Kingsbury Kyle Shanahan                    SFO01
#> 6409:                     Mike McCarthy    Ron Rivera                    WAS00
#>                        stadium
#>    1:             Georgia Dome
#>    2:            Soldier Field
#>    3: Cleveland Browns Stadium
#>    4:            Lambeau Field
#>    5:                 RCA Dome
#>   ---                         
#> 6405:  Lincoln Financial Field
#> 6406:              Heinz Field
#> 6407:              Lumen Field
#> 6408:           Levi's Stadium
#> 6409:               FedExField
# }