The read.table function reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
           check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#", allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text)
- file: the name of a file, a URL, or a connection.
- header: indicates whether the file has a header line.
- sep: a string indicating how the columns are separated.
- colClasses: a character vector indicating the class of each column in the dataset.
- nrows: the number of rows in the dataset.
- comment.char: a character string indicating the comment character.
- skip: the number of lines to skip from the beginning.
- stringsAsFactors: should character variables be coded as factors?
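A minimal sketch of how these arguments fit together; the file name mydata.txt and its contents are made up for illustration (the snippet writes the file first so it is self-contained):

```r
# Assumed example: a tab-separated file with one comment line and a header.
writeLines(c("# a comment line",
             "id\tscore",
             "1\t3.5",
             "2\t4.2"), "mydata.txt")

df <- read.table("mydata.txt",
                 header = TRUE,         # first non-comment line holds column names
                 sep = "\t",            # fields are tab-separated
                 comment.char = "#",    # lines starting with "#" are skipped
                 stringsAsFactors = FALSE)  # keep strings as character vectors
```

The result is a 2-row, 2-column data frame with columns id and score.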
By default, read.table will:
- figure out colClasses (what type of variable is in each column of the table)
- check whether each line is a comment, using comment.char (comment.char = "" disables this check)
Supplying these parameters yourself makes R run faster, since it does not need to work them out on its own.
The dataset must not be larger than the amount of RAM on your machine.
For example, 1,000,000 rows and 10 columns of numeric data (colClasses = "numeric") take 1,000,000 * 10 * 8 bytes = 80,000,000 bytes, roughly 76 MiB.
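The arithmetic above can be checked directly, assuming 8 bytes per double-precision numeric value:

```r
rows <- 1e6
cols <- 10
bytes <- rows * cols * 8   # 8 bytes per double -> 80,000,000 bytes
mib <- bytes / 2^20        # convert to MiB; roughly 76.3
```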
To figure out the classes of each column, you can use this snippet:
mySubsetDataTable <- read.table("myFile.txt", nrows = 100)
classes <- sapply(mySubsetDataTable, class)
myDataTable <- read.table("myFile.txt", colClasses = classes)
The Unix tool wc can be used to calculate the number of lines in a file.
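For instance (a small stand-in file is created here so the command is runnable; the real file name would be whatever you pass to read.table):

```shell
# Create a small stand-in data file (contents are made up).
printf 'a b\n1 2\n3 4\n' > myFile.txt
# wc -l prints the number of lines; derive nrows from it
# (subtract 1 if the file has a header line).
wc -l < myFile.txt
```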
Setting nrows will help with memory usage.