The read.table function reads a file in table format and creates a data frame from it, with cases (rows) corresponding to lines and variables (columns) corresponding to fields in the file.
read.table(
  file,
  header = FALSE,
  sep = "",
  quote = "\"'",
  dec = ".",
  row.names,
  col.names,
  as.is = !stringsAsFactors,
  na.strings = "NA",
  colClasses = NA,
  nrows = -1,
  skip = 0,
  check.names = TRUE,
  fill = !blank.lines.skip,
  strip.white = FALSE,
  blank.lines.skip = TRUE,
  comment.char = "#",
  allowEscapes = FALSE,
  flush = FALSE,
  stringsAsFactors = default.stringsAsFactors(),
  fileEncoding = "",
  encoding = "unknown",
  text
)
where:
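By default, read.table will skip lines that begin with the comment character (#), work out how many rows the file has, and infer the class of each column. Supplying these values yourself through comment.char, nrows, and colClasses lets R skip that work, which makes reading large files faster and lighter on memory.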
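As a minimal sketch of supplying these values up front, the example below reads a small table through the text argument and passes comment.char, nrows, and colClasses explicitly. The sample data and the variable names rawText and myDataTable are made up for illustration.

```r
# Hypothetical sample data, supplied inline via the text argument
rawText <- "id value label
1 2.5 a
2 3.0 b
3 4.5 c"

myDataTable <- read.table(
  text = rawText,
  header = TRUE,
  comment.char = "",   # the data has no comment lines, so turn the check off
  nrows = 3,           # number of data rows, known in advance
  colClasses = c("integer", "numeric", "character")
)

# Inspect the resulting column classes
str(myDataTable)
```

With a real file, the same arguments would be passed alongside the file path instead of text.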
The dataset must not be larger than the amount of RAM available.
For example, 1,000,000 rows and 10 columns of numeric data take 1,000,000 * 10 * 8 bytes = 80,000,000 bytes, or about 76 MiB.
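The same arithmetic can be checked in R. This is a rough sketch that assumes every column is stored as a double (8 bytes per value) and ignores the small per-object overhead of a data frame; the variable names are illustrative.

```r
# Rough size estimate for an all-numeric table (8 bytes per double)
nRows <- 1e6
nCols <- 10
bytes <- nRows * nCols * 8
mebibytes <- bytes / 2^20
round(mebibytes, 1)   # about 76.3

# Cross-check: measure one real numeric column and scale by the column count
oneColumn <- numeric(nRows)
as.numeric(object.size(oneColumn)) * nCols / 2^20
```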
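If every column is numeric, you can simply set colClasses = "numeric"; the single class name is recycled across all columns.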
To figure out the class of each column, you can use this snippet:
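# Read only the first 100 rows to infer the column classes cheaply
mySubsetDataTable <- read.table("myFile.txt", nrows = 100)
classes <- sapply(mySubsetDataTable, class)
# Reuse the inferred classes when reading the whole file
myDataTable <- read.table("myFile.txt", colClasses = classes)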
To find the number of lines in a file, see the Linux tool wc (for example, wc -l myFile.txt).
Setting nrows, even as a mild overestimate, will help with memory usage.
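Where wc is not available, the line count can also be obtained from within R. The sketch below is one possible approach, not a built-in facility: countLines is a hypothetical helper defined here, and the temporary file stands in for a real data file.

```r
# Write a small temporary file so the example is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("x y", "1 2", "3 4", "5 6"), path)

# Hypothetical helper: count lines in chunks so the whole file
# is never held in memory at once
countLines <- function(path, chunkSize = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    got <- length(readLines(con, n = chunkSize))
    if (got == 0L) break
    total <- total + got
  }
  total
}

nLines <- countLines(path)

# The count includes the header line, so it slightly overestimates
# the number of data rows, which is acceptable for nrows
myDataTable <- read.table(path, header = TRUE, nrows = nLines)
```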