I've been working on a project which requires that I read random lines from a set of files. Unfortunately these files range between 200MB and 2GB so reading the entire file into memory is incredibly slow if possible at all. I looked around for a simple way to do this with large files and I was unable to find a good one, so here you go:
#!/usr/bin/python
import os,random
filename="averylargefile"
file = open(filename,'r')
#Get the total file size
file_size = os.stat(filename)[6]
while 1:
#Seek to a place in the file which is a random distance away
#Mod by file size so that it wraps around to the beginning
file.seek((file.tell()+random.randint(0,file_size-1))%file_size)
#dont use the first readline since it may fall in the middle of a line
file.readline()
#this will return the next (complete) line from the file
line = file.readline()
#here is your random line in the file
print line
Using this method has shown to be MUCH faster then doing something along the lines of:
for line in file:
#do something
and furthermore, this method does not make it easy to get a random entry in the file.
Enjoy!
Update: Just a note on the "file.seek" line above, the "file.tell()" call is not strictly necessary since what line you are on currently has no bearing on the next line you will be going to. That would also remove the need for the mod operator since the randint would never be larger than the file. Furthermore, since you are advancing the file pointer to a random spot in the file, if all lines are not an equal length then you will not get true randomness, longer lines will in fact be more likely to be hit. Since, however, the algorithm takes the line after random spot seeked to in the file, this issue is lessened but not resolved. I'm looking for a better and more efficient solution in the meantime.
Read a random line in a (large) file in Python
Posted by
Jonathan Kupferman
on Tuesday, November 4, 2008
Search
Labels
- (14) microsoft
- (11) Google
- (9) Amazon
- (9) cloud computing
- (8) windows
- (6) apple
- (5) amazon ec2
- (5) linux
- (4) Rails
- (4) Ruby on Rails
- (4) advertising
- (4) design
- (4) laptop
- (4) python
- (4) web performance
- (3) Blog
- (3) Caching
- (3) Facebook
- (3) Gmail
- (3) Internet Explorer
- (3) Kindle
- (3) Performance
- (3) Platform-as-a-Service
- (3) RSS
- (3) RSS Readers
- (3) Rackspace
- (3) Ruby
- (3) Seinfeld
- (3) Yahoo
- (3) web design
- (2) Andriod
- (2) Azure
- (2) Bloggers
- (2) CSS
- (2) Chrome
- (2) College
- (2) Downtime
- (2) Firefox
- (2) G1
- (2) Gates
- (2) Google Andriod
- (2) Google Chrome
- (2) HTML
- (2) Jeff Bezos
- (2) Memcached
- (2) MySQL
- (2) Relational Database
- (2) Slashdot
- (2) Software
- (2) Twitter
- (2) Website Performance
- (2) bing
- (2) data storage
- (2) dell
- (2) google app
- (2) iPhone
- (2) interview questions
- (2) ipod
- (2) programming
- (2) search engine
- (2) torrent
- (2) usability
- (2) vista
- (2) web applications
3 comments:
Hmm nice, I'll try it next time when I have some free time.
Jon,
Try the following a la a question Adam had from a google interview:
----
#!/usr/bin/env python
import random, sys
file = open(sys.argv[1])
count = 0
to_print = cur = file.readline()
while cur:
cur = file.readline()
count += 1
if random.randint(0,count) == 0: # 1/count chance to replace
to_print = cur
print to_print[:-1] # Remove new line at end
---
This guarantees each line has equal probability. For instance the probability that the first line is chosen is:
P(0) = 1/1 * 1/2 * 2/3 * 3/4 * ... * (n-1)/n
which if I'm not mistaken results in 1/n.
If you wanted to do random lines many times you could either repeat this multiple times, or alternatively keep x separate random lines each of which you would calculate a different random value for. To slightly optimize that rather than storing the line in memory you could just store the seek value of the start of the line.
Thank's a lot for this clever code!
Here's a little modification to enhance it a bit, because it will never read the first line of the file since you always skip the first read line:
#dont use the first readline since it may fall in the middle of a line UNLESS we are at the beginning of the file
if file.tell() > 0:
file.readline()
Post a Comment