Reading a text file and splitting it into single words in Python
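I have a text file made up of numbers and words, for example:

    09807754 18 n 03 aristocrat 0 blue_blood 0 patrician

I want to split it so that each word or number appears on its own line.
A whitespace separator would be ideal, since I want words joined by dashes to stay connected.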
This is what I have so far:

    f = open('words.txt', 'r')
    for word in f:
        print(word)
I'm not sure where to go from here. This is the output I would like:

    09807754
    18
    n
    3
    aristocrat
    ...
If your data is not quoted:

    with open('words.txt', 'r') as f:
        for line in f:
            for word in line.split():
                print(word)
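To make the behaviour of `split()` concrete, here is a small self-contained sketch that applies it to the sample line from the question instead of a file. The variable names `line` and `tokens` are only illustrative. Called with no arguments, `split()` breaks on any run of whitespace, so a token such as `blue_blood` stays in one piece.

```python
# Sample line taken from the question, with a trailing newline as it would
# appear when iterating over a file.
line = "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician\n"

# split() with no arguments splits on any run of whitespace and drops
# leading and trailing whitespace, including the newline.
tokens = line.split()

for token in tokens:
    print(token)
```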
If you want a nested list of the words on each line of the file:

    with open("words.txt") as f:
        words_by_line = [line.split() for line in f]
Or, if you want to flatten that into a single list of all the words in the file, you can do this:

    with open("words.txt") as f:
        words = [word for line in f for word in line.split()]
If you want a regex solution:

    import re

    with open("words.txt") as f:
        for line in f:
            for word in re.findall(r'\w+', line):
                # word by word
                print(word)
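Or, if you want that as a generator with a regex, reading line by line:

    with open("words.txt") as f:
        words = (word for line in f for word in re.findall(r'\w+', line))
        for word in words:  # consume while the file is still open
            print(word)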
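One caveat with the regex examples above: `\w+` matches only letters, digits and underscores, so it keeps `blue_blood` together but splits a hyphenated token. Since the question asks for dashed words to stay connected, `\S+` (any run of non-whitespace) may be the closer fit. The sketch below compares the two; the token `well-known` is a made-up addition to the sample line, used only to show the difference.

```python
import re

# Sample line from the question plus a hypothetical hyphenated token.
line = "09807754 18 n 03 blue_blood well-known 0 patrician"

# \w+ matches letters, digits and underscores, so the hyphen splits a token.
word_chars = re.findall(r"\w+", line)

# \S+ matches any run of non-whitespace characters, which gives the same
# tokens as line.split().
non_space = re.findall(r"\S+", line)

print(word_chars)
print(non_space)
```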
    with open('words.txt') as f:
        for word in f.read().split():
            print(word)
As a supplement: if you are reading a very large file and don't want to read everything into memory at once, you could use a buffer and return each word with yield:

    def read_words(inputfile):
        with open(inputfile, 'r') as f:
            while True:
                buf = f.read(10240)
                if not buf:
                    break

                # make sure we end on a space (word boundary)
                while not str.isspace(buf[-1]):
                    ch = f.read(1)
                    if not ch:
                        break
                    buf += ch

                words = buf.split()
                for word in words:
                    yield word

    if __name__ == "__main__":
        for word in read_words('./very_large_file.txt'):
            print(word)  # replace with your own processing
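The function above tops up the buffer one character at a time until it reaches whitespace. A possible variation, sketched here and not taken from the answer, holds back a trailing partial word and prepends it to the next chunk instead. The names `iter_words` and `chunk_size` are hypothetical, and the sketch assumes it is given an already opened text stream.

```python
import io


def iter_words(stream, chunk_size=10240):
    """Yield whitespace-separated words from a text stream, chunk by chunk."""
    leftover = ""  # partial word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunk = leftover + chunk
        words = chunk.split()
        # If the chunk does not end in whitespace, its last word may be cut
        # off, so hold it back until the next chunk arrives.
        if words and not chunk[-1].isspace():
            leftover = words.pop()
        else:
            leftover = ""
        for word in words:
            yield word
    # Whatever is still held back at end of input is a complete word.
    if leftover:
        yield leftover


if __name__ == "__main__":
    sample = io.StringIO("09807754 18 n 03 aristocrat 0 blue_blood 0 patrician\n")
    # A tiny chunk size forces words to straddle chunk boundaries.
    for word in iter_words(sample, chunk_size=4):
        print(word)
```

An empty input simply produces no words, so no sentinel value is needed.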
Here is my entirely functional-style approach, which avoids having to read and split lines. It uses the itertools module:
Note: for Python 3, replace itertools.imap with map.
    import itertools

    def readwords(mfile):
        byte_stream = itertools.groupby(
            itertools.takewhile(lambda c: bool(c),
                                itertools.imap(mfile.read, itertools.repeat(1))),
            str.isspace)

        return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:

    >>> import sys
    >>> for w in readwords(sys.stdin):
    ...     print (w)
    ...
    I really love this new method of reading words in python
    I
    really
    love
    this
    new
    method
    of
    reading
    words
    in
    python
    It's soo very Functional!
    It's
    soo
    very
    Functional!
    >>>
I guess in your case, this would be the way to use the function:

    with open('words.txt', 'r') as f:
        for word in readwords(f):
            print(word)
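For reference, here is a sketch of how the same function might look on Python 3 once the substitution from the note is applied (the built-in `map` in place of `itertools.imap`). The name `readwords_py3` is made up to keep it apart from the original, and an in-memory `io.StringIO` stands in for a real file so the example is self-contained.

```python
import io
import itertools


def readwords_py3(mfile):
    # In Python 3 the built-in map is lazy, so it can consume the infinite
    # repeat(1) and call mfile.read(1) once per character on demand.
    chars = itertools.takewhile(bool, map(mfile.read, itertools.repeat(1)))
    # Group consecutive characters by whether they are whitespace.
    groups = itertools.groupby(chars, str.isspace)
    # Keep only the non-whitespace groups and join each one into a word.
    return ("".join(group) for is_space, group in groups if not is_space)


if __name__ == "__main__":
    sample = io.StringIO("09807754 18 n 03 aristocrat 0 blue_blood 0 patrician\n")
    for word in readwords_py3(sample):
        print(word)
```

Reading one character per call keeps memory use minimal, though it is likely slower than reading in larger chunks.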