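Reading a large text file with 70 million lines in Java

I have a large test file with 70 million lines of text. I have to read the file line by line.

I used two different approaches: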

    InputStreamReader isr = new InputStreamReader(new FileInputStream(FilePath), "unicode");
    BufferedReader br = new BufferedReader(isr);
    while ((cur = br.readLine()) != null);

and

    LineIterator it = FileUtils.lineIterator(new File(FilePath), "unicode");
    while (it.hasNext())
        cur = it.nextLine();

Is there another way to make this task faster?

Best regards,

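1) I believe there is no difference in speed; both use FileInputStream internally, with buffering.

2) You can take measurements and see for yourself.

3) Although there is no performance benefit, I like the Java 1.7 approach: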

    try (BufferedReader br = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
        for (String line = null; (line = br.readLine()) != null;) {
            //
        }
    }

4) A Scanner-based version:

    try (Scanner sc = new Scanner(new File("test.txt"), "UTF-8")) {
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
        }
        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    }

5) This may be faster than the rest:

    try (SeekableByteChannel ch = Files.newByteChannel(Paths.get("test.txt"))) {
        ByteBuffer bb = ByteBuffer.allocateDirect(1000);
        for (;;) {
            StringBuilder line = new StringBuilder();
            int n = ch.read(bb);
            // add chars to line
            // ...
        }
    }

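It requires some coding, but it can be faster because of ByteBuffer.allocateDirect. A direct buffer lets the OS read bytes from the file straight into the ByteBuffer, without an intermediate copy.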
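One way the "add chars to line" step could be filled in is sketched below. It is only an illustration, not the answerer's code: the class name `ChannelLineReader`, the 64 KB buffer size, and the choice of UTF-8 are assumptions. It collects the bytes of each line and decodes them when a `\n` byte is reached, which is safe for UTF-8 but not for UTF-16 ("unicode") input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper: reads a UTF-8 file through a direct ByteBuffer
// and hands each line to a callback.
class ChannelLineReader {

    static void forEachLine(Path path, Consumer<String> handler) throws IOException {
        try (SeekableByteChannel ch = Files.newByteChannel(path)) {
            ByteBuffer bb = ByteBuffer.allocateDirect(64 * 1024); // assumed buffer size
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            while (ch.read(bb) != -1) {
                bb.flip(); // switch the buffer from filling to draining
                while (bb.hasRemaining()) {
                    byte b = bb.get();
                    if (b == '\n') {
                        emit(line, handler);
                    } else {
                        line.write(b);
                    }
                }
                bb.clear(); // ready for the next read
            }
            // A final line without a trailing newline is still delivered.
            if (line.size() > 0) {
                emit(line, handler);
            }
        }
    }

    // Decodes the collected bytes, dropping a trailing '\r' from "\r\n" endings.
    private static void emit(ByteArrayOutputStream line, Consumer<String> handler) {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--;
        }
        handler.accept(new String(bytes, 0, len, StandardCharsets.UTF_8));
        line.reset();
    }
}
```

A call such as `ChannelLineReader.forEachLine(Paths.get("test.txt"), line -> { /* process */ })` would drive it. Whether this beats a plain BufferedReader depends on the file and the JVM, so it is worth measuring before adopting it.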

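6) Parallel processing would definitely increase speed. Create a big byte buffer, run several tasks that read bytes from the file into that buffer in parallel, and when they are done find the first end of line, make a String, find the next...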
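A simplified variation on point 6 is sketched below. Instead of filling one shared buffer, each task reads its own byte range of the file with a positional `FileChannel.read` and counts `\n` bytes, so no line has to be stitched together across chunk boundaries. The class name `ParallelNewlineCounter`, the per-task 64 KB buffer, and the task count are assumptions, and the byte test assumes UTF-8 or a single-byte encoding.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: counts newline bytes in a file using several tasks.
class ParallelNewlineCounter {

    static long countNewlines(Path path, int tasks)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            long chunk = (size + tasks - 1) / tasks; // ceiling division
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                final long start = Math.min(size, t * chunk);
                final long end = Math.min(size, start + chunk);
                Callable<Long> task = () -> countRange(ch, start, end);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get(); // waits for each task to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Counts '\n' bytes in [start, end). Positional reads do not move the
    // channel's own position, so tasks can share one FileChannel.
    private static long countRange(FileChannel ch, long start, long end) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // assumed buffer size
        long count = 0;
        long pos = start;
        while (pos < end) {
            buf.clear();
            buf.limit((int) Math.min(buf.capacity(), end - pos));
            int n = ch.read(buf, pos);
            if (n <= 0) {
                break;
            }
            buf.flip();
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    count++;
                }
            }
            pos += n;
        }
        return count;
    }
}
```

Turning the ranges into actual Strings needs extra work, because a line can straddle two chunks. On a single spinning disk, parallel reads may also give little or no gain.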

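If you are looking at performance, you could have a look at the java.nio.* packages; those are supposedly faster than java.io.*.

There is an article that benchmarks different ways of reading files. It should help you find the best solution.

Article: Java tip: How to read files quickly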

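I had a similar problem, but I only needed the bytes from the file. I read through the links provided in the various answers, and ultimately tried writing something similar to #5 in Evgeniy's answer. They weren't kidding, it took a lot of code.

The basic premise is that each line of text is of unknown length. I start with a SeekableByteChannel, read data into a ByteBuffer, then loop over it looking for EOL. When a record is a "carry-over" between loop iterations, the code increments a counter and eventually moves the SeekableByteChannel position back and reads the whole record in one go.

It is verbose... but it works. It was plenty fast for what I needed, but I'm sure there are more improvements that can be made.

The process method is stripped down to the basics needed to kick off reading the file.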

    import static java.nio.file.StandardOpenOption.READ;

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SeekableByteChannel;
    import java.nio.file.Files;
    import java.util.EnumSet;

    // FILE (a Path) and isWindowsOS() are not shown in this excerpt.
    private long startOffset;
    private long endOffset;
    private SeekableByteChannel sbc;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(1024);

    public void process() throws IOException {
        startOffset = 0;
        sbc = Files.newByteChannel(FILE, EnumSet.of(READ));
        byte[] message = null;
        while ((message = readRecord()) != null) {
            // do something
        }
    }

    public byte[] readRecord() throws IOException {
        endOffset = startOffset;
        boolean eol = false;
        boolean carryOver = false;
        byte[] record = null;

        while (!eol) {
            byte data;
            buffer.clear();
            final int bytesRead = sbc.read(buffer);
            if (bytesRead == -1) {
                return null;
            }
            buffer.flip();

            for (int i = 0; i < bytesRead && !eol; i++) {
                data = buffer.get();
                if (data == '\r' || data == '\n') {
                    eol = true;
                    endOffset += i;

                    if (carryOver) {
                        // The record spans several buffers: re-read it in one piece
                        final int messageSize = (int) (endOffset - startOffset);
                        sbc.position(startOffset);
                        final ByteBuffer tempBuffer = ByteBuffer.allocateDirect(messageSize);
                        sbc.read(tempBuffer);
                        tempBuffer.flip();
                        record = new byte[messageSize];
                        tempBuffer.get(record);
                    } else {
                        record = new byte[i];
                        // Need to move the buffer position back since the get moved it forward
                        buffer.position(0);
                        buffer.get(record, 0, i);
                    }

                    // Skip past the newline characters
                    if (isWindowsOS()) {
                        startOffset = (endOffset + 2);
                    } else {
                        startOffset = (endOffset + 1);
                    }

                    // Move the file position back
                    sbc.position(startOffset);
                }
            }

            if (!eol && sbc.position() == sbc.size()) {
                // We have hit the end of the file, just take all the bytes
                record = new byte[bytesRead];
                eol = true;
                buffer.position(0);
                buffer.get(record, 0, bytesRead);
            } else if (!eol) {
                // The EOL marker wasn't found, continue the loop
                carryOver = true;
                endOffset += bytesRead;
            }
        }

        // System.out.println(new String(record));
        return record;
    }

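This article is a good place to start.

In addition, you should create test cases in which you first read 10k lines (or some other number, as long as it is not too small) and measure the read time accordingly.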
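A rough way to take such a measurement is sketched below. The class name `SampleReadTiming`, the default file name `test.txt`, the UTF-8 charset, and the five repetitions are assumptions made for illustration. Repeating the run matters because the first pass usually includes JIT warm-up and a cold OS file cache.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical timing harness for reading the first N lines of a file.
class SampleReadTiming {

    // Reads at most maxLines lines and returns how many were actually read.
    static int readFirstLines(Path path, int maxLines) throws IOException {
        int read = 0;
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            while (read < maxLines && br.readLine() != null) {
                read++;
            }
        }
        return read;
    }

    public static void main(String[] args) throws IOException {
        // File name is a placeholder; pass the real path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "test.txt");
        int sampleSize = 10_000;
        for (int run = 1; run <= 5; run++) {
            long start = System.nanoTime();
            int lines = readFirstLines(path, sampleSize);
            long elapsedMicros = (System.nanoTime() - start) / 1_000;
            System.out.printf("run %d: %d lines in %d us%n", run, lines, elapsedMicros);
        }
    }
}
```

The same loop can be pointed at any of the other reading approaches to compare them on identical input. For more trustworthy numbers, a dedicated benchmarking harness is the safer choice.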

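Threading might be a good way to go, but it is important that we know what you are going to do with the data.

Another thing to consider is how you are going to store data of that size.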
