如何有效地打开一个巨大的Excel文件
我有一个150MB的单页excel文件,需要大约7分钟,打开一个非常强大的机器上使用以下内容:
# using python import xlrd wb = xlrd.open_workbook(file) sh = wb.sheet_by_index(0)
有没有办法更快地打开Excel文件? 我打开甚至非常古怪的build议(如hadoop,火花,c,java等)。 理想情况下,我正在寻找一种方法来在30秒内打开文件,如果这不是一个梦想。 另外,上面的例子是使用python,但它不一定是python。
注意:这是来自客户端的Excel文件。 在我们收到它之前,它不能被转换成任何其他的格式。 这不是我们的文件
更新:回答一个代码的工作示例,将在30秒内打开以下200MB的Excel文件将奖励赏金: https : //drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view? usp =共享 。 这个文件应该有string(col 1),date(col 9)和数字(col 11)。
那么,如果你的excel会像你的例子一样简单的CSV文件( https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing ),你可以尝试打开文件为一个zip文件,并直接读取每个XML:
英特尔i5 4460,12GB内存,SSD三星EVO PRO。
如果你有很多的内存ram:这段代码需要很多内存,但需要20〜25秒。 (您需要参数-Xmx7g)
package com.devsaki.opensimpleexcel; import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintWriter; import java.nio.charset.Charset; import java.time.LocalDateTime; import java.time.format.DateTimeFormatter; import java.util.concurrent.ExecutionException; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.zip.ZipFile; public class Multithread { public static final char CHAR_END = (char) -1; public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx"; ZipFile zipFile = new ZipFile(excelFile); long init = System.currentTimeMillis(); ExecutorService executor = Executors.newFixedThreadPool(4); char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray(); Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor)); char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray(); Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString))); Object[][] sheet = futureSheet1.get(); String[] words = futureWords.get(); executor.shutdown(); long end = System.currentTimeMillis(); System.out.println("only read: " + (end - init) / 1000); ///Doing somethin with the file::Saving as csv init = System.currentTimeMillis(); try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) { for (Object[] rows : sheet) { for (Object cell : rows) { if (cell != null) { if (cell instanceof Integer) { writer.append(words[(Integer) cell]); } else if (cell instanceof String) { writer.append(toDate(Double.parseDouble(cell.toString()))); } else { writer.append(cell.toString()); //Probably a number } } writer.append(";"); } writer.append("\n"); } } end = System.currentTimeMillis(); System.out.println("Main saving to csv: " + (end - init) / 1000); } private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME; private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2); //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date public static String toDate(double s) { return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600)))); } public static String readEntry(ZipFile zipFile, String entry) throws IOException { System.out.println("Initialing readEntry " + entry); long init = System.currentTimeMillis(); String result = null; try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) { br.readLine(); result = br.readLine(); } long end = System.currentTimeMillis(); System.out.println("readEntry '" + entry + "': " + (end - init) / 1000); return result; } public static String[] processSharedStrings(CharReader br) throws IOException { System.out.println("Initialing processSharedStrings"); long init = System.currentTimeMillis(); String[] words = null; char[] wordCount = "Count=\"".toCharArray(); char[] token = "<t>".toCharArray(); String uniqueCount = extractNextValue(br, wordCount, '"'); words = new String[Integer.parseInt(uniqueCount)]; String nextWord; int currentIndex = 0; while ((nextWord = extractNextValue(br, token, '<')) != null) { words[currentIndex++] = nextWord; br.skip(11); //you can skip at least 11 chars "/t></si><si>" } long end = System.currentTimeMillis(); System.out.println("SharedStrings: " + (end - init) / 1000); return words; } public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException { System.out.println("Initialing processSheet1"); long init = System.currentTimeMillis(); char[] dimensionToken = "dimension ref=\"".toCharArray(); String dimension = extractNextValue(br, dimensionToken, '"'); int[] sizes = extractSizeFromDimention(dimension.split(":")[1]); br.skip(30); //Between dimension and next tag c exists more or less 30 chars Object[][] result = new Object[sizes[0]][sizes[1]]; int parallelProcess = 8; int currentIndex = br.currentIndex; CharReader[] charReaders = new CharReader[parallelProcess]; int totalChars = Math.round(br.chars.length / parallelProcess); for (int i = 0; i < parallelProcess; i++) { int endIndex = currentIndex + totalChars; charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i); currentIndex = endIndex; } Future[] futures = new Future[parallelProcess]; for (int i = charReaders.length - 1; i >= 0; i--) { final int j = i; futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result)); } for (Future future : futures) { future.get(); } long end = System.currentTimeMillis(); System.out.println("Sheet1: " + (end - init) / 1000); return result; } public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) { System.out.println("Initialing inParallelProcess : " + br.identifier); char[] tokenOpenC = "<cr=\"".toCharArray(); char[] tokenOpenV = "<v>".toCharArray(); char[] tokenAttributS = " s=\"".toCharArray(); char[] tokenAttributT = " t=\"".toCharArray(); String v; int firstCurrentIndex = br.currentIndex; boolean first = true; while ((v = extractNextValue(br, tokenOpenC, '"')) != null) { if (first && back != null) { int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1; first = false; System.out.println("Adding to : " + back.identifier + " From : " + br.identifier); back.plusLength(sum); } int[] indexes = extractSizeFromDimention(v); int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT); char type = 's'; //3 types: number (n), string (s) and date (d) if (s == 0) { // Token S = number or date char read = br.read(); if (read == '1') { type = 'n'; } else { type = 'd'; } } else if (s == -1) { type = 'n'; } String c = extractNextValue(br, tokenOpenV, '<'); Object value = null; switch (type) { case 'n': value = Double.parseDouble(c); break; case 's': try { value = Integer.parseInt(c); } catch (Exception ex) { System.out.println("Identifier Error : " + br.identifier); } break; case 'd': value = c.toString(); break; } result[indexes[0] - 1][indexes[1] - 1] = value; br.skip(7); ///v></c> } } static class CharReader { char[] chars; int currentIndex; int length; int identifier; public CharReader(char[] chars) { this.chars = chars; this.length = chars.length; } public CharReader(char[] chars, int currentIndex, int length, int identifier) { this.chars = chars; this.currentIndex = currentIndex; if (length > chars.length) { this.length = chars.length; } else { this.length = length; } this.identifier = identifier; } public void plusLength(int n) { if (this.length + n <= chars.length) { this.length += n; } } public char read() { if (currentIndex >= length) { return CHAR_END; } return chars[currentIndex++]; } public void skip(int n) { currentIndex += n; } } public static int[] extractSizeFromDimention(String dimention) { StringBuilder sb = new StringBuilder(); int columns = 0; int rows = 0; for (char c : dimention.toCharArray()) { if (columns == 0) { if (Character.isDigit(c)) { columns = convertExcelIndex(sb.toString()); sb = new StringBuilder(); } } sb.append(c); } rows = Integer.parseInt(sb.toString()); return new int[]{rows, columns}; } public static int foundNextTokens(CharReader br, char until, char[]... tokens) { char character; int[] indexes = new int[tokens.length]; while ((character = br.read()) != CHAR_END) { if (character == until) { break; } for (int i = 0; i < indexes.length; i++) { if (tokens[i][indexes[i]] == character) { indexes[i]++; if (indexes[i] == tokens[i].length) { return i; } } else { indexes[i] = 0; } } } return -1; } public static String extractNextValue(CharReader br, char[] token, char until) { char character; StringBuilder sb = new StringBuilder(); int index = 0; while ((character = br.read()) != CHAR_END) { if (index == token.length) { if (character == until) { return sb.toString(); } else { sb.append(character); } } else { if (token[index] == character) { index++; } else { index = 0; } } } return null; } public static int convertExcelIndex(String index) { int result = 0; for (char c : index.toCharArray()) { result = result * 26 + ((int) c - (int) 'A' + 1); } return result; } }
旧的答案(不需要参数Xms7g,所以占用较less的内存):需要用HDD打开和读取约35秒(200MB)的示例文件,SDD需要less一点(30秒)。
这里的代码: https : //github.com/csaki/OpenSimpleExcelFast.git
import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintWriter; import java.nio.charset.Charset; import java.time.LocalDateTime; import java.time.format.DateTimeFormatter; import java.util.concurrent.ExecutionException; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.zip.ZipFile; public class Launcher { public static final char CHAR_END = (char) -1; public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { long init = System.currentTimeMillis(); String excelFile = "D:/Downloads/BigSpreadsheet.xlsx"; ZipFile zipFile = new ZipFile(excelFile); ExecutorService executor = Executors.newFixedThreadPool(4); Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile)); Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile)); String[] words = futureWords.get(); Object[][] sheet1 = futureSheet1.get(); executor.shutdown(); long end = System.currentTimeMillis(); System.out.println("Main only open and read: " + (end - init) / 1000); ///Doing somethin with the file::Saving as csv init = System.currentTimeMillis(); try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) { for (Object[] rows : sheet1) { for (Object cell : rows) { if (cell != null) { if (cell instanceof Integer) { writer.append(words[(Integer) cell]); } else if (cell instanceof String) { writer.append(toDate(Double.parseDouble(cell.toString()))); } else { writer.append(cell.toString()); //Probably a number } } writer.append(";"); } writer.append("\n"); } } end = System.currentTimeMillis(); System.out.println("Main saving to csv: " + (end - init) / 1000); } private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME; private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2); //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date public static String toDate(double s) { return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600)))); } public static Object[][] processSheet1(ZipFile zipFile) throws IOException { String entry = "xl/worksheets/sheet1.xml"; Object[][] result = null; char[] dimensionToken = "dimension ref=\"".toCharArray(); char[] tokenOpenC = "<cr=\"".toCharArray(); char[] tokenOpenV = "<v>".toCharArray(); char[] tokenAttributS = " s=\"".toCharArray(); char[] tokenAttributT = " t=\"".toCharArray(); try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) { String dimension = extractNextValue(br, dimensionToken, '"'); int[] sizes = extractSizeFromDimention(dimension.split(":")[1]); br.skip(30); //Between dimension and next tag c exists more or less 30 chars result = new Object[sizes[0]][sizes[1]]; String v; while ((v = extractNextValue(br, tokenOpenC, '"')) != null) { int[] indexes = extractSizeFromDimention(v); int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT); char type = 's'; //3 types: number (n), string (s) and date (d) if (s == 0) { // Token S = number or date char read = (char) br.read(); if (read == '1') { type = 'n'; } else { type = 'd'; } } else if (s == -1) { type = 'n'; } String c = extractNextValue(br, tokenOpenV, '<'); Object value = null; switch (type) { case 'n': value = Double.parseDouble(c); break; case 's': value = Integer.parseInt(c); break; case 'd': value = c.toString(); break; } result[indexes[0] - 1][indexes[1] - 1] = value; br.skip(7); ///v></c> } } return result; } public static int[] extractSizeFromDimention(String dimention) { StringBuilder sb = new StringBuilder(); int columns = 0; int rows = 0; for (char c : dimention.toCharArray()) { if (columns == 0) { if (Character.isDigit(c)) { columns = convertExcelIndex(sb.toString()); sb = new StringBuilder(); } } sb.append(c); } rows = Integer.parseInt(sb.toString()); return new int[]{rows, columns}; } public static String[] processSharedStrings(ZipFile zipFile) throws IOException { String entry = "xl/sharedStrings.xml"; String[] words = null; char[] wordCount = "Count=\"".toCharArray(); char[] token = "<t>".toCharArray(); try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) { String uniqueCount = extractNextValue(br, wordCount, '"'); words = new String[Integer.parseInt(uniqueCount)]; String nextWord; int currentIndex = 0; while ((nextWord = extractNextValue(br, token, '<')) != null) { words[currentIndex++] = nextWord; br.skip(11); //you can skip at least 11 chars "/t></si><si>" } } return words; } public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException { char character; int[] indexes = new int[tokens.length]; while ((character = (char) br.read()) != CHAR_END) { if (character == until) { break; } for (int i = 0; i < indexes.length; i++) { if (tokens[i][indexes[i]] == character) { indexes[i]++; if (indexes[i] == tokens[i].length) { return i; } } else { indexes[i] = 0; } } } return -1; } public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException { char character; StringBuilder sb = new StringBuilder(); int index = 0; while ((character = (char) br.read()) != CHAR_END) { if (index == token.length) { if (character == until) { return sb.toString(); } else { sb.append(character); } } else { if (token[index] == character) { index++; } else { index = 0; } } } return null; } public static int convertExcelIndex(String index) { int result = 0; for (char c : index.toCharArray()) { result = result * 26 + ((int) c - (int) 'A' + 1); } return result; } }
大多数使用Office产品的编程语言都有一些中间层,这通常是瓶颈所在,使用PIA / Interop或Open XML SDK就是一个很好的例子。
一种让数据处于较低级别的方式(绕过中间层)是使用驱动程序。
150MB的单张excel文件,大约需要7分钟。
我能做的最好的是135秒内的130MB文件,大约快了3倍:
Stopwatch sw = new Stopwatch(); sw.Start(); DataSet excelDataSet = new DataSet(); string filePath = @"c:\temp\BigBook.xlsx"; // For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with "';Extended Properties=\"Excel 8.0;HDR=YES;\""; string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\""; using (OleDbConnection conn = new OleDbConnection(connectionString)) { conn.Open(); OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter ("select * from [Sheet1$]", conn); objDA.Fill(excelDataSet); //dataGridView1.DataSource = excelDataSet.Tables[0]; } sw.Stop(); Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = " + excelDataSet.Tables[0].Rows.Count);
赢得7×64,英特尔i5,2.3ghz,8GB内存,SSD250GB。
如果我可以推荐一个硬件解决scheme,如果您使用的是标准硬盘,请尝试使用SSD来解决这个问题。
注意:我无法下载您的Excel电子表格示例,因为我在公司防火墙后面。
PS。 请参阅MSDN – 最快的方法来导入200MB数据的xlsx文件 , 共识是OleDB是最快的。
PS 2.以下是如何使用python: http : //code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/
我设法使用.NET核心和Open XML SDK在大约30秒内读取文件。
下面的示例返回包含所有行和单元格的对象列表,它们支持date,数字和文本单元格。 该项目可在这里: https : //github.com/xferaa/BigSpreadSheetExample/ (应该在Windows,Linux和Mac OS上工作,不需要安装Excel或任何Excel组件)。
public List<List<object>> ParseSpreadSheet() { List<List<object>> rows = new List<List<object>>(); using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false)) { WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); OpenXmlReader reader = OpenXmlReader.Create(worksheetPart); Dictionary<int, string> sharedStringCache = new Dictionary<int, string>(); int i = 0; foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements) { sharedStringCache.Add(i++, el.InnerText); } while (reader.Read()) { if(reader.ElementType == typeof(Row)) { reader.ReadFirstChild(); List<object> cells = new List<object>(); do { if (reader.ElementType == typeof(Cell)) { Cell c = (Cell)reader.LoadCurrentElement(); if (c == null || c.DataType == null || !c.DataType.HasValue) continue; object value; switch(c.DataType.Value) { case CellValues.Boolean: value = bool.Parse(c.CellValue.InnerText); break; case CellValues.Date: value = DateTime.Parse(c.CellValue.InnerText); break; case CellValues.Number: value = double.Parse(c.CellValue.InnerText); break; case CellValues.InlineString: case CellValues.String: value = c.CellValue.InnerText; break; case CellValues.SharedString: value = sharedStringCache[int.Parse(c.CellValue.InnerText)]; break; default: continue; } if (value != null) cells.Add(value); } } while (reader.ReadNextSibling()); if (cells.Any()) rows.Add(cells); } } } return rows; }
我在一台3年前的笔记本电脑上运行了这个程序,这个笔记本电脑有SSD驱动器,8GB内存和一个Windows 10 64位Intel Core i7-4710 CPU @ 2.50GHz(两个内核)。
请注意,尽pipe以stringforms打开和parsing整个文件需要的时间less于30秒,但在使用对象时(如上次编辑的示例),使用我的蹩脚笔记本电脑的时间将快到50秒。 你的Linux服务器可能会接近30秒。
诀窍是使用SAX方法,如下所述:
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
Python的Pandas库可以用来保存和处理你的数据,但是使用它来直接加载.xlsx
文件将会非常慢,例如使用read_excel()
。
一种方法是使用Python自动将文件转换为使用Excel本身的CSV文件,然后使用Pandas使用read_csv()
加载生成的CSV文件。 这会给你一个很好的加速,但不会低于30秒:
import win32com.client as win32 import pandas as pd from datetime import datetime print ("Starting") start = datetime.now() # Use Excel to load the xlsx file and save it in csv format excel = win32.gencache.EnsureDispatch('Excel.Application') wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx') excel.DisplayAlerts = False wb.DoNotPromptForConvert = True wb.CheckCompatibility = False print('Saving') wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) excel.Application.Quit() # Use Pandas to load the resulting CSV file print('Loading CSV') df = pd.read_csv(r'c:\full path\temp.csv', dtype=str) print(df.shape) print("Done", datetime.now() - start)
列types
您的列的types可以通过传递parse_dates
和converters
和parse_dates
来指定:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)
您还应该指定infer_datetime_format=True
,因为这将大大加速date转换。
nfer_datetime_format
:布尔值,默认为False如果启用了True和parse_dates,pandas将尝试推断列中的date时间string的格式,如果可以推断,则切换到parsing它们的更快的方法。 在某些情况下,这可以将parsing速度提高5-10倍。
如果date的格式为DD/MM/YYYY
则还需要添加dayfirst=True
。
select性列
如果实际上只需要处理列1 9 11
,那么可以通过指定usecols=[0, 8, 10]
来进一步减less资源,如下所示:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])
结果数据框将只包含这3列数据。
RAM驱动器
使用RAM驱动器来存储临时CSV文件将进一步加快加载时间。
注意:这确实假定您正在使用可用Excel的Windows PC。
我正在使用戴尔Precision T1700工作站,并使用c#我可以打开文件,并在大约24秒内使用标准代码读取它的内容,以使用互操作服务打开工作簿。 使用对Microsoft Excel 15.0对象库的引用是我的代码。
我使用的语句:
using System.Runtime.InteropServices; using Excel = Microsoft.Office.Interop.Excel;
打开和阅读工作簿的代码:
public partial class MainWindow : Window { public MainWindow() { InitializeComponent(); Excel.Application xlApp; Excel.Workbook wb; Excel.Worksheet ws; xlApp = new Excel.Application(); xlApp.Visible = false; xlApp.ScreenUpdating = false; wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx"); ws = wb.Sheets["Sheet1"]; //string rng = ws.get_Range("A1").Value; MessageBox.Show(ws.get_Range("A1").Value); Marshal.FinalReleaseComObject(ws); wb.Close(); Marshal.FinalReleaseComObject(wb); xlApp.Quit(); Marshal.FinalReleaseComObject(xlApp); GC.Collect(); GC.WaitForPendingFinalizers(); } }
我创build了一个示例Java程序,能够在我的笔记本电脑(Intel i7 4核心,16 GB RAM)〜40秒内加载文件。
https://github.com/skadyan/largefile
该程序使用Apache POI库使用XSSF SAX API加载.xlsx文件。
callback接口com.stackoverlfow.largefile.RecordHandler
实现可用于处理从Excel中加载的数据。 这个接口只定义了一个带三个参数的方法
- sheetname:string,excel表名
- 行号:int,行数
- 和
data map
:Map:excel单元格引用和excel格式化的单元格值
类com.stackoverlfow.largefile.Main
演示了这个接口的一个基本实现,它只是在控制台上打印行号。
更新
woodstoxparsing器似乎比标准SAXReader
有更好的性能。 (代码更新回购)。
另外为了满足所需的性能要求,您可以考虑重新实现org.apache.poi...XSSFSheetXMLHandler
。 在实现中,可以实现更优化的string/文本值处理,并且可以跳过不必要的文本格式化操作。
看起来在Python中几乎是不可能实现的。 如果我们解压一个表单数据文件,那么只需要30秒就可以通过基于C的迭代SAXparsing器(使用lxml
,这是一个快速封装libxml2
):
from __future__ import print_function from lxml import etree import time start_ts = time.time() for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',), collect_ids=False, resolve_entities=False, huge_tree=True): pass print(time.time() - start_ts)
示例输出:27.2134890556
顺便说一下,Excel本身需要大约40秒来加载工作簿。
c#和ole解决scheme仍然有一些瓶颈。所以我用c ++和ado来testing它。
_bstr_t connStr(makeConnStr(excelFile, header).c_str()); TESTHR(pRec.CreateInstance(__uuidof(Recordset))); TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText)); while(!pRec->adoEOF) { for(long i = 0; i < pRec->Fields->GetCount(); ++i) { _variant_t v = pRec->Fields->GetItem(i)->Value; if(v.vt == VT_R8) num[i] = v.dblVal; if(v.vt == VT_BSTR) str[i] = v.bstrVal; ++cellCount; } pRec->MoveNext(); }
在i5-4460和HDD机器上,我发现xls中的50万个单元格会占用1.5个字节,而在xlsx中的相同数据量则需要2.829个。因此,可以在30秒内处理您的数据。
如果你真的需要30岁以下,使用RAM驱动器来减less文件IO.It将显着改善你的过程。 我无法下载你的数据来testing,所以请告诉我结果。
另一种应该大大改善负载/操作时间的方法是RAMDrive
创build一个RAMDrive足够的空间为您的文件和10%.. 20%的额外空间…
复制RAMDrive的文件…
从那里加载文件…取决于你的驱动器和文件系统的速度提高应该是巨大的…
我最喜欢的是IMDisk工具包
( https://sourceforge.net/projects/imdisk-toolkit/ )在这里你有一个强大的命令行脚本的一切…
我也推荐SoftPerfect ramdisk
( http://www.majorgeeks.com/files/details/softperfect_ram_disk.html )
但这也取决于您的操作系统…
我想有更多的信息关于您打开文件的系统…无论如何:
查看您的系统中是否有Windows更新
“Office的Office文件validation加载项…”
如果你有它…卸载它…
该文件应加载得更快
特别是如果从共享加载
保存您的Excel工作表到一个制表符分隔的文件,并打开它,因为你通常会读一个普通的TXT 🙂
编辑:然后,您可以逐行阅读文件,并在选项卡上拆分行。 通过索引获取所需的数据列。
你有没有尝试加载工作表的需求 ,从xlrd 0.7.1版本可用?
为此,您需要将on_demand=True
传递给open_workbook()
。
xlrd.open_workbook(filename = None,logfile = <_ io.TextIOWrapper name =''mode ='w'encoding ='UTF-8'>,verbosity = 0,use_mmap = 1,file_contents = None,encoding_override = None,formatting_info = False,on_demand = False,ragged_rows = False)
我发现读取xlsx文件的其他潜在的python解决scheme:
- 阅读'xl / sharedStrings.xml'和'xl / worksheets / sheet1.xml'中的原始xml
-
尝试openpyxl库的只读模式 ,该模式声明在大文件的内存使用情况下也被优化。
from openpyxl import load_workbook wb = load_workbook(filename='large_file.xlsx', read_only=True) ws = wb['big_data'] for row in ws.rows: for cell in row: print(cell.value)
-
如果你在Windows上运行,你可以使用PyWin32和'Excel.Application'
import time import win32com.client as win32 def excel(): xl = win32.gencache.EnsureDispatch('Excel.Application') ss = xl.Workbooks.Add() ...