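Fastest method to remove empty rows and columns from Excel files using Interop
I have many Excel files that contain data along with empty rows and empty columns, as shown in the image below.
I am trying to remove the empty rows and columns from Excel using Interop. I created a simple WinForms application with the following code, and it works fine.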
    ' Requires: Imports System.Runtime.InteropServices
    '           Imports Excel = Microsoft.Office.Interop.Excel
    Dim lstFiles As New List(Of String)
    lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))

    Dim m_XlApp = New Excel.Application
    Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks
    Dim m_xlWrkb As Excel.Workbook

    For Each strFile As String In lstFiles
        m_xlWrkb = m_xlWrkbs.Open(strFile)
        Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)

        ' Delete empty rows
        Dim intRow As Integer = 1
        While intRow <= m_XlWrkSheet.UsedRange.Rows.Count
            If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then
                m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)
            Else
                intRow += 1
            End If
        End While

        ' Delete empty columns
        Dim intCol As Integer = 1
        While intCol <= m_XlWrkSheet.UsedRange.Columns.Count
            If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then
                m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)
            Else
                intCol += 1
            End If
        End While

        ' Save and close each workbook before opening the next one
        m_xlWrkb.Save()
        m_xlWrkb.Close(SaveChanges:=True)
        Marshal.ReleaseComObject(m_xlWrkb)
    Next

    Marshal.ReleaseComObject(m_xlWrkbs)
    m_XlApp.Quit()
    Marshal.ReleaseComObject(m_XlApp)
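However, cleaning large Excel files takes a lot of time. Any suggestions for optimizing this code? Or is there another way to clean these Excel files faster? Is there a function that can delete the empty rows in one click?
I have no problem if the answers use C#.
EDIT:
I uploaded a sample file. However, not all files have the same structure.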
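I found that looping through an Excel worksheet can take some time if the worksheet is large, so my solution tries to avoid any looping over the worksheet itself. To do that, I created a two-dimensional object array from the cells returned by UsedRange: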
    Excel.Range targetCells = worksheet.UsedRange;
    object[,] allValues = (object[,])targetCells.Cells.Value;
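This is the array I loop through to get the indexes of the empty rows and columns. I made two lists of int: one holds the row indexes to delete, the other holds the column indexes to delete.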
    List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
    List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);
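These lists are sorted from high to low, which simplifies deleting rows from the bottom up and columns from right to left. Then I simply loop through each list and delete the appropriate row or column.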
    DeleteRows(emptyRows, worksheet);
    DeleteCols(emptyCols, worksheet);
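Finally, after all the empty rows and columns have been deleted, I save the file under a new file name.
Hope this helps.
EDIT:
Addressed the UsedRange issue: if there are empty rows at the top of the worksheet, those rows are now removed. Empty columns to the left of where the data starts are removed as well. This way the indexing works properly even when there are empty rows or columns before the data starts. It is done by taking the address of UsedRange, which has the form "$A$1:$D$4". That address would also allow an offset to be used if the empty rows at the top and the empty columns on the left were to be kept; in this case I simply delete them. The number of rows to delete from the top can be calculated from the first part of the address, e.g. "$A$4", where "4" is the row in which the first data appears, so the top 3 rows need to be deleted. The column part of the address has the form "A", "AB" or even "AAD". This required some translation, and thanks to "How to convert a column number (e.g. 127) into an Excel column (e.g. AA)" I was able to determine how many columns on the left need to be deleted.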
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Excel = Microsoft.Office.Interop.Excel;

    class Program
    {
        static void Main(string[] args)
        {
            Excel.Application excel = new Excel.Application();
            string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
            Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
            Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
            Excel.Range usedRange = worksheet.UsedRange;

            RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);
            DeleteEmptyRowsCols(worksheet);

            string newPath = @"H:\ExcelTestFolder\Book1_Test_Removed.xls";
            workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
            workbook.Close();
            excel.Quit();
            System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
            System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);

            Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");
            Console.ReadKey();
        }

        private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet)
        {
            Excel.Range targetCells = worksheet.UsedRange;
            object[,] allValues = (object[,])targetCells.Cells.Value;
            int totalRows = targetCells.Rows.Count;
            int totalCols = targetCells.Columns.Count;

            List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
            List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);

            // now we have a list of the empty rows and columns we need to delete
            DeleteRows(emptyRows, worksheet);
            DeleteCols(emptyCols, worksheet);
        }

        private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet)
        {
            // the rows are sorted high to low - so indexes won't shift
            foreach (int rowIndex in rowsToDelete)
            {
                worksheet.Rows[rowIndex].Delete();
            }
        }

        private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet)
        {
            // the cols are sorted high to low - so indexes won't shift
            foreach (int colIndex in colsToDelete)
            {
                worksheet.Columns[colIndex].Delete();
            }
        }

        // NOTE: the array returned by Range.Value is 1-based, so every loop below
        // runs from 1 up to and including the row/column count.
        private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols)
        {
            List<int> emptyRows = new List<int>();
            for (int i = 1; i <= totalRows; i++)
            {
                if (IsRowEmpty(allValues, i, totalCols))
                {
                    emptyRows.Add(i);
                }
            }
            // sort the list from high to low
            return emptyRows.OrderByDescending(x => x).ToList();
        }

        private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols)
        {
            List<int> emptyCols = new List<int>();
            for (int i = 1; i <= totalCols; i++)
            {
                if (IsColumnEmpty(allValues, i, totalRows))
                {
                    emptyCols.Add(i);
                }
            }
            // sort the list from high to low
            return emptyCols.OrderByDescending(x => x).ToList();
        }

        private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows)
        {
            for (int i = 1; i <= totalRows; i++)
            {
                if (allValues[i, colIndex] != null)
                {
                    return false;
                }
            }
            return true;
        }

        private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols)
        {
            for (int i = 1; i <= totalCols; i++)
            {
                if (allValues[rowIndex, i] != null)
                {
                    return false;
                }
            }
            return true;
        }

        private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange)
        {
            string addressString = usedRange.Address.ToString();
            int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);
            DeleteTopEmptyRows(worksheet, rowsToDelete);
            int colsToDelete = GetNumberOfLeftColsToDelte(addressString);
            DeleteLeftEmptyColumns(worksheet, colsToDelete);
        }

        private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow)
        {
            // startRow is the first row with data, so startRow - 1 rows sit above it
            for (int i = 0; i < startRow - 1; i++)
            {
                worksheet.Rows[1].Delete();
            }
        }

        private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount)
        {
            // colCount is the first column with data, so colCount - 1 columns sit left of it
            for (int i = 0; i < colCount - 1; i++)
            {
                worksheet.Columns[1].Delete();
            }
        }

        private static int GetNumberOfTopRowsToDelete(string address)
        {
            // "$A$4:$D$7" -> "$A$4" -> { "", "A", "4" }
            string[] splitArray = address.Split(':');
            string firstIndex = splitArray[0];
            splitArray = firstIndex.Split('$');
            string value = splitArray[2];
            int returnValue;
            return int.TryParse(value, out returnValue) ? returnValue : -1;
        }

        private static int GetNumberOfLeftColsToDelte(string address)
        {
            string[] splitArray = address.Split(':');
            string firstindex = splitArray[0];
            splitArray = firstindex.Split('$');
            string value = splitArray[1];
            return ParseColHeaderToIndex(value);
        }

        private static int ParseColHeaderToIndex(string colAdress)
        {
            // treat the header as a base-26 number where 'A' = 1 ... 'Z' = 26
            int[] digits = new int[colAdress.Length];
            for (int i = 0; i < colAdress.Length; ++i)
            {
                digits[i] = Convert.ToInt32(colAdress[i]) - 64;
            }
            int mul = 1;
            int res = 0;
            for (int pos = digits.Length - 1; pos >= 0; --pos)
            {
                res += digits[pos] * mul;
                mul *= 26;
            }
            return res;
        }
    }
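EDIT 2: For testing, I wrote a method that loops through the worksheet itself and compared it with my code that loops through the object array. It shows a significant difference.
Method that loops through the worksheet and deletes the empty rows and columns: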
    enum RowOrCol { Row, Column };

    private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet)
    {
        Excel.Range usedRange = worksheet.UsedRange;
        RemoveEmpty(usedRange, RowOrCol.Row);
        RemoveEmpty(usedRange, RowOrCol.Column);
    }

    private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol)
    {
        int count;
        Excel.Range curRange;
        if (rowOrCol == RowOrCol.Column)
            count = usedRange.Columns.Count;
        else
            count = usedRange.Rows.Count;

        // walk from the last row/column to the first so deletions do not shift pending indexes
        for (int i = count; i > 0; i--)
        {
            bool isEmpty = true;
            if (rowOrCol == RowOrCol.Column)
                curRange = usedRange.Columns[i];
            else
                curRange = usedRange.Rows[i];

            foreach (Excel.Range cell in curRange.Cells)
            {
                if (cell.Value != null)
                {
                    isEmpty = false;
                    break; // we can exit this loop since the range is not empty
                }
                // cell value is null: continue checking
            } // end loop through each cell in this range (row or column)

            if (isEmpty)
            {
                curRange.Delete();
            }
        }
    }
Then a Main method to test and time the two approaches:
    // Requires: using System.Diagnostics; (for Stopwatch)
    static void Main(string[] args)
    {
        Excel.Application excel = new Excel.Application();
        string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
        Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
        Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
        Excel.Range usedRange = worksheet.UsedRange;

        // Start test for looping through the Excel worksheet
        Stopwatch sw = new Stopwatch();
        Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");
        sw.Start();
        ConventionalRemoveEmptyRowsCols(worksheet);
        sw.Stop();
        // ElapsedMilliseconds is the total elapsed time; Elapsed.Milliseconds is only the 0-999 ms component
        Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");

        string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";
        workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
        workbook.Close();
        Console.WriteLine("");

        // Start test for looping through the object array
        workbook = excel.Workbooks.Open(originalPath);
        worksheet = workbook.Worksheets["Sheet1"];
        usedRange = worksheet.UsedRange;
        Console.WriteLine("Start stopwatch to loop thru object array...");
        sw = new Stopwatch();
        sw.Start();
        DeleteEmptyRowsCols(worksheet);
        sw.Stop();

        // display results from second test
        Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");
        string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";
        workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);
        workbook.Close();
        excel.Quit();
        System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
        System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);

        Console.WriteLine("");
        Console.WriteLine("Finished testing methods - Press any key to exit");
        Console.ReadKey();
    }
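EDIT 3: As per the OP's request, I updated and changed the code to match the OP's code. With this I found some interesting results. See below.
I changed the code to match the functions you are using, i.e. EntireRow and CountA. I found that the code below performs terribly: in my tests its execution time was in the 800+ millisecond range. However, one subtle change made a huge difference.
On the line: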
while (rowIndex <= worksheet.UsedRange.Rows.Count)
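This slows things down a lot. If you create a range variable for UsedRange instead of re-grabbing it on every iteration of the while loop, it makes a huge difference. So when I changed the while loops to: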
    Excel.Range usedRange = worksheet.UsedRange;
    int rowIndex = 1;
    while (rowIndex <= usedRange.Rows.Count)

and

    while (colIndex <= usedRange.Columns.Count)
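the performance came very close to my object array solution. I did not post the results, since you can take the code below and switch the while loops between grabbing UsedRange on each iteration and using the usedRange variable to test this yourself.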
    // NOTE: assumes `excel` (the Excel.Application instance) is accessible here, e.g. as a static field
    private static void RemoveEmptyRowsCols3(Excel.Worksheet worksheet)
    {
        //Excel.Range usedRange = worksheet.UsedRange; // <- using this variable makes the while loop much faster
        int rowIndex = 1;

        // delete empty rows
        //while (rowIndex <= usedRange.Rows.Count) // <- changing this one line makes a huge difference - not grabbing the UsedRange with each iteration...
        while (rowIndex <= worksheet.UsedRange.Rows.Count)
        {
            if (excel.WorksheetFunction.CountA(worksheet.Cells[rowIndex, 1].EntireRow) == 0)
            {
                worksheet.Cells[rowIndex, 1].EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp);
            }
            else
            {
                rowIndex++;
            }
        }

        // delete empty columns
        int colIndex = 1;
        // while (colIndex <= usedRange.Columns.Count) // <- change here also
        while (colIndex <= worksheet.UsedRange.Columns.Count)
        {
            if (excel.WorksheetFunction.CountA(worksheet.Cells[1, colIndex].EntireColumn) == 0)
            {
                worksheet.Cells[1, colIndex].EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft);
            }
            else
            {
                colIndex++;
            }
        }
    }
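Update by @Hadi
You can change the DeleteCols and DeleteRows functions to get better performance when the Excel file contains extra blank rows and columns after the last used ones: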
    private static void DeleteRows(List<int> rowsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
    {
        if (rowsToDelete.Count == 0) return; // nothing to delete

        // the rows are sorted high to low - so indexes won't shift
        int lastRowToDelete = rowsToDelete.Max();
        // highest non-empty row (0 if every row up to lastRowToDelete is empty)
        int lastNonEmptyRow = Enumerable.Range(1, lastRowToDelete).Except(rowsToDelete).DefaultIfEmpty(0).Max();

        if (lastNonEmptyRow < lastRowToDelete)
        {
            // there are empty rows after the last non-empty row: delete them all in one call
            Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[lastNonEmptyRow + 1, 1];
            Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[lastRowToDelete, 1];
            worksheet.Range[cell1, cell2].EntireRow.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftUp);
        }

        // delete the remaining empty rows that lie above the last non-empty row
        foreach (int rowIndex in rowsToDelete.Where(x => x < lastNonEmptyRow))
        {
            worksheet.Rows[rowIndex].Delete();
        }
    }

    private static void DeleteCols(List<int> colsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
    {
        if (colsToDelete.Count == 0) return; // nothing to delete

        // the cols are sorted high to low - so indexes won't shift
        int lastColToDelete = colsToDelete.Max();
        // highest non-empty column (0 if every column up to lastColToDelete is empty)
        int lastNonEmptyCol = Enumerable.Range(1, lastColToDelete).Except(colsToDelete).DefaultIfEmpty(0).Max();

        if (lastNonEmptyCol < lastColToDelete)
        {
            // there are empty columns after the last non-empty column: delete them all in one call
            Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[1, lastNonEmptyCol + 1];
            Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[1, lastColToDelete];
            worksheet.Range[cell1, cell2].EntireColumn.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftToLeft);
        }

        // delete the remaining empty columns that lie left of the last non-empty column
        foreach (int colIndex in colsToDelete.Where(x => x < lastNonEmptyCol))
        {
            worksheet.Columns[colIndex].Delete();
        }
    }
See my answer at "Get last non-empty column and row index from Excel using Interop".
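The same idea of deleting a block of adjacent empty rows in a single call can be extended to every block, not only the trailing one. The sketch below is an illustration rather than part of the answer above: `IndexRuns` and `ToDescendingRuns` are hypothetical names, and the code works purely on the list of indexes, so it needs no Excel reference.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexRuns
{
    // Groups indexes into contiguous (First, Last) runs, ordered from the
    // highest run to the lowest, so that deleting the runs in order never
    // shifts the indexes of runs that are still pending.
    public static List<(int First, int Last)> ToDescendingRuns(IEnumerable<int> indexes)
    {
        var runs = new List<(int First, int Last)>();
        foreach (int index in indexes.Distinct().OrderByDescending(x => x))
        {
            if (runs.Count > 0 && runs[runs.Count - 1].First == index + 1)
            {
                // index is directly below the current run: extend the run downwards
                runs[runs.Count - 1] = (index, runs[runs.Count - 1].Last);
            }
            else
            {
                // gap found: start a new run
                runs.Add((index, index));
            }
        }
        return runs;
    }
}
```

Each run could then be removed with one call in the style already used above, for example `worksheet.Range[worksheet.Cells[first, 1], worksheet.Cells[last, 1]].EntireRow.Delete(...)`, which should reduce the number of Interop calls when empty rows come in blocks.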
Maybe something to consider:
    Sub usedRangeDeleteRowsCols()
        ' Declare each variable explicitly; "Dim a, b, c As Long" makes only c a Long
        Dim LastRow As Long, LastCol As Long, i As Long

        LastRow = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByRows).Row
        LastCol = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByColumns).Column

        For i = LastRow To 1 Step -1
            If WorksheetFunction.CountA(Range(Cells(i, 1), Cells(i, LastCol))) = 0 Then
                Cells(i, 1).EntireRow.Delete
            End If
        Next

        For i = LastCol To 1 Step -1
            If WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i))) = 0 Then
                Cells(1, i).EntireColumn.Delete
            End If
        Next
    End Sub
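I think there are two efficiency gains compared with the equivalent routine in the original code. First, instead of using Excel's unreliable UsedRange property, we find the last value and only scan the rows and columns within the genuinely used range.
Second, the worksheet count function likewise works only within the genuinely used range. For example, when searching for blank rows we only look across the used columns (rather than .EntireRow).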
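The For loops work backwards because, for example, every time a row is deleted the row addresses of the data that follows it change. Working backwards means that the row addresses of the data still to be processed do not change.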
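The effect is easy to reproduce without Excel. The following minimal sketch uses a plain `List<string>` as a stand-in for worksheet rows (the class and method names are made up for this illustration) and contrasts a backward walk with a naive forward walk that increments the index after every step.

```csharp
using System;
using System.Collections.Generic;

public static class BackwardDelete
{
    // Removes blank entries while walking from the last index to the first.
    public static List<string> RemoveBlanksBackward(IEnumerable<string> rows)
    {
        var result = new List<string>(rows);
        for (int i = result.Count - 1; i >= 0; i--)
        {
            if (string.IsNullOrEmpty(result[i]))
            {
                // only items after i shift, and those were already processed
                result.RemoveAt(i);
            }
        }
        return result;
    }

    // Naive forward walk: after RemoveAt(i) the next item slides into slot i,
    // but the loop still advances, so the second of two adjacent blanks is skipped.
    public static List<string> RemoveBlanksForwardNaive(IEnumerable<string> rows)
    {
        var result = new List<string>(rows);
        for (int i = 0; i < result.Count; i++)
        {
            if (string.IsNullOrEmpty(result[i]))
            {
                result.RemoveAt(i);
            }
        }
        return result;
    }
}
```

With the input `{ "a", "", "", "b" }` the backward version leaves two items, while the naive forward version leaves one blank behind. The forward loops shown earlier in this thread avoid the problem in a different way: they only increment the index when nothing was deleted.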
In my opinion, the most time-consuming part is probably enumerating and finding the empty rows and columns.
What about: http://www.howtogeek.com/206696/how-to-quickly-and-easily-delete-blank-rows-and-columns-in-excel-2013/
EDIT:
What about:
    m_XlWrkSheet.Columns("A:A").SpecialCells(xlCellTypeBlanks).EntireRow.Delete
    m_XlWrkSheet.Rows("1:1").SpecialCells(xlCellTypeBlanks).EntireColumn.Delete
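Tested on the sample data: the results look good and the performance is better (tested from VBA, but the difference is huge).
UPDATE:
Tested on a sample Excel file with 14k rows (made from the sample data): original code ~30 s, this version <1 s.
The simplest way I know of is to hide the non-blank cells and delete the visible ones: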
    var range = m_XlWrkSheet.UsedRange;
    range.SpecialCells(XlCellType.xlCellTypeConstants).EntireRow.Hidden = true;
    range.SpecialCells(XlCellType.xlCellTypeVisible).Delete(XlDeleteShiftDirection.xlShiftUp);
    range.EntireRow.Hidden = false;
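A faster way is to not delete anything at all, but to move (cut + paste) the non-blank areas instead.
The fastest Interop way (there are faster but more complicated ways that do not open the file) is to read all the values into an array, move the values within the array, and write the values back: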
    object[,] values = m_XlWrkSheet.UsedRange.Value2 as object[,];
    // some code here (the values start from values[1, 1] not values[0, 0])
    m_XlWrkSheet.UsedRange.Value2 = values;
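One possible shape for the "move the values in the array" step is sketched below. It is an assumption about what that step could look like, not the answerer's code: `ArrayCompactor.Compact` is a hypothetical helper that treats a cell as empty only when it is `null` and returns a new array with the same lower bounds as the input, so it handles the 1-based arrays that Interop returns.

```csharp
using System;
using System.Collections.Generic;

public static class ArrayCompactor
{
    // Returns a new array holding only the rows and columns of 'values' that
    // contain at least one non-null cell. The result keeps the lower bounds
    // of the input (Interop arrays start at 1, not 0).
    public static object[,] Compact(object[,] values)
    {
        int rowLo = values.GetLowerBound(0), rowHi = values.GetUpperBound(0);
        int colLo = values.GetLowerBound(1), colHi = values.GetUpperBound(1);

        // collect the indexes of rows that hold at least one value
        var keepRows = new List<int>();
        for (int r = rowLo; r <= rowHi; r++)
            for (int c = colLo; c <= colHi; c++)
                if (values[r, c] != null) { keepRows.Add(r); break; }

        // collect the indexes of columns that hold at least one value
        var keepCols = new List<int>();
        for (int c = colLo; c <= colHi; c++)
            for (int r = rowLo; r <= rowHi; r++)
                if (values[r, c] != null) { keepCols.Add(c); break; }

        var result = (object[,])Array.CreateInstance(
            typeof(object),
            new[] { keepRows.Count, keepCols.Count },
            new[] { rowLo, colLo });

        // copy the surviving cells into their compacted positions
        for (int i = 0; i < keepRows.Count; i++)
            for (int j = 0; j < keepCols.Count; j++)
                result[rowLo + i, colLo + j] = values[keepRows[i], keepCols[j]];

        return result;
    }
}
```

Because the compacted array is smaller than the original range, writing it back would presumably mean clearing the old range first and assigning the array to a range resized to the new dimensions. Note also that this approach moves values only: reading `Value2` does not carry formulas or formatting.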
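You can open an ADO connection to the worksheet, get the list of fields, and issue a SQL statement that includes only the known fields and also excludes records that have no values in the known fields.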
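As a rough illustration of the SQL such an approach might issue, the sketch below only builds the statement text. Everything in it is an assumption for illustration: `SheetQuery.BuildSelect` is a hypothetical helper, the field names would come from the connection's schema, "no values in the known fields" is read as "every known field is NULL", and the `[SheetName$]` table syntax is the convention of the Jet/ACE OLE DB providers. Opening the connection and running the query are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SheetQuery
{
    // Builds a SELECT that returns only the named fields and skips records
    // in which every one of those fields is NULL.
    public static string BuildSelect(string sheetName, IReadOnlyList<string> fields)
    {
        if (fields == null || fields.Count == 0)
            throw new ArgumentException("At least one field is required.", nameof(fields));

        string columns = string.Join(", ", fields.Select(f => "[" + f + "]"));
        string notAllNull = string.Join(" OR ", fields.Select(f => "[" + f + "] IS NOT NULL"));
        return "SELECT " + columns + " FROM [" + sheetName + "$] WHERE " + notAllNull;
    }
}
```

Selecting only the known fields drops the empty columns, and the WHERE clause drops the empty rows, so the result set could be written to a new file without deleting anything from the original.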