Write a program to find the 100 largest numbers out of 1 billion numbers
I recently went to an interview where I was asked to "write a program to find the 100 largest numbers out of an array of 1 billion numbers."
I was only able to give a brute-force solution, which is to sort the array in O(n log n) time complexity and take the last 100 numbers.
Arrays.sort(array);
The interviewer was looking for a better time complexity. I tried a couple of other solutions but could not answer him. Is there a solution with better time complexity?
You can keep a priority queue of the 100 largest numbers, iterate through the billion numbers, and whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.
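A minimal Java sketch of this approach (illustrative only; it assumes the input is an int[] called array):

import java.util.PriorityQueue;

// Min-heap holding the 100 largest values seen so far; the head is the smallest of them.
static int[] top100(int[] array) {
    PriorityQueue<Integer> queue = new PriorityQueue<>(100); // min-heap by default
    for (int value : array) {
        if (queue.size() < 100) {
            queue.offer(value);                // fill the first 100 slots
        } else if (value > queue.peek()) {     // larger than the current 100th largest?
            queue.poll();                      // drop the head (smallest of the top 100)
            queue.offer(value);
        }
    }
    int[] result = new int[queue.size()];
    for (int i = 0; i < result.length; i++) {
        result[i] = queue.poll();              // drains in ascending order
    }
    return result;
}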
EDIT: As was pointed out in the comments, with a priority queue implemented with a heap, the complexity of an insertion into the queue is O(log N).
In the worst case you get billion * log2(100), which is better than billion * log2(billion).
In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N), which can be very significant when K is small relative to N.
EDIT2:
The expected time of this algorithm is quite interesting, since in each iteration an insertion may or may not occur. The probability that the i-th number will be inserted into the queue is the probability that a random variable is larger than at least i-K random variables from the same distribution (the first K numbers are automatically added to the queue). We can use order statistics (see the link) to calculate this probability. For example, assume the numbers were randomly selected from {0, 1}; then the expected value of the (i-K)-th number (out of i numbers) is (i-K)/i, and the chance of a random variable being larger than this value is 1 - [(i-K)/i] = K/i.
Thus, the expected number of insertions is:

K + sum_{i=K+1..N} K/i = K + K (ln(N) - ln(K))

The expected running time can be expressed as:

K + (N - K) + [K + K (ln(N) - ln(K))] * log(K)/2

(K time to generate the queue with the first K elements, then N - K comparisons, and the expected number of insertions as described above, each taking an average of log(K)/2 time.)
Note that when N is very large compared to K, this expression is much closer to n than to N log K. This is somewhat intuitive: as in the case of this question, even after 10,000 iterations (which is very small compared to a billion), the chance of a number being inserted into the queue is very small.
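As a rough back-of-the-envelope check, plugging in the numbers from the question (N = 10^9, K = 100):

expected insertions ≈ K (1 + ln(N/K)) = 100 * (1 + ln(10^7)) ≈ 100 * (1 + 16.1) ≈ 1,700

so only a couple of thousand of the billion iterations ever touch the heap; essentially all of the work is the single comparison against the queue head.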
If this is asked in an interview, I think the interviewer probably wants to see your problem-solving process, not just your knowledge of algorithms.
The description is quite general, so maybe you can ask him about the range or the meaning of these numbers to make the problem clear. Doing so may impress the interviewer. For example, if these numbers stand for the ages of people in a country (e.g. China), then it is a much easier problem. With the reasonable assumption that nobody alive is older than 200, you can use an int array of size 200 (maybe 201) to count, in one pass, the number of people with the same age; here the index is the age. After this, finding the 100 largest numbers is a piece of cake. By the way, this algorithm is called counting sort (a sketch follows).
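A minimal Java sketch of the counting-sort idea for the age example (the 201-slot array and the age bound are just the assumptions stated above):

// Count how many people have each age (0..200), then walk down from the
// highest age, emitting values until 100 have been collected.
static int[] top100Ages(int[] ages) {
    int[] counts = new int[201];          // index = age; assumes nobody is older than 200
    for (int age : ages) {
        counts[age]++;
    }
    int[] result = new int[100];
    int filled = 0;
    for (int age = 200; age >= 0 && filled < 100; age--) {
        for (int c = counts[age]; c > 0 && filled < 100; c--) {
            result[filled++] = age;       // duplicates are emitted individually
        }
    }
    return result;
}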
In any case, making the question more specific and clearer is good for you.
You can iterate over the numbers, which takes O(n).
Whenever you find a value greater than the current minimum, add the new value to a circular queue of size 100.
The minimum of that circular queue is your new comparison value. Keep adding to the queue; if it is full, extract the minimum from the queue.
I realize this is tagged with 'algorithm', but I will toss out some other options, since it probably should also be tagged 'interview'.
What is the source of the billion numbers? If it is a database, then 'select value from table order by value desc limit 100' would do the job quite nicely - there may be dialect differences.
Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data is in a file, then 'cat srcfile | sort (options as needed) | head -100' will let you quickly get productive work done that you are getting paid to do while the computer handles this trivial chore.
If it is repeated, you would take any decent approach to get the initial answer and store/cache the results so that you could continuously report the top 100.
Finally, there is this consideration. Are you looking for an entry-level job and interviewing with an annoying manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, approach it like a manager would, concerned with the development and maintenance costs of the solution, say "thank you very much", and leave if the interviewer wants to focus on CS trivia. He and you are unlikely to make much progress there.
Better luck in your next interview.
You can use the Quickselect algorithm to find the number at (order) index [billion - 101], and then iterate over the numbers to find the numbers that are bigger than that number.
array={...the billion numbers...}
result[100];

pivot=QuickSelect(array,billion-101);//O(N)

for(i=0;i<billion;i++)//O(N)
    if(array[i]>=pivot)
        result.add(array[i]);
This algorithm's time is: 2 * O(N) = O(N) (average-case performance).
The second option, as Thomas Jungblut suggests, is:
Use a heap: building the MAX heap takes O(N), and then the top 100 largest numbers are at the top of the heap; all you need is to take them off the heap (100 * O(log(N))).
This algorithm's time is: O(N) + 100 * O(log(N)) = O(N).
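For concreteness, here is a hedged Java sketch of the Quickselect route (Lomuto partition with a random pivot; I use k = n - 100 so that exactly the last 100 slots hold the answer, a slight adjustment of the billion - 101 index used above):

import java.util.Random;

// After quickselect(a, k) returns, a[k] holds the k-th smallest element, everything
// to its left is <= a[k] and everything to its right is >= a[k]. So for the 100 largest,
// call quickselect(a, a.length - 100) and read a[a.length - 100 .. a.length - 1].
static void quickselect(int[] a, int k) {
    Random rnd = new Random();
    int lo = 0, hi = a.length - 1;
    while (lo < hi) {
        int p = lo + rnd.nextInt(hi - lo + 1);   // random pivot guards against bad inputs
        swap(a, p, hi);
        int pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) swap(a, i, store++);
        }
        swap(a, store, hi);                      // pivot is now in its final sorted position
        if (store == k) return;
        if (store < k) lo = store + 1; else hi = store - 1;
    }
}

static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }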
My immediate reaction to this would be to use a heap, but there is a way to use QuickSelect without keeping all of the input values on hand at any one time.
Create an array of size 200 and fill it with the first 200 input values. Run QuickSelect and discard the low 100, leaving you with 100 free places. Read in the next 100 input values and run QuickSelect again. Continue until you have run through the entire input in batches of 100.
At the end you have 100 values. For N values you have run QuickSelect roughly N/100 times. Each QuickSelect costs about 200 times some constant, so the total cost is 2N times some constant. This looks linear in the size of the input to me, regardless of the parameter size I have hard-wired to 100 in this explanation.
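A rough sketch of this batching scheme (the sizes 200 and 100 are the ones from this answer; the IntSupplier input source and the quickselect(int[], int) helper, like the one sketched in the previous answer, are illustrative assumptions):

import java.util.Arrays;
import java.util.function.IntSupplier;

// Scratch buffer of 200: the running best 100 plus the next batch of 100 inputs.
// Assumes n >= 200 and, for simplicity, that n is a multiple of 100.
static int[] top100Batched(IntSupplier input, long n) {
    int[] buf = new int[200];
    for (int i = 0; i < 200; i++) buf[i] = input.getAsInt();   // first 200 values
    for (long consumed = 200; consumed < n; consumed += 100) {
        quickselect(buf, 100);         // buf[100..199] now hold the best 100 seen so far
        for (int i = 0; i < 100; i++) {
            buf[i] = input.getAsInt(); // overwrite the discarded lower half with new input
        }
    }
    quickselect(buf, 100);
    return Arrays.copyOfRange(buf, 100, 200);
}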
Although the other quickselect solutions have been downvoted, the fact remains that quickselect will find the solution faster than using a queue of size 100. Quickselect has an expected running time of 2n + o(n), counting comparisons. A very simple implementation would be
array = input array of length n
r = Quickselect(array,n-100)
result = array of length 100

for(i = 1 to n)
    if(array[i] > r)
        add array[i] to result
This will take 3n + o(n) comparisons on average. Moreover, quickselect can be made to leave the largest 100 items in the 100 rightmost positions of the array, which improves efficiency. So the running time can actually be improved to 2n + o(n).
There is the issue that this is expected running time rather than worst case, but by using a decent pivot selection strategy (e.g. pick 21 elements at random and choose the median of those 21 as the pivot), the number of comparisons can be guaranteed with high probability to be at most (2 + c)n for an arbitrarily small constant c.
In fact, by using an optimized sampling strategy (e.g. sample sqrt(n) elements at random and choose the 99th percentile), the running time can be brought down to (1 + c)n + o(n) for an arbitrarily small constant c (assuming that K, the number of elements to be selected, is o(n)).
On the other hand, using a queue of size 100 will require O(log(100) n) comparisons, and log base 2 of 100 is approximately 6.6.
If we think of this problem in the more abstract sense of selecting the largest K elements from an array of size N, where K = o(N) but both K and N go to infinity, then the running time of the quickselect version will be O(N) and the queue version will be O(N log K), so in this sense quickselect is also asymptotically superior.
In the comments it was mentioned that the queue solution will run in expected time N + K log N on random input. Of course, the random-input assumption is never valid unless the question states it explicitly. The queue solution could be made to traverse the array in random order, but this incurs the additional cost of N calls to a random number generator, as well as either permuting the entire input array or allocating a new array of length N containing the random indices.
If the problem does not allow you to move around the elements in the original array, and the cost of allocating memory is high so that duplicating the array is not an option, that is a different matter. But strictly in terms of running time, this is the best solution.
Take the first 100 numbers of the billion and sort them. Now just iterate through the billion; if a source number is higher than the smallest of the 100, insert it in sorted order. What you end up with is something much closer to O(n) over the size of the set.
Two options:
(1) Heap (PriorityQueue)
Maintain a min-heap of size 100. Traverse the array. Whenever an element is larger than the first element in the heap, replace it.
Insert element into heap: O(log 100)
Compare with the first element: O(1)
There are n elements in the array, so the total would be O(n log 100), which is O(n).
(2) Map-reduce model.
This is very similar to the word-count example in Hadoop. Map job: count the frequency or the number of occurrences of each element. Reduce: get the top K elements.
Usually I would give the recruiter both answers. Give them whichever they like. Of course the map-reduce coding would be laborious, since you have to know every exact parameter. No harm in practicing it. Good luck.
A very simple solution would be to iterate through the array 100 times, which is O(n).
Each time, pull out the largest number (and change its value to the minimum value so that you do not see it in the next iteration, or keep track of the indices of previous answers - by tracking indices, the original array can contain multiple copies of the same number). After 100 iterations, you have the 100 largest numbers.
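A tiny Java sketch of the marking variant (the Integer.MIN_VALUE sentinel is my assumption and must not occur in the real data):

// 100 passes over the array; each pass extracts the current maximum and blanks it
// out so the next pass skips it. O(100 * n) time, O(1) extra space, but destructive.
static int[] top100ByRepeatedMax(int[] a) {
    int[] result = new int[100];
    for (int pass = 0; pass < 100; pass++) {
        int maxIdx = 0;
        for (int i = 1; i < a.length; i++) {
            if (a[i] > a[maxIdx]) maxIdx = i;
        }
        result[pass] = a[maxIdx];
        a[maxIdx] = Integer.MIN_VALUE;   // sentinel: assumed smaller than any real value
    }
    return result;
}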
Inspired by @ron teller's answer, here is a barebones C program to do what you want.
#include <stdlib.h>
#include <stdio.h>

#define TOTAL_NUMBERS 1000000000
#define N_TOP_NUMBERS 100

int compare_function(const void *first, const void *second)
{
    int a = *((int *) first);
    int b = *((int *) second);
    if (a > b){
        return 1;
    }
    if (a < b){
        return -1;
    }
    return 0;
}

int main(int argc, char ** argv)
{
    if(argc != 2){
        printf("please supply a path to a binary file containing 1000000000"
               "integers of this machine's wordlength and endianness\n");
        exit(1);
    }
    FILE * f = fopen(argv[1], "r");
    if(!f){
        exit(1);
    }
    int top100[N_TOP_NUMBERS] = {0};
    int sorts = 0;
    for (int i = 0; i < TOTAL_NUMBERS; i++){
        int number;
        int ok;
        ok = fread(&number, sizeof(int), 1, f);
        if(!ok){
            printf("not enough numbers!\n");
            break;
        }
        if(number > top100[0]){
            sorts++;
            top100[0] = number;
            qsort(top100, N_TOP_NUMBERS, sizeof(int), compare_function);
        }
    }
    printf("%d sorts made\n"
           "the top 100 integers in %s are:\n",
           sorts, argv[1] );
    for (int i = 0; i < N_TOP_NUMBERS; i++){
        printf("%d\n", top100[i]);
    }
    fclose(f);
    exit(0);
}
On my machine (a core i3 with a fast SSD) it takes 25 seconds, and 1724 sorts are made. I generated a binary file with dd if=/dev/urandom/ count=1000000000 bs=1.
Obviously there are performance issues with reading only 4 bytes at a time - from disk, that is - but this is just an example. On the plus side, very little memory is needed.
The simplest solution is to scan the big array of a billion numbers and hold the 100 largest values found so far in a small array buffer without any sorting, and to remember the smallest value of this buffer. First I thought this method was proposed by fordprefect, but in a comment he said that he assumed the 100-number data structure to be implemented as a heap. Whenever a new number is found that is larger than the minimum in the buffer, it overwrites that minimum, and the buffer is searched again for its current minimum. If the numbers in the billion-number array are randomly distributed, most of the time the value from the big array is compared to the minimum of the small array and discarded. Only for a very small fraction of values does the value have to be inserted into the small array. So the difference in the cost of manipulating the data structure holding the small number of elements can be neglected. For a small number of elements it is hard to determine whether the use of a priority queue is actually faster than my naive approach.
I want to estimate the number of insertions into the small 100-element array buffer when the 10^9-element array is scanned. The program scans the first 1000 elements of this big array and has to insert at most 1000 elements into the buffer. The buffer contains 100 of the 1000 elements scanned, that is 0.1 of the elements scanned. So we assume that the probability that a value from the big array is larger than the current minimum of the buffer is about 0.1; such an element has to be inserted into the buffer. Now the program scans the next 10^4 elements from the big array. Because the minimum of the buffer increases every time a new element is inserted, we estimate that the ratio of elements larger than our current minimum is at most about 0.1, so there are at most 0.1 * 10^4 = 1000 elements to insert. Actually the expected number of elements inserted into the buffer will be smaller. After scanning these 10^4 elements, the fraction of numbers in the buffer will be about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers, we assume that no more than 0.01 * 10^5 = 1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000 + 10^4 + 10^5 + ... + 10^9 ~ 10^9 elements of the big array. So when scanning an array of 10^9 numbers of random size, we expect no more than about 10^4 (= 7000) insertions into the buffer. After each insertion into the buffer the new minimum must be found. If the buffer is a simple array, we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap), we need at least 1 comparison to find the minimum. To compare the elements of the big array we need 10^9 comparisons. So all in all we need about 10^9 + 100 * 10^4 = 1.001 * 10^9 comparisons when using an array as buffer, and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap). So if performance is determined by the number of comparisons, using a heap brings a gain of only 0.1%. But what is the difference in execution time between inserting an element into a 100-element heap and replacing an element in a 100-element array and finding its new minimum?
- At the theoretical level: how many comparisons does an insertion into a heap need? I know it is O(log(n)), but how large is the constant factor?
- At the machine level: what is the impact of caching and branch prediction on the execution time of a heap insertion versus a linear search in an array?
- At the implementation level: which additional costs are hidden in the heap data structure supplied by a library or a compiler?
I think these are some of the questions that have to be answered before one can try to estimate the real difference between the performance of a 100-element heap and a 100-element array. So it makes sense to do an experiment and measure the real performance.
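For reference, a minimal Java sketch of the naive buffer described here (a plain array with a linear rescan for the new minimum after each of the rare insertions; it assumes at least 100 inputs):

// Plain 100-slot buffer: compare each incoming value against the buffer's current
// minimum; on an insertion, overwrite the minimum and rescan for the new one.
static int[] top100Buffer(int[] big) {
    int[] buf = new int[100];
    System.arraycopy(big, 0, buf, 0, 100);     // seed with the first 100 values
    int minIdx = 0;
    for (int i = 1; i < 100; i++) if (buf[i] < buf[minIdx]) minIdx = i;
    for (int i = 100; i < big.length; i++) {
        if (big[i] > buf[minIdx]) {            // the common case stops after this comparison
            buf[minIdx] = big[i];
            minIdx = 0;                        // ~100 comparisons, but only ~10^4 times overall
            for (int j = 1; j < 100; j++) if (buf[j] < buf[minIdx]) minIdx = j;
        }
    }
    return buf;
}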
Although in this question we should search for the top 100 numbers, I will generalize things and write x. Still, I will treat x as a constant value.
Algorithm for the largest x elements of n:
I will call the return value LIST. It is a collection of x elements (in my opinion that should be a linked list).
- The first x elements are taken from the pool "as they come" and sorted in LIST (this is done in constant time since x is treated as a constant: O(x log(x)) time).
- For every element that comes next, we check whether it is bigger than the smallest element in LIST; if it is, we pop the smallest one and insert the current element into LIST. Since LIST is an ordered list, every element finds its place in logarithmic time (binary search), and since it is an ordered list, insertion is not a problem. Every step is also done in constant time (O(log(x)) time).
So, what is the worst-case scenario?
x log(x) + (n - x)(log(x) + 1) = n log(x) + n - x
So this is O(n) time for the worst case. The +1 is the check whether the number is greater than the smallest one in LIST. The expected time for the average case will depend on the mathematical distribution of those n elements.
Possible improvements
This algorithm can be slightly improved for the worst-case scenario, but IMHO (I cannot prove this claim) that will degrade average behavior. The asymptotic behavior will be the same.
The improvement in this algorithm is that we will not check whether an element is greater than the smallest. For each element we will try to insert it, and if it is smaller than the smallest we will disregard it. Although that sounds preposterous, if we regard only the worst-case scenario we will have
x log(x) + (n - x) log(x) = n log(x)
operations.
For this use case I don't see any further improvements. Yet you must ask yourself - what if I have to do this more than log(n) times and for different x-es? Obviously we would sort the array in O(n log(n)) and take our x elements whenever we need them.
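A small Java sketch of the sorted-LIST scheme above (an int array plus Arrays.binarySearch and arraycopy stand in for the linked list; x = 100 here, and all names are illustrative):

import java.util.Arrays;

// Keep the current top x values in ascending order; list[0] is always the smallest
// kept value. Each insertion finds its slot by binary search and shifts with arraycopy.
static int[] topX(int[] input, int x) {
    int[] list = Arrays.copyOf(input, x);
    Arrays.sort(list);                                   // the first x elements, sorted
    for (int i = x; i < input.length; i++) {
        int v = input[i];
        if (v <= list[0]) continue;                      // not better than the smallest kept value
        int pos = Arrays.binarySearch(list, v);
        if (pos < 0) pos = -pos - 1;                     // insertion point when v is absent
        // drop list[0], shift list[1..pos-1] down one slot, place v just before pos
        System.arraycopy(list, 1, list, 0, pos - 1);
        list[pos - 1] = v;
    }
    return list;
}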
This question would be answered with N log(100) complexity (instead of N log N) with just one line of C++ code.
std::vector<int> myvector = ...; // Define your 1 billion numbers.
                                 // Assumed integer just for concreteness
// std::greater puts the largest elements first; without it, partial_sort
// would give the 100 smallest instead.
std::partial_sort (myvector.begin(), myvector.begin()+100, myvector.end(),
                   std::greater<int>());
The final answer would be a vector where the first 100 elements are guaranteed to be the 100 biggest numbers of your array, while the remaining elements are unordered.
The C++ STL (standard library) is quite handy for this kind of problem.
Note: I am not saying that this is the optimal solution, but it would have saved your interview.
The simple solution would be to use a priority queue, adding the first 100 numbers to the queue and keeping track of the smallest number in the queue, then iterating through the other billion numbers, and each time we find one that is larger than the smallest number in the priority queue, we remove the smallest number, add the new number, and again keep track of the smallest number in the queue.
If the numbers were in random order, this would work beautifully, because as we iterate through a billion random numbers it would be very rare that the next number is among the 100 largest so far. But the numbers might not be random. If the array were already sorted in ascending order, then we would always insert an element into the priority queue.
So we pick say 100,000 random numbers from the array first. To avoid random access which might be slow, we add say 400 random groups of 250 consecutive numbers. With that random selection, we can be quite sure that very few of the remaining numbers are in the top hundred, so the execution time will be very close to that of a simple loop comparing a billion numbers to some maximum value.
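One way to make this concrete is to use the sample only to fix a threshold and then make a single clean pass, which avoids counting the sampled values twice. A hedged Java sketch (the group count 400 and run length 250 come from this answer; everything else is illustrative):

import java.util.PriorityQueue;
import java.util.Random;

// Phase 1: sample 400 runs of 250 consecutive values and take the 100th largest of
// the sample as a threshold t. Phase 2: one full scan in which only values >= t are
// offered to the heap, so almost every element costs a single comparison.
static int[] top100WithSampling(int[] a) {
    Random rnd = new Random();
    PriorityQueue<Integer> sample = new PriorityQueue<>(100);
    for (int g = 0; g < 400; g++) {
        int start = rnd.nextInt(a.length - 250);
        for (int i = start; i < start + 250; i++) offer(sample, a[i]);
    }
    int t = sample.peek();                       // 100th largest of the sample
    PriorityQueue<Integer> heap = new PriorityQueue<>(100);
    for (int v : a) {
        if (v >= t) offer(heap, v);              // the true top 100 are all >= t
    }
    int[] out = new int[heap.size()];
    for (int i = 0; i < out.length; i++) out[i] = heap.poll();
    return out;
}

static void offer(PriorityQueue<Integer> heap, int v) {
    if (heap.size() < 100) heap.offer(v);
    else if (v > heap.peek()) { heap.poll(); heap.offer(v); }
}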
I have written up a simple solution in Python in case anyone is interested. It uses the bisect module and a temporary return list which it keeps sorted. This is similar to a priority queue implementation.
import bisect

def kLargest(A, k):
    '''returns list of k largest integers in A'''
    ret = []
    for i, a in enumerate(A):
        # For first k elements, simply construct sorted temp list
        # It is treated similarly to a priority queue
        if i < k:
            bisect.insort(ret, a) # properly inserts a into sorted list ret
        # Iterate over rest of array
        # Replace and update return array when more optimal element is found
        else:
            if a > ret[0]:
                del ret[0] # pop min element off queue
                bisect.insort(ret, a) # properly inserts a into sorted list ret
    return ret
Usage with 100,000,000 elements and worst-case input which is a sorted list:
>>> from so import kLargest >>> kLargest(range(100000000), 100) [99999900, 99999901, 99999902, 99999903, 99999904, 99999905, 99999906, 99999907, 99999908, 99999909, 99999910, 99999911, 99999912, 99999913, 99999914, 99999915, 99999916, 99999917, 99999918, 99999919, 99999920, 99999921, 99999922, 99999923, 99999924, 99999925, 99999926, 99999927, 99999928, 99999929, 99999930, 99999931, 99999932, 99999933, 99999934, 99999935, 99999936, 99999937, 99999938, 99999939, 99999940, 99999941, 99999942, 99999943, 99999944, 99999945, 99999946, 99999947, 99999948, 99999949, 99999950, 99999951, 99999952, 99999953, 99999954, 99999955, 99999956, 99999957, 99999958, 99999959, 99999960, 99999961, 99999962, 99999963, 99999964, 99999965, 99999966, 99999967, 99999968, 99999969, 99999970, 99999971, 99999972, 99999973, 99999974, 99999975, 99999976, 99999977, 99999978, 99999979, 99999980, 99999981, 99999982, 99999983, 99999984, 99999985, 99999986, 99999987, 99999988, 99999989, 99999990, 99999991, 99999992, 99999993, 99999994, 99999995, 99999996, 99999997, 99999998, 99999999]
It took about 40 seconds to calculate this for 100,000,000 elements so I'm scared to do it for 1 billion. To be fair though, I was feeding it the worst-case input (ironically an array that is already sorted).
I see a lot of O(N) discussions, so I propose something different just for the thought exercise.
Is there any known information about the nature of these numbers? If it's random in nature, then go no further and look at the other answers. You won't get any better results than they do.
However! See if whatever list-populating mechanism populated that list in a particular order. Are they in a well-defined pattern where you can know with certainty that the largest magnitude of numbers will be found in a certain region of the list or on a certain interval? There may be a pattern to it. If that is so, for example if they are guaranteed to be in some sort of normal distribution with the characteristic hump in the middle, always have repeating upward trends among defined subsets, have a prolonged spike at some time T in the middle of the data set like perhaps an incidence of insider trading or equipment failure, or maybe just have a "spike" every Nth number as in analysis of forces after a catastrophe, you can reduce the number of records you have to check significantly.
There's some food for thought anyway. Maybe this will help you give future interviewers a thoughtful answer. I know I would be impressed if someone asked me such a question in response to a problem like this – it would tell me that they are thinking of optimization. Just recognize that there may not always be a possibility to optimize.
Time ~ O(100 * N)
Space ~ O(100 + N)

- Create an empty list of 100 empty slots
- For every number in input-list:
  - If the number is smaller than the first one, skip
  - Otherwise replace it with this number
  - Then, push the number through adjacent swaps, until it's smaller than the next one
- Return the list
Note: if log(input-list.size) + c < 100, then the optimal way is to sort the input-list and then take the first 100 items. A sketch of the swap-based scheme follows.
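A compact Java sketch of the swap-based list (the slots array plays the role of the 100-slot list, kept in ascending order; the Integer.MIN_VALUE seed is my assumption and must be below all real data):

// slots[0] is always the smallest of the 100 kept values. A new value overwrites
// slots[0] and is then bubbled to the right by adjacent swaps until order is restored.
static int[] top100BySwaps(int[] input) {
    int[] slots = new int[100];
    java.util.Arrays.fill(slots, Integer.MIN_VALUE);
    for (int v : input) {
        if (v <= slots[0]) continue;                   // smaller than the smallest kept: skip
        slots[0] = v;
        for (int i = 0; i + 1 < 100 && slots[i] > slots[i + 1]; i++) {
            int t = slots[i]; slots[i] = slots[i + 1]; slots[i + 1] = t;
        }
    }
    return slots;                                      // ascending; worst case ~100 swaps per value
}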
The complexity is O(N).
First create an array of 100 ints. Initialize the first element of this array with the first element of the N values, and keep track of the index of the current element with another variable; call it CurrentBig.
Iterate through the N values:
if N[i] > M[CurrentBig] {
    M[CurrentBig]=N[i];  ( overwrite the current value with the newly found larger number)
    CurrentBig++;        ( go to the next position in the M array)
    CurrentBig %= 100;   ( modulo arithmetic saves you from using lists/hashes etc.)
    M[CurrentBig]=N[i];  ( pick up the current value again to use it for the next Iteration of the N array)
}
When done, print the M array starting from CurrentBig, 100 times, modulo 100 :) For the student: make sure that the last line of the code does not trample valid data right before the code exits.
Another O(n) algorithm –
The algorithm finds the largest 100 by elimination
Consider all billion numbers in their binary representation. Start from the most significant bit. Finding whether the MSB is 1 can be done by a boolean operation with an appropriate mask. If there are more than 100 ones in this position among the numbers, eliminate the other numbers (the ones with zeros). Now, among the remaining numbers, proceed with the next most significant bit. Keep a count of the number of remaining numbers after elimination, and proceed as long as this number is greater than 100.
The major boolean operation can be done in parallel on GPUs.
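A hedged single-threaded Java sketch of this bit-by-bit elimination (assuming non-negative 32-bit ints; it adds one refinement the prose glosses over: when fewer than 100 of the remaining numbers have the current bit set, those numbers are all kept and the search continues among the zeros for the remainder):

import java.util.ArrayList;
import java.util.List;

// Radix-style selection: walk the bits from most to least significant, narrowing the
// candidate set until k values are pinned down. Assumes all values are >= 0.
static List<Integer> topKByBits(int[] a, int k) {
    List<Integer> candidates = new ArrayList<>();
    for (int v : a) candidates.add(v);
    List<Integer> keep = new ArrayList<>();              // values already known to be in the top k
    for (int bit = 31; bit >= 0 && keep.size() + candidates.size() > k; bit--) {
        List<Integer> ones = new ArrayList<>(), zeros = new ArrayList<>();
        for (int v : candidates) {
            if (((v >>> bit) & 1) == 1) ones.add(v); else zeros.add(v);
        }
        int need = k - keep.size();
        if (ones.size() >= need) {
            candidates = ones;                           // the top 'need' all have this bit set
        } else {
            keep.addAll(ones);                           // every 'one' beats every 'zero'
            candidates = zeros;                          // find the remainder among the zeros
        }
    }
    // any leftover candidates are equal on all inspected bits, so take what is still needed
    keep.addAll(candidates.subList(0, Math.min(candidates.size(), k - keep.size())));
    return keep;
}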
I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better a Btree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard but not so good for sparse data. As it is your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array assuming it is not sorted to begin with.
You can do it in O(n) time. Just iterate through the list and keep track of the 100 biggest numbers you've seen at any given point, together with the minimum value in that group. When you find a new number bigger than the smallest of your 100, replace it and update your new min of the 100 (it may take a constant time of 100 to determine this each time you do it, but this does not affect the overall analysis).
- Use nth-element to get the 100'th element O(n)
- Iterate the second time but only once and output every element that is greater than this specific element.
Please note especially that the second step might be easy to compute in parallel! And it will also be efficient when you need a million biggest elements.
It's a question from Google or some other industry giant. Maybe the following code is the right answer expected by your interviewer. The time cost and space cost depend on the maximum number in the input array. For a 32-bit int array input, the maximum space cost is 4 * 125M bytes, and the time cost is 5 * billion.
public class TopNumber {
    public static void main(String[] args) {
        final int input[] = {2389,8922,3382,6982,5231,8934
                            ,4322,7922,6892,5224,4829,3829
                            ,6892,6872,4682,6723,8923,3492};
        //One int(4 bytes) hold 32 = 2^5 value,
        //About 4 * 125M Bytes
        //int sort[] = new int[1 << (32 - 5)];
        //Allocate small array for local test
        int sort[] = new int[1000];
        //Set all bit to 0
        for(int index = 0; index < sort.length; index++){
            sort[index] = 0;
        }
        for(int number : input){
            sort[number >>> 5] |= (1 << (number % 32));
        }
        int topNum = 0;
        outer:
        for(int index = sort.length - 1; index >= 0; index--){
            if(0 != sort[index]){
                for(int bit = 31; bit >= 0; bit--){
                    if(0 != (sort[index] & (1 << bit))){
                        System.out.println((index << 5) + bit);
                        topNum++;
                        if(topNum >= 3){
                            break outer;
                        }
                    }
                }
            }
        }
    }
}
I did my own code, not sure if it's what the "interviewer" is looking for.
private static final int MAX=100;

PriorityQueue<Integer> queue = new PriorityQueue<>(MAX);
queue.add(array[0]);
for (int i=1;i<array.length;i++)
{
    if(queue.peek()<array[i])
    {
        if(queue.size() >=MAX)
        {
            queue.poll();
        }
        queue.add(array[i]);
    }
}
Possible improvements.
If the file contains 1 billion numbers, reading it could be really long...
To improve on this you can:
- Split the file into n parts, create n threads, make the n threads each look for the 100 biggest numbers in their part of the file (using the priority queue), and finally get the 100 biggest numbers from all the threads' output (a sketch follows this list).
- Use a cluster for such a task, with a solution like Hadoop. Here you can split the file even further and get the output more quickly for a file of 1 billion (or 10^12) numbers.
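A rough in-memory sketch of the split-and-merge idea from the first bullet (arrays and a fixed thread pool stand in for the file parts; the per-chunk top 100s are merged through one final heap):

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each worker computes the top 100 of its chunk with a min-heap; the chunk results are
// then merged, since the global top 100 is contained in the union of the chunk top 100s.
static int[] top100Parallel(int[] data, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<PriorityQueue<Integer>>> parts = new ArrayList<>();
    int chunk = (data.length + threads - 1) / threads;
    for (int t = 0; t < threads; t++) {
        final int from = t * chunk, to = Math.min(data.length, from + chunk);
        parts.add(pool.submit(() -> {
            PriorityQueue<Integer> h = new PriorityQueue<>(100);
            for (int i = from; i < to; i++) offer(h, data[i]);
            return h;
        }));
    }
    PriorityQueue<Integer> merged = new PriorityQueue<>(100);
    for (Future<PriorityQueue<Integer>> f : parts) {
        for (int v : f.get()) offer(merged, v);          // merge the per-chunk results
    }
    pool.shutdown();
    int[] out = new int[merged.size()];
    for (int i = 0; i < out.length; i++) out[i] = merged.poll();
    return out;
}

static void offer(PriorityQueue<Integer> h, int v) {
    if (h.size() < 100) h.offer(v);
    else if (v > h.peek()) { h.poll(); h.offer(v); }
}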
This code is for finding the N largest numbers in an unsorted array.
#include <iostream>
using namespace std;

#define Array_Size 5 // No Of Largest Numbers To Find
#define BILLION 10000000000

void findLargest(int max[], int array[]);
int checkDup(int temp, int max[]);

int main() {
    int array[BILLION]; // contains data
    int i=0, temp;
    int max[Array_Size];

    findLargest(max,array);
    cout<< "The "<< Array_Size<< " largest numbers in the array are: \n";
    for(i=0; i< Array_Size; i++)
        cout<< max[i] << endl;
    return 0;
}

void findLargest(int max[], int array[])
{
    int i,temp,res;
    for(int k=0; k< Array_Size; k++)
    {
        i=0;
        while(i < BILLION)
        {
            for(int j=0; j< Array_Size ; j++)
            {
                temp = array[i];
                res= checkDup(temp,max);
                if(res == 0 && max[j] < temp)
                    max[j] = temp;
            }
            i++;
        }
    }
}

int checkDup(int temp, int max[])
{
    for(int i=0; i<Array_Size; i++)
    {
        if(max[i] == temp)
            return -1;
    }
    return 0;
}
This might not be the efficient one but gets the job done.
Hope this helps.
I know this might get buried, but here is my idea for a variation on a radix MSD.
pseudo-code:
//billion is the array of 1 billion numbers
int[] billion = getMyBillionNumbers();
//this assumes these are 32-bit integers and we are using hex digits
int[][] mynums = int[8][16];

for number in billion
    putInTop100Array(number)

function putInTop100Array(number){
    //basically if we got past all the digits successfully
    if(number == null)
        return true;
    msdIdx = getMsdIdx(number);
    msd = getMsd(number);
    //check if the idx above where we are is already full
    if(mynums[msdIdx][msd+1] > 99) {
        return false;
    } else if(putInTop100Array(removeMSD(number))){
        mynums[msdIdx][msd]++;
        //we've found 100 digits here, no need to keep looking below where we are
        if(mynums[msdIdx][msd] > 99){
            for(int i = 0; i < mds; i++){
                //making it 101 just so we can tell the difference
                //between numbers where we actually found 101, and
                //where we just set it
                mynums[msdIdx][i] = 101;
            }
        }
        return true;
    }
    return false;
}
The function getMsdIdx(int num) would return the index of the most significant digit (non-zero). The function getMsd(int num) would return the most significant digit. The function removeMSD(int num) would remove the most significant digit from a number and return the number (or return null if there was nothing left after removing the most significant digit).
Once this is done, all that is left is traversing mynums to grab the top 100 digits. This would be something like:
int[] nums = int[100];
int idx = 0;
for(int i = 7; i >= 0; i--){
    int timesAdded = 0;
    for(int j = 16; j >=0 && timesAdded < 100; j--){
        for(int k = mynums[i][j]; k > 0; k--){
            nums[idx] += j;
            timesAdded++;
            idx++;
        }
    }
}
I should note that although the above looks like it has high time complexity, it will really only be around O(7*100).
A quick explanation of what this is trying to do: Essentially this system is trying to use every digit in a 2d-array based upon the index of the digit in the number, and the digit's value. It uses these as indexes to keep track of how many numbers of that value have been inserted in the array. When 100 has been reached, it closes off all "lower branches".
The time of this algorithm is something like O(billion*log(16)*7) + O(100). I could be wrong about that. Also, it is very likely this needs debugging, as it is kinda complex and I just wrote it off the top of my head.
EDIT: Downvotes without explanation are not helpful. If you think this answer is incorrect, please leave a comment why. Pretty sure that StackOverflow even tells you to do so when you downvote.
Managing a separate list is extra work and you have to move things around the whole list every time you find another replacement. Just qsort it and take the top 100.
Problem: Find m largest elements of n items where n >>> m
The simplest solution, that should be obvious to everyone is to simply do m passes of the bubble sort algorithm.
Then print out the last m elements of the array.
This requires no external data structures, and uses an algorithm that everyone knows.
The running time estimate is O(m*n). The best answers so far are O(n log(m)), so this solution is not significantly more expensive for small m.
I'm not saying this couldn't be improved, but this is by far the simplest solution.
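For reference, a literal Java rendering of the m-pass idea (after m passes the m largest values sit at the end of the array):

// m passes of bubble sort: after pass p, the p largest values occupy the last p
// positions of the array, so the answer is a[n - m] .. a[n - 1] in ascending order.
static void bubbleTopM(int[] a, int m) {
    int n = a.length;
    for (int pass = 0; pass < m; pass++) {
        for (int i = 0; i < n - 1 - pass; i++) {
            if (a[i] > a[i + 1]) {                 // push the larger value toward the end
                int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
            }
        }
    }
}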