MySQL查询以获得有限的许多查询的“交集”
假设我有一个单一的mySQL表(用户)与以下字段:
userid gender region age ethnicity income
我希望能够根据用户input的数字返回总logging数。 此外,他们也将提供额外的标准。
最简单的例子是,他们可能会要求1,000条logging,其中600条logging应该有性别=“男性”,400条logging的性别=“女性”。 这很简单。
现在,更进一步。 假设他们现在想要指定区域:
GENDER Male: 600 records Female: 400 records REGION North: 100 records South: 200 records East: 300 records West: 400 records
再次,只有1000条logging应该返回,但最后必须有600名男性,400名女性,100名北方人,200名南方人,300名东方人和400名西方人。
我知道这是不正确的语法,但使用伪mySQL代码,它希望说明我正在尝试做什么:
(SELECT * FROM users WHERE gender = 'Male' LIMIT 600 UNION SELECT * FROM users WHERE gender = 'Female' LIMIT 400) INTERSECT (SELECT * FROM users WHERE region = 'North' LIMIT 100 UNION SELECT * FROM users WHERE region = 'South' LIMIT 200 UNION SELECT * FROM users WHERE region = 'East' LIMIT 300 UNION SELECT * FROM users WHERE region = 'West' LIMIT 400)
请注意,我不是在寻找一次性查询。 根据用户的input,每个标准内的logging总数和logging数量将不断变化。 所以,我试图想出一个可以反复使用的通用解决scheme,而不是一个硬编码的解决scheme。
为了使事情更复杂,现在添加更多的标准。 也可能有年龄,种族和收入,每个组别都有自己的logging数,附加的代码附加在上面:
INTERSECT (SELECT * FROM users WHERE age >= 18 and age <= 24 LIMIT 300 UNION SELECT * FROM users WHERE age >= 25 and age <= 36 LIMIT 200 UNION SELECT * FROM users WHERE age >= 37 and age <= 54 LIMIT 200 UNION SELECT * FROM users WHERE age >= 55 LIMIT 300) INTERSECT etc.
我不确定是否可以在一个查询中写入,或者这需要多个语句和迭代。
压扁你的标准
您可以将多维标准压缩到单个标准
现在这个标准可以在一个查询中实现如下
(SELECT * FROM users WHERE gender = 'Male' AND region = 'North' LIMIT 40) UNION ALL (SELECT * FROM users WHERE gender = 'Male' AND region = 'South' LIMIT 80) UNION ALL (SELECT * FROM users WHERE gender = 'Male' AND region = 'East' LIMIT 120) UNION ALL (SELECT * FROM users WHERE gender = 'Male' AND region = 'West' LIMIT 160) UNION ALL (SELECT * FROM users WHERE gender = 'Female' AND region = 'North' LIMIT 60) UNION ALL (SELECT * FROM users WHERE gender = 'Female' AND region = 'South' LIMIT 120) UNION ALL (SELECT * FROM users WHERE gender = 'Female' AND region = 'East' LIMIT 180) UNION ALL (SELECT * FROM users WHERE gender = 'Female' AND region = 'West' LIMIT 240)
问题
- 它并不总是返回正确的结果。 例如,如果男性和北方的用户less于40个,那么查询将返回less于1,000条logging。
调整您的标准
假设有不到40名用户是男性和北部的用户。 然后,您需要调整其他标准数量以涵盖“男性”和“北方”的缺失数量。 我相信这是不可能与裸露的SQL。 这是我想到的伪代码。 为了简化,我想我们只会询问男性,女性,北方和南方
conditions.add({ gender: 'Male', region: 'North', limit: 40 }) conditions.add({ gender: 'Male', region: 'South', limit: 80 }) conditions.add({ gender: 'Female', region: 'North', limit: 60 }) conditions.add({ gender: 'Female', region: 'South', limit: 120 }) foreach(conditions as condition) { temp = getResultFromDatabaseByCondition(condition) conditions.remove(condition) // there is not enough result for this condition, // increase other condition quantity if (temp.length < condition.limit) { adjust(...); } }
比方说,只有30个北方男性。 所以我们需要调整+10男,+10北调。
To Adjust --------------------------------------------------- Male +10 North +10 Remain Conditions ---------------------------------------------------- { gender: 'Male', region: 'South', limit: 80 } { gender: 'Female', region: 'North', limit: 60 } { gender: 'Female', region: 'South', limit: 120 }
“男”+“南”是符合“男”调整条件的第一个条件。 增加+10,并从“剩余条件”列表中删除。 因为,我们增加南方,我们需要在其他情况下减less。 所以在“调整”列表中添加“南”条件
To Adjust --------------------------------------------------- South -10 North +10 Remain Conditions ---------------------------------------------------- { gender: 'Female', region: 'North', limit: 60 } { gender: 'Female', region: 'South', limit: 120 } Final Conditions ---------------------------------------------------- { gender: 'Male', region: 'South', limit: 90 }
find符合“南”的条件并重复相同的过程。
To Adjust --------------------------------------------------- Female +10 North +10 Remain Conditions ---------------------------------------------------- { gender: 'Female', region: 'North', limit: 60 } Final Conditions ---------------------------------------------------- { gender: 'Female', region: 'South', limit: 110 } { gender: 'Male', region: 'South', limit: 90 }
最后
{ gender: 'Female', region: 'North', limit: 70 } { gender: 'Female', region: 'South', limit: 110 } { gender: 'Male', region: 'South', limit: 90 }
我还没有拿出调整的具体实施。 这比我预料的更困难。 我会更新一次,我可以弄清楚如何实现它。
您所描述的问题是一个多维build模问题。 尤其是,您正试图同时在多个维度上获得分层样本。 关键是要降到最小的粒度,并从那里build立样本。
我进一步假设你希望样本在各个层面都具有代表性。 也就是说,你不希望所有来自“北方”的用户都是女性。 或者所有的“男性”来自“西方”,即使这样做符合最终的标准。
首先考虑每个维度的logging,维度和分配总数。 例如,对于第一个样本,可以认为是:
- 1000条logging
- 2个维度:性别,地区
- 性别分割:60%,40%
- 区域分割:10%,20%,30%,40%
然后,你想分配这些数字到每个性别/地区的组合。 数字是:
- 北,男:60
- 北,女:40
- 南,男:120
- 南,女:80
- 东,男:180
- 东,女:120
- 西,男:240
- 西,女:160
你会看到这些在维度上加起来。
计算每个单元格中的数字是相当容易的。 这是百分比乘以总数的乘积。 所以,“东,女”是30%* 40%* 1000。 。 。 瞧! 值是120。
这是解决scheme:
- 以每个维度的input作为总数的百分比 。 并确保它们在每个维度上加起来达到100%。
- 为每个单元创build一个预期百分比的表格。 这是每个维度百分比的乘积。
- 将预期百分比乘以总数。
- 最后的查询概述如下。
假设您有一个具有预期计数和原始数据( users
)的表格cells
。
select enumerated.* from (select u.*, (@rn := if(@dims = concat_ws(':', dim1, dim2, dim3), @rn + 1, if(@dims := concat_ws(':', dim1, dim2, dim3), 1, 1) ) ) as seqnum from users u cross join (select @dims = '', @rn := '') vars order by dim1, dim2, dim3, rand() ) enumerated join cells on enumerated.dims = cells.dims where enuemrated.seqnum <= cells.expectedcount;
请注意,这是解决scheme的草图。 你必须填写有关尺寸的细节。
只要你有足够的数据为所有单元格,这将工作。
在实践中,当进行这种多维分层采样时,确实存在细胞空白或过小的风险。 发生这种情况时,通常可以通过一个额外的通行证来解决这个问题。 从足够大的细胞中取出你能做的。 这些通常占所需数据的大部分。 然后添加logging以符合最终计数。 要添加的logging是那些值与最需要的维度相匹配的logging。 但是,这个解决scheme只是假定有足够的数据来满足您的标准。
您的请求存在的问题是,可以使用大量的选项来实现build议的数字:
Male Female Sum ----------------------------- North: 100 0 100 South: 200 0 200 East: 300 0 300 West: 0 400 400 Sum: 600 400 ----------------------------- North: 99 1 100 South: 200 0 200 East: 300 0 300 West: 1 399 400 Sum: 600 400 ----------------------------- .... ----------------------------- North: 0 100 100 South: 200 0 200 East: 0 300 300 West: 400 0 400 Sum: 600 400
只要把北,东,西结合起来(南方总是男性:200),就可以得到400个可能的数字。 如果每个“ class ”(男/北=“ class ”)的logging数量有限,情况会变得更加复杂。
上表中的每个单元格可能需要最多MIN(COUNT(gender), COUNT(location))
logging(对于其对应部分为零的情况)。
这取决于:
Male Female --------------------- North: 100 100 South: 200 200 East: 300 300 West: 400 400
因此,您需要计算每个性别/位置对AVAILABLE(gender, location)
可用logging。
find特定的拟合似乎接近semimagic广场 [1] [2] 。
math.stackexchange.com上有几个关于这个[3] [4]的问题 。
我已经读了一些关于如何构build这些文件,我怀疑这是可能的一个select做到这一点。
如果你有足够的logging,并不会在这样的情况下结束:
Male Female --------------------- North: 100 0 South: 200 200 East: 300 0 West: 200 200
我会去迭代槽的位置,并在每一步添加比例的男性/女性数量:
- 男:100(16%); F:0(0%)
- 男:100(16%); F:200(50%)
- 男:400(66%); F:200(50%)
- M:600(100%); F:400(100%)
但是这只会给你一个近似的结果,在validation那些你可能想重复几次的结果之后,在每个类别中调整计数“ 足够好 ”。
我会build立一个数据库的分布图,并用它来实现采样逻辑。 奖金包括向用户添加快速人口统计反馈的可能性,并且不会给服务器带来额外的负担。 另一方面,你需要实现一个机制来保持数据库和地图的同步。
它可以像这样使用JSON:
{"gender":{ "Male":{ "amount":35600, "region":{ "North":{ "amount":25000, "age":{ "18":{ "amount":2400, "ethnicity":{ ... "income":{ ... } }, "income":{ ... "ethnicity":{ ... } } }, "19":{ ... }, ... "120":{ ... } }, "ethnicity":{ ... }, "income":{ ... } }, "South":{ ... }, ... } "age":{ ... } "ethnicity":{ ... }, "income":{ ... } }, "Female":{ ... } }, "region":{ ... }, "age":{ ... }, "ethnicity":{ ... }, "income":{ ... }}
所以用户select
total 1000 600 Male 400 Female 100 North 200 South 300 East 400 West 300 <20 years old 300 21-29 years old 400 >=30 years old
计算线性分布:
male-north-u20: 1000*0.6*0.1*0.3=18 male-north-21to29: 18 male-north-o29: 24 (keep a track of rounding errors) etc
那么我们将检查地图:
tmp.male.north.u20=getSumUnder(JSON.gender.Male.region.North.age,20) // == 10 tmp.male.north.f21to29=getSumBetween(JSON.gender.Male.region.North.age,21,29) // == 29 tmp.male.north.o29=getSumOver(JSON.gender.Male.region.north.age,29) // == 200 etc
标记符合线性分布的所有东西,并logging剩余。 如果(比如male.north.u20)低于父母的第一个调整值(为了确保male.north例如符合标准),那么在u20中缺less8,在f21to29中缺less8。 第一次运行后,在其他地区调整每个缺失的标准。 就像tmp.male.south.u20+=8;tmp.male.south.f21to29-=8;
。
正确的做法非常繁琐。
最后你有正确的分布,可以用来构造一个微不足道的SQL查询。
这可以分两步解决。 我将介绍如何以性别和地区为维度的例子。 那么我将会描述更一般的情况。 在第一步中,我们求解一个8variables方程组,然后我们取8个select语句的不相交联合,这个语句受到第一步中find的解的限制。 请注意,任何行只有8种可能性。 他们可以是男性或女性,然后该地区是北,南,东或西的一个。 现在让我们,
X1 equal the number of rows that are male and from the north, X2 equal the number of rows that are male and from the south, X3 equal the number of rows that are male and from the east, X4 equal then number that are male and from the west X5 equal the number of rows that are female and from the north, X6 equal the number of rows that are female and from the south, X7 equal the number of rows that are female and from the east, X8 equal then number that are female and from the west
方程式是:
X1+X2+X3+X4=600 X5+X6+X7+X8=400 X1+X5=100 X2+X6=200 X3+X7=300 X4+X8=400
现在求解上面的X1,X2,… X8。 有很多解决scheme(我会在一会儿描述如何解决)这里是一个解决scheme:
X1=60, X2=120, X3=180,X4=240,X5=40,X6=80,X7=120,X8=160.
现在我们可以通过8个select的简单联合来得到结果:
(select * from user where gender='m' and region="north" limit 60) union distinct(select * from user where gender='m' and region='south' limit 120) union distinct(select * from user where gender='m' and region='east' limit 180) union distinct(select * from user where gender='m' and region='west' limit 240) union distinct(select * from user where gender='f' and region='north' limit 40) union distinct(select * from user where gender='f' and region='south' limit 80) union distinct(select * from user where gender='f' and region='east' limit 120) union distinct(select * from user where gender='f' and region='west' limit 160);
注意,如果数据库中没有60行满足上面的第一个select,则给定的特定解决scheme将不起作用。 所以我们必须添加其他约束,LT:
0<X1 <= (select count(*) from user where from user where gender='m' and region="north") 0<X2 <= (select count(*) from user where gender='m' and region='south') 0<X3 <= (select count(*) from user where gender='m' and region='east' ) 0<X4 <= (select count(*) from user where gender='m' and region='west') 0<X5 <= (select count(*) from user where gender='f' and region='north' ) 0<X6 <= (select count(*) from user where gender='f' and region='south') 0<X7 <= (select count(*) from user where gender='f' and region='east' ) 0<X8 <= (select count(*) from user where gender='f' and region='west');
现在我们来概括一下这种情况下是否允许分割。 方程是E:
X1+X2+X3+X4=n1 X5+X6+X7+X8=n2 X1+X5=m1 X2+X6=m2 X3+X7=m3 X4+X8=m4
给出了数字n1,n2,m1,m2,m3,m4,满足n1 + n2 =(m1 + m2 + m3 + m4)。 所以我们把问题简化为求解上面的方程LT和E. 这只是一个线性规划问题,可以使用单纯形法或其他方法解决。 另一种可能性是将其视为线性不定方程组的一个系统,并使用它来find解的方法。 在任何情况下,我已经减less了问题find上述方程的解决scheme。 (考虑到方程式是一种特殊的forms,那么使用单纯形法或者求解一个线性不定方程的系统可能会有一个更快的方法)。一旦我们求解Xi,最终的解决scheme是:
(select * from user where gender='m' and region="north" limit :X1) union distinct(select * from user where gender='m' and region='south' limit :X2) union distinct(select * from user where gender='m' and region='east' limit :X3) union distinct(select * from user where gender='m' and region='west' limit :X4) union distinct(select * from user where gender='f' and region='north' limit :X5) union distinct(select * from user where gender='f' and region='south' limit :X6) union distinct(select * from user where gender='f' and region='east' limit :X7) union distinct(select * from user where gender='f' and region='west' limit :X8);
让我们用D表示一个具有n个可能性的维数D. 假设您有D1:n1,D2:n2,… DM:nM维度。 这将产生n1 * n2 * … nMvariables。 生成的方程的数量是n1 + n2 + … nM。 而是定义一般的方法让我们另外一个三维,四维和二维的情况; 可以将D1的可能值设为d11,d12,d13,D2为d21,d22,d23,d24,D3的值为d31,d32。 我们将有24个variables,方程是:
X1 + X2 + ...X8=n11 X9 + X10 + ..X16=n12 X17+X18 + ...X24=n13 X1+X2+X9+x10+x17+x18=n21 X3+X4+X11+x12+x19+x20=n22 X5+X6+X13+x14+x21+x22=n23 X7+X8+X15+x116+x23+x24=n24 X1+X3+X5+...X23=n31 X2+X4+......X24=n32
哪里
X1 equals number with D1=d11 and D2=d21 and D3=d31 X2 equals number with D1=d11 and D2=d21 and D3 = d31 .... X24 equals number with D1=D13 and D2=d24, and D3=d32.
添加less于约束。 然后求解X1,X2,… X24。 创build24select语句并采取不相交的联合。 我们可以解决任何尺寸的问题。
因此,总而言之:给定维数D1:n1,D2:n2,… DM:nM,我们可以解决相应的线性规划问题,如上所述的n1 * n2 * … nMvariables,然后通过取不相交n1 * n2 * … nMselect语句的联合。 所以是的,我们可以通过select语句产生一个解决scheme,但是首先我们必须通过获得n1 * n2 * … nM个variables中的每一个来计算方程和确定极限。
即使赏金已经结束,我将为你感兴趣的人增加一点。 我在这里声称,如果有解决scheme,我已经完全展示了如何解决这个问题。
澄清我的方法。 在三维的情况下,可以说我们将年龄分成三种可能性之一。 然后在问题中充分利用性别和地区。 每个用户有24种不同的可能性,对应于他们在这些类别中的位置。 让Xi是最终结果中每个可能性的数量。 让我写一个matrix,其中每一行代表每种可能性之一。 每个用户将贡献至多1到m或f,1到北,南,东或西,1到年龄类别。 而用户只有24种可能性。 让我们给出一个matrix:(abc)3个年龄段,(nsew)个地区和(mf)个男性或女性:a是年龄小于或等于10岁,b年龄在11岁到30岁之间,c年龄在31岁到50岁之间。
abc nsew mf X1 100 1000 10 X2 100 1000 01 X3 100 0100 10 X4 100 0100 01 X5 100 0010 10 X6 100 0010 01 X7 100 0001 10 X8 100 0001 01 X9 010 1000 10 X10 010 1000 01 X11 010 0100 10 X12 010 0100 01 X13 010 0010 10 X14 010 0010 01 X15 010 0001 10 X16 010 0001 01 X17 001 1000 10 X18 001 1000 01 X19 001 0100 10 X20 001 0100 01 X21 001 0010 10 X22 001 0010 01 X23 001 0001 10 X24 001 0001 01
每一行表示一个用户,如果它对结果有贡献,则列中有一个1。 例如,第一行显示1表示a,1表示n,1表示m。 这意味着用户的年龄小于或等于10岁,是北方的男性。 Xi表示最终结果中有多less行。 因此,让我们说X1是10,这意味着我们说最终的结果有10个结果,所有这些结果都是从北方来的,都是男性,而且都小于或等于10.好吧,现在我们只需要添加一些东西。 请注意,前8个X1+X2+X3+X4+X5+X6+X7+X8
是年龄小于或等于10的所有行。它们必须加起来为我们所select的那个类别。 对于接下来的2套8也是一样。
到目前为止,我们得到方程:(na是年龄小于10的数字,年龄在10到20之间的数字,n是年龄小于50的数字
X1+X2+X3+X4+X5+X6+X7+X8 = na X9+X10+X11 + .... X16 = nb X17+X18+X19+... X24=nc
那是年龄分裂。 现在让我们看看区域拆分。 只需在“n”列中加上variables,
X1+X2+X9+X10+X17+X18 = nn X3+X4+X11+X12+X19+20=ns ...
等等你知道我是如何通过向下看列来获得这些方程吗? 继续新闻和制作。 总共给出3 + 4 + 2个等式。 所以我在这里做的很简单。 我已经推断,你挑选的任何一行对三个维度中的每一个都有贡献,并且只有24个可能性。 然后让Xi为每个可能性的数字,并得到需要解决的方程。 在我看来,无论你提出什么方法,都必须解决这些方程。 换句话说,我只是用解这些方程来重新expression这个问题。
现在我们想要一个整数解决scheme,因为我们不能有一个小数行。 注意这些都是线性方程。 但我们想要一个整数解决scheme。 以下是一篇论文如何解决这些问题的链接: https : //www.math.uwaterloo.ca/~wgilbert/Research/GilbertPathria.pdf
在SQL中形成业务逻辑从来不是一个好主意,因为它会阻碍即使是微小变化的吸收能力。
我的build议是在ORM中执行此操作,并将业务逻辑从SQL中抽象出来。
例如,如果您使用的是Django :
你的模型看起来像:
class User(models.Model): GENDER_CHOICES = ( ('M', 'Male'), ('F','Female') ) gender = models.CharField(max_length=1, choices=GENDER_CHOICES) REGION_CHOICES = ( ('E', 'East'), ('W','West'), ('N','North'), ('S','South') ) region = models.CharField(max_length=1, choices=REGION_CHOICES) age = models.IntegerField() ETHNICITY_CHOICES = ( ....... ) ethnicity = models.CharField(max_length=1, choices=ETHNICITY_CHOICES) income = models.FloatField()
而你的查询function可能是这样的:
# gender_limits is a dict like {'M':400, 'F':600} # region_limits is a dict like {'N':100, 'E':200, 'W':300, 'S':400} def get_users_by_gender_and_region(gender_limits,region_limits): for gender in gender_limits: gender_queryset = gender_queryset | User.objects.filter(gender=gender)[:gender_limits[gender]] for region in region_limits: region_queryset = region_queryset | User.objects.filter(region=region)[:region_limits[region]] return gender_queryset & region_queryset
查询function可以进一步抽象,你知道所有你打算支持的查询,但这应该作为一个例子。
如果你使用的是不同的ORM,同样的想法也可以被翻译成任何好的ORM都会有联合和交叉抽象。
我将使用一种编程语言来生成SQL语句,但下面是纯mySQL的一个解决scheme。 一个假设是:在一个地区总有足够的男性/女性来适应这个数字(例如,如果没有女性生活在北方呢?)。
该例程是预先计算所需的行数量。 限制不能用variables指定。 我更像是一个有分析function的oracle人。 MySQL通过允许variables也提供了一些扩展。 所以我设定了目标地区和性别,并计算了细目。 然后我使用计算来限制输出。
这个查询显示计数来certificate这个概念。
set @male=600; set @female=400; set @north=100; set @south=200; set @east=300; set @west=400; set @north_male=@north*(@male/(@male+@female)); set @south_male=@south*(@male/(@male+@female)); set @east_male =@east *(@male/(@male+@female)); set @west_male =@west *(@male/(@male+@female)); set @north_female=@north*(@female/(@male+@female)); set @south_female=@south*(@female/(@male+@female)); set @east_female =@east *(@female/(@male+@female)); set @west_female =@west *(@female/(@male+@female)); select gender, region, count(*) from ( select * from (select @north_male :=@north_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0 union all select * from (select @south_male :=@south_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0 union all select * from (select @east_male :=@east_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'East' ) me where row>=0 union all select * from (select @west_male :=@west_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'West' ) mw where row>=0 union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0 union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0 union all select * from (select @east_female :=@east_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'East' ) fe where row>=0 union all select * from (select @west_female :=@west_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'West' ) fw where row>=0 ) a group by gender, region order by gender, region;
输出:
Female East 120 Female North 40 Female South 80 Female West 160 Male East 180 Male North 60 Male South 120 Male West 240
Remove the outer part to get the real records:
set @male=600; set @female=400; set @north=100; set @south=200; set @east=300; set @west=400; set @north_male=@north*(@male/(@male+@female)); set @south_male=@south*(@male/(@male+@female)); set @east_male =@east *(@male/(@male+@female)); set @west_male =@west *(@male/(@male+@female)); set @north_female=@north*(@female/(@male+@female)); set @south_female=@south*(@female/(@male+@female)); set @east_female =@east *(@female/(@male+@female)); set @west_female =@west *(@female/(@male+@female)); select * from (select @north_male :=@north_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0 union all select * from (select @south_male :=@south_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0 union all select * from (select @east_male :=@east_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'East' ) me where row>=0 union all select * from (select @west_male :=@west_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'West' ) mw where row>=0 union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0 union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0 union all select * from (select @east_female :=@east_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'East' ) fe where row>=0 union all select * from (select @west_female :=@west_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'West' ) fw where row>=0 ;
For testing I have written a procedure which does create 10000 sample records fully random:
use test; drop table if exists users; create table users (userid int not null auto_increment, gender VARCHAR (20), region varchar(20), primary key (userid) ); drop procedure if exists load_users_table; delimiter # create procedure load_users_table() begin declare l_max int unsigned default 10000; declare l_cnt int unsigned default 0; declare l_gender varchar(20); declare l_region varchar(20); declare l_rnd smallint; truncate table users; start transaction; WHILE l_cnt < l_max DO set l_rnd = floor( 0 + (rand()*2) ); if l_rnd = 0 then set l_gender = 'Male'; else set l_gender = 'Female'; end if; set l_rnd=floor(0+(rand()*4)); if l_rnd = 0 then set l_region = 'North'; elseif l_rnd=1 then set l_region = 'South'; elseif l_rnd=2 then set l_region = 'East'; elseif l_rnd=3 then set l_region = 'West'; end if; insert into users (gender, region) values (l_gender, l_region); set l_cnt=l_cnt+1; end while; commit; end # delimiter ; call load_users_table(); select gender, region, count(*) from users group by gender, region order by gender, region;
Hope this all helps you. The bottom line is: Use a UNION ALL
and restrict with pre-calculated variables not LIMIT
.
Well, I think the question is about randomly getting the records and not in the proportion of 60/40 for all regions. I have done for Region and Gender. It can be generalized to other fields like age, income and ethnicity in the same way.
Declare @Mlimit bigint Declare @Flimit bigint Declare @Northlimit bigint Declare @Southlimit bigint Declare @Eastlimit bigint Declare @Westlimit bigint Set @Mlimit= 600 Set @Flimit=400 Set @Northlimit= 100 Set @Southlimit=200 Set @Eastlimit=300 Set @Westlimit=400 CREATE TABLE #Users( [UserId] [int] NOT NULL, [gender] [varchar](10) NULL, [region] [varchar](10) NULL, [age] [int] NULL, [ethnicity] [varchar](50) NULL, [income] [bigint] NULL ) Declare @MnorthCnt bigint Declare @MsouthCnt bigint Declare @MeastCnt bigint Declare @MwestCnt bigint Declare @FnorthCnt bigint Declare @FsouthCnt bigint Declare @FeastCnt bigint Declare @FwestCnt bigint Select @MnorthCnt=COUNT(*) from users where gender='male' and region='north' Select @FnorthCnt=COUNT(*) from users where gender='female' and region='north' Select @MsouthCnt=COUNT(*) from users where gender='male' and region='south' Select @FsouthCnt=COUNT(*) from users where gender='female' and region='south' Select @MeastCnt=COUNT(*) from users where gender='male' and region='east' Select @FeastCnt=COUNT(*) from users where gender='female' and region='east' Select @MwestCnt=COUNT(*) from users where gender='male' and region='west' Select @FwestCnt=COUNT(*) from users where gender='female' and region='west' If (@Northlimit=@MnorthCnt+@FnorthCnt) begin Insert into #Users select * from Users where region='north' set @Northlimit=0 set @Mlimit-=@MnorthCnt set @Flimit-=@FnorthCnt set @MnorthCnt=0 set @FnorthCnt=0 end If (@Southlimit=@MSouthCnt+@FSouthCnt) begin Insert into #Users select * from Users where region='South' set @Southlimit=0 set @Mlimit-=@MSouthCnt set @Flimit-=@FSouthCnt set @MsouthCnt=0 set @FsouthCnt=0 end If (@Eastlimit=@MEastCnt+@FEastCnt) begin Insert into #Users select * from Users where region='East' set @Eastlimit=0 set @Mlimit-=@MEastCnt set @Flimit-=@FEastCnt set @MeastCnt=0 set @FeastCnt=0 end If (@Westlimit=@MWestCnt+@FWestCnt) begin Insert into #Users select * from Users where region='West' set @Westlimit=0 set @Mlimit-=@MWestCnt set @Flimit-=@FWestCnt set @MwestCnt=0 set @FwestCnt=0 end If @MnorthCnt<@Northlimit Begin insert into #Users select top (@Northlimit-@MnorthCnt) * from Users where gender='female' and region='north' and userid not in (select userid from #users) set @Flimit-=(@Northlimit-@MnorthCnt) set @FNorthCnt-=(@Northlimit-@MnorthCnt) set @Northlimit-=(@Northlimit-@MnorthCnt) End If @FnorthCnt<@Northlimit Begin insert into #Users select top (@Northlimit-@FnorthCnt) * from Users where gender='male' and region='north' and userid not in (select userid from #users) set @Mlimit-=(@Northlimit-@FnorthCnt) set @MNorthCnt-=(@Northlimit-@FnorthCnt) set @Northlimit-=(@Northlimit-@FnorthCnt) End if @MsouthCnt<@southlimit Begin insert into #Users select top (@southlimit-@MsouthCnt) * from Users where gender='female' and region='south' and userid not in (select userid from #users) set @Flimit-=(@southlimit-@MsouthCnt) set @FSouthCnt-=(@southlimit-@MsouthCnt) set @southlimit-=(@southlimit-@MsouthCnt) End if @FsouthCnt<@southlimit Begin insert into #Users select top (@southlimit-@FsouthCnt) * from Users where gender='male' and region='south' and userid not in (select userid from #users) set @Mlimit-=(@southlimit-@FsouthCnt) set @MSouthCnt-=(@southlimit-@FsouthCnt) set @southlimit-=(@southlimit-@FsouthCnt) End if @MeastCnt<@eastlimit Begin insert into #Users select top (@eastlimit-@MeastCnt) * from Users where gender='female' and region='east' and userid not in (select userid from #users) set @Flimit-=(@eastlimit-@MeastCnt) set @FEastCnt-=(@eastlimit-@MeastCnt) set @eastlimit-=(@eastlimit-@MeastCnt) End if @FeastCnt<@eastlimit Begin insert into #Users select top (@eastlimit-@FeastCnt) * from Users where gender='male' and region='east' and userid not in (select userid from #users) set @Mlimit-=(@eastlimit-@FeastCnt) set @MEastCnt-=(@eastlimit-@FeastCnt) set @eastlimit-=(@eastlimit-@FeastCnt) End if @MwestCnt<@westlimit Begin insert into #Users select top (@westlimit-@MwestCnt) * from Users where gender='female' and region='west' and userid not in (select userid from #users) set @Flimit-=(@westlimit-@MwestCnt) set @FWestCnt-=(@westlimit-@MwestCnt) set @westlimit-=(@westlimit-@MwestCnt) End if @FwestCnt<@westlimit Begin insert into #Users select top (@westlimit-@FwestCnt) * from Users where gender='male' and region='west' and userid not in (select userid from #users) set @Mlimit-=(@westlimit-@FwestCnt) set @MWestCnt-=(@westlimit-@FwestCnt) set @westlimit-=(@westlimit-@FwestCnt) End IF (@MnorthCnt>=@Northlimit and @FnorthCnt>=@Northlimit and @MsouthCnt>=@southlimit and @FsouthCnt>=@southlimit and @MeastCnt>=@eastlimit and @FeastCnt>=@eastlimit and @MwestCnt>=@westlimit and @FwestCnt>=@westlimit and not(@Mlimit=0 and @Flimit=0)) Begin ---Create Cursor DECLARE UC CURSOR FAST_forward FOR SELECT * FROM Users where userid not in (select userid from #users) Declare @UserId [int] , @gender [varchar](10) , @region [varchar](10) , @age [int] , @ethnicity [varchar](50) , @income [bigint] OPEN UC FETCH NEXT FROM UC INTO @UserId ,@gender, @region, @age, @ethnicity, @income WHILE @@FETCH_STATUS = 0 and not (@Mlimit=0 and @Flimit=0) BEGIN If @gender='male' and @region='north' and @Northlimit>0 AND @Mlimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @Mlimit-=1 set @MNorthCnt-=1 set @Northlimit-=1 end If @gender='male' and @region='south' and @southlimit>0 AND @Mlimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @Mlimit-=1 set @MsouthCnt-=1 set @Southlimit-=1 end If @gender='male' and @region='east' and @eastlimit>0 AND @Mlimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @Mlimit-=1 set @MeastCnt-=1 set @eastlimit-=1 end If @gender='male' and @region='west' and @westlimit>0 AND @Mlimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @Mlimit-=1 set @MwestCnt-=1 set @westlimit-=1 end If @gender='female' and @region='north' and @Northlimit>0 AND @flimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @Flimit-=1 set @FNorthCnt-=1 set @Northlimit-=1 end If @gender='female' and @region='south' and @southlimit>0 AND @flimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @Flimit-=1 set @FsouthCnt-=1 set @Southlimit-=1 end If @gender='female' and @region='east' and @eastlimit>0 AND @flimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @flimit-=1 set @feastCnt-=1 set @eastlimit-=1 end If @gender='female' and @region='west' and @westlimit>0 AND @flimit>0 begin insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income) set @flimit-=1 set @fwestCnt-=1 set @westlimit-=1 end FETCH NEXT FROM UC INTO @UserId ,@gender, @region, @age, @ethnicity, @income END CLOSE UC DEALLOCATE UC end Select * from #Users SELECT GENDER, REGION, COUNT(*) AS COUNT FROM #USERS GROUP BY GENDER, REGION DROP TABLE #Users
I expect you'd want to generate a bunch of queries based on the required filters.
I'll explain a possible approach, with a full code sample – but note the caveats later on.
I'll also address the issue where you can't fulfil the requested sample from a proportional distribution, but you can from an adjusted distribution – and explain how to do that adjustment
The basic algorithm goes like this:
Start with a set of filters {F1, F2, ... Fn}
, each which has a group of values, and percentages which should be distributed amongst those values. For example F1 might be gender, with 2 values (F1V1 = Male: 60%, F1V2 = Female: 40%) You'll also want the total sample size required (call this X
) From this starting point you can then combine all the filters items from each filter to get a single set all of the combined filter items, and the quantities required for each. The code should be able to handle any number of filters, with any number of values (either exact values, or ranges)
EG: suppose 2 filters, F1: gender, {F1V1 = Male: 60%, F1V2 = Female: 40%}, F2: region, {F2V1 = North: 50%, F2V2 = South: 50%} and a total sample required of X = 10 people.
In this sample we'd like 6 of them to be male, 4 of them to be female, 5 to be from the north, and 5 to be from the south.
Then we do
- Create an sql stub for each value in F1 – with an associated fraction of the initial percentage (ie
-
WHERE gender = 'Male'
: 0.6, -
WHERE gender = 'Female'
: 0.4 )
-
- For each item in F2 – create a new sql stub from every item from the step above – with the filter now being both the F1 Value & the F2 Value, and the associated fraction being the product of the 2 fractions. So we now have 2 x 2 = 4 items of
-
WHERE gender = 'Male' AND region = 'North'
: 0.6 * 0.5 = 0.3, -
WHERE gender = 'Female' AND region = 'North'
: 0.4 * 0.5 = 0.2, -
WHERE gender = 'Male' AND region = 'South'
: 0.6*0.5 = 0.3, -
WHERE gender = 'Female' AND region = 'South'
: 0.4*0.5 = 0.2
-
- Repeat step 2 above for every additional Filter F3 to Fn. (in our example there were only 2 filters, so we are already done)
- Calculate the limit for each SQL stub as being [fraction associated with stub] * X = total required sample size (so for our example thats 0.3 * 10 = 3 for Male/North, 0.2 * 10 = 2 for Female/North etc)
- Finally for every sql stub – turn it into a complete SQL statement , and add the limit
代码示例
I'll provide C# code for this, but it should be easy enough to translate this to other languages. It would be pretty tricky to attempt this in pure dynamic SQL
Note this is untested – and probably full of errors – but its an idea of the approach you could take.
I've defined a public method and a public class – which would be the entry point.
// This is an example of a public class you could use to hold one of your filters // For example - if you wanted 60% male / 40% female, you could have an item with // item1 = {Fraction: 0.6, ValueExact: 'Male', RangeStart: null, RangeEnd: null} // & item2 = {Fraction: 0.4, ValueExact: 'Female', RangeStart: null, RangeEnd: null} public class FilterItem{ public decimal Fraction {get; set;} public string ValueExact {get; set;} public int? RangeStart {get; set;} public int? RangeEnd {get; set;} } // This is an example of a public method you could call to build your SQL // - passing in a generic list of desired filter // for example the dictionary entry for the above filter would be // {Key: "gender", Value: new List<FilterItem>(){item1, item2}} public string BuildSQL(Dictionary<string, List<FilterItem>> filters, int TotalItems) { // we want to build up a list of SQL stubs that can be unioned together. var sqlStubItems = new List<SqlItem>(); foreach(var entry in filters) { AddFilter(entry.Key, entry.Value, sqlStubItems); } // ok - now just combine all of the sql stubs into one big union. var result = ""; // Id use a stringbuilder for this normally, // but this is probably more cross-language readable. int limitSum = 0; for(int i = 0; i < sqlStubItems.Count; i++) // string.Join() would be more succinct! { var item = sqlStubItems[i]; if (i > 0) { result += " UNION "; } int limit = (int)Math.Round(TotalItems * item.Fraction, 0); limitSum+= limit; if (i == sqlStubItems.Count - 1 && limitSum != TotalItems) { //may need to adjust one of the rounded items to account //for rounding errors making a total that is not the //originally required total limit. limit += (TotalItems - limitSum); } result += item.Sql + " LIMIT " + Convert.ToString(limit); } return result; } // This method expands the number of SQL stubs for every filter that has been added. // each existing filter is split by the number of items in the newly added filter. private void AddFilter(string filterType, List<FilterItem> filterValues, List<SqlItem> SqlItems) { var newItems = new List<SqlItem>(); foreach(var filterItem in filterValues) { string filterAddon; if (filterItem.RangeStart.HasValue && filterItem.RangeEnd.HasValue){ filterAddon = filterType + " >= " + filterItem.RangeStart.ToString() + " AND " + filterType + " <= " + filterItem.RangeEnd.ToString(); } else { filterAddon = filterType + " = '" + filterItem.ValueExact.Replace("'","''") + "'"; //beware of SQL injection. (hence the .Replace() above) } if(SqlItems.Count() == 0) { newItems.Add(new SqlItem(){Sql = "Select * FROM users WHERE " + filterAddon, Fraction = filterItem.Fraction}); } else { foreach(var existingItem in SqlItems) { newItems.Add(new SqlItem() { Sql = existingItem + " AND " + filterAddon, Fraction = existingItem.Fraction * filterItem.Fraction }); } } } SqlItems.Clear(); SqlItems.AddRange(newItems); } // this class is for part-built SQL strings, with the fraction private class SqlItem{ public string Sql { get; set;} public decimal Fraction{get; set;} }
Notes (as per comment by Sign)
- Rounding errors may mean you don't get exactly the 600 / 400 split you were aiming for when applying a large number of filters – but should be close.
- If your dataset is not very diverse then it may not be possible to always generate the required split. This method will require an even distribution amongst the filters (so if you were doing a total of 10 people, 6 male, 4 female , 5 from the north, 5 from the south it would require 3 males from the north, 3 males from the south, 2 females from the north and 2 females from the south.)
- The people are not going to be retrieved at random – just whatever the default sort is. You would need to add something like ORDER BY RAND() (but not that as its VERY inefficient) to get a random selection.
- Beware of SQL injection. Sanitise all user input, replacing single quote
'
chars.
Badly distributed sample problem
How do you address the problem of there being insufficient items in one of our buckets to create our sample as per a representative split (that the above algorithm gives)? Or what if your numbers are not integers?
Well I won't go so far as to provide code, but I will describe a possible approach. You'd need to alter the code above quite a bit, because a flat list of sql stubs isn't going to cut it anymore. Instead you'd need to build a n-dimensional matrix of SQL stubs (adding a dimension for every filter F1 – n) After step 4 above has been completed (where we have our desired, but not necessarily possible numbers for each SQL stub item), what I'd expect to do is
- generate SQL to select counts for all the combined sql WHERE stubs.
- Then you'd iterate the collection – and if you hit an item where the requested limit is higher than the count (or not an integer),
- adjust the requested limit down to the count (or nearest integer).
- Then pick another item on each of the axis that is at least the above adjustment lower that its max count, and adjust it up by the same. If its not possible to find qualifying items then your requested split is not possible.
- Then adjust all the intersecting items for the upward adjusted items down again
- Repeat the step above for intersects between the intersecting points for every additional dimension to n (but toggle the adjustment between negative and positive each time)
So suppose continuing our previous example – our representative split is:
Male/North = 3, Female/North = 2, Male/South = 3, Female/South = 2, but there are only 2 Males in the north (but theres loads of people in the other groups we could pick)
- We adjust Male/North down to 2 (-1)
- We adjust Female/North to 3 (+1) and Male/South to 4 (+1)
- We adjust the Intersecting Female/South to 1 (-1). 瞧! (there are no additional dimensions as we only had 2 criteria/dimensions)
This illustration may be helpful when adjusting intersecting items in higher dimensions (only showing up to 4 dimensions, but should help to picture what needs to be done! Each point represents one of our SQL stub items in the n-dimensional matrix (and has an associated limit number) A line represents a common criteria value (such as gender = male). The objective is that the total along any line should remain the same after adjustments have finished! We start with the red point, and continue for each additional dimension… In the example above we would only be looking at 2 dimensions – a square formed from the red point, the 2 orange points above and to the right of it, and the 1 green point to the NE to complete the square.
I'd go with GROUP BY
:
SELECT gender,region,count(*) FROM users GROUP BY gender,region
+----------------------+ |gender|region|count(*)| +----------------------+ |f |E | 129| |f |N | 43| |f |S | 84| |f |W | 144| |m |E | 171| |m |N | 57| |m |S | 116| |m |W | 256| +----------------------+
You can verify you have 600 males, 400 females, 100 North, 200 South, 300 East and 400 West.
You can include other fields as well.
For range fields, like age and income, you can follow this example:
SELECT gender, region, case when age < 30 then 'Young' when age between 30 and 59 then 'Middle aged' else 'Old' end as age_range, count(*) FROM users GROUP BY gender,region, age_range
So, the results would be like:
+----------------------------------+ |gender|region|age |count(*)| +----------------------------------+ |f |E |Middle aged| 56| |f |E |Old | 31| |f |E |Young | 42| |f |N |Middle aged| 14| |f |N |Old | 11| |f |N |Young | 18| |f |S |Middle aged| 40| |f |S |Old | 23| |f |S |Young | 21| |f |W |Middle aged| 67| |f |W |Old | 42| |f |W |Young | 35| |m |E |Middle aged| 77| |m |E |Old | 56| |m |E |Young | 38| |m |N |Middle aged| 13| |m |N |Old | 25| |m |N |Young | 19| |m |S |Middle aged| 46| |m |S |Old | 39| |m |S |Young | 31| |m |W |Middle aged| 103| |m |W |Old | 66| |m |W |Young | 87| +----------------------------------+