Efficient way to retrieve all _ids in ElasticSearch
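What is the fastest way to get all _ids of a certain index from ElasticSearch? Is it possible with a simple query? One of my indexes has around 20,000 documents.
Edit: Please read @Aleck Landgraf's answer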
Do you just need the elasticsearch-internal _id field? Or an id field from within your documents?
For the former, try
curl 'http://localhost:9200/index/type/_search?pretty=true' -d '
{
  "query" : { "match_all" : {} },
  "fields": []
}'
The result will contain only the "metadata" of your documents
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [
      { "_index" : "index", "_type" : "type", "_id" : "36", "_score" : 1.0 },
      { "_index" : "index", "_type" : "type", "_id" : "38", "_score" : 1.0 },
      { "_index" : "index", "_type" : "type", "_id" : "39", "_score" : 1.0 },
      { "_index" : "index", "_type" : "type", "_id" : "34", "_score" : 1.0 }
    ]
  }
}
For the latter, if you want to include a field from your document, simply add it to the fields array
curl 'http://localhost:9200/index/type/_search?pretty=true' -d '
{
  "query" : { "match_all" : {} },
  "fields": ["document_field_to_be_returned"]
}'
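Once you have the metadata-only response, pulling the ids out is a one-liner. Here is a minimal sketch that runs without a cluster: it parses a copy of the sample response from this answer (trimmed to the `hits` part), and `extract_ids` is just a name I made up for illustration, not part of any client library.

```python
import json

def extract_ids(response):
    """Return the _id of every hit in a parsed _search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]

# Copy of the sample response shown in this answer, trimmed to "hits".
raw = """
{
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {"_index": "index", "_type": "type", "_id": "36", "_score": 1.0},
      {"_index": "index", "_type": "type", "_id": "38", "_score": 1.0},
      {"_index": "index", "_type": "type", "_id": "39", "_score": 1.0},
      {"_index": "index", "_type": "type", "_id": "34", "_score": 1.0}
    ]
  }
}
"""

ids = extract_ids(json.loads(raw))
print(ids)  # ['36', '38', '39', '34']
```

Keep in mind that `_search` returns only the first 10 hits by default, so for a whole index you would need to set `size` or page through the results.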
Better to use scroll and scan to get the result list, so elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl python lib this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([])  # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
Console log:
GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...
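Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, which you can update); scan disables sorting. The scan helper function returns a python generator which can be safely iterated through.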
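To make the scroll mechanics concrete, here is a rough sketch of the loop a scan-style helper runs: fetch a first page, then keep requesting the next batch with the returned scroll id until a batch comes back empty. This is not the library's implementation. `iter_scroll_ids`, `make_fake_client`, `first_page` and `next_page` are hypothetical names, and the "client" is an in-memory fake so the example runs without a cluster.

```python
def iter_scroll_ids(first_page, next_page):
    """Yield every _id by following scroll pages until one is empty.

    first_page: callable returning the initial response (a dict).
    next_page:  callable taking a scroll id and returning the next response.
    Both are hypothetical stand-ins for real client calls.
    """
    page = first_page()
    # With search_type=scan the first response may carry no hits,
    # only a scroll id, so an empty first page must not end the loop.
    for hit in page["hits"]["hits"]:
        yield hit["_id"]
    while True:
        page = next_page(page["_scroll_id"])
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty batch means the scroll is exhausted
        for hit in hits:
            yield hit["_id"]


def make_fake_client(batches):
    """Build in-memory first_page/next_page callables (demo only)."""
    remaining = iter(list(batches) + [[]])  # trailing empty batch ends the scroll

    def first_page():
        return {"_scroll_id": "s0", "hits": {"hits": []}}

    def next_page(scroll_id):
        ids = next(remaining)
        return {"_scroll_id": "s", "hits": {"hits": [{"_id": i} for i in ids]}}

    return first_page, next_page


first, nxt = make_fake_client([["36", "38"], ["39", "34"]])
print(list(iter_scroll_ids(first, nxt)))  # ['36', '38', '39', '34']
```

Because it is a generator, ids are produced batch by batch and the full result never has to sit in one response.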
Another option
curl 'http://localhost:9200/index/type/_search?pretty=true&fields='
will return _index, _type, _id and _score.
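For elasticsearch 5.x, you can use the "_source" field.
GET /_search
{
  "_source": false,
  "query" : { "term" : { "user" : "kimchy" } }
}
"fields" has been deprecated. (Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")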
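If the same script has to talk to both old and new clusters, one option is to build the request body based on the server's major version. This is only a sketch: `ids_only_body` is a made-up helper, and the cutover at version 5 is an assumption taken from the deprecation error quoted in this answer, so check it against your own cluster.

```python
def ids_only_body(major_version, query=None):
    """Build a _search body that asks for hit metadata only.

    major_version: Elasticsearch major version as an int (assumed to be
    known by the caller, e.g. read from the cluster's root endpoint).
    """
    body = {"query": query or {"match_all": {}}}
    if major_version >= 5:
        # 5.x: suppress _source instead of using the removed "fields" key
        body["_source"] = False
    else:
        # pre-5.x: an empty "fields" list, as in the earlier answers
        body["fields"] = []
    return body


print(ids_only_body(5))
print(ids_only_body(1, {"term": {"user": "kimchy"}}))
```

The returned dict can be passed as the request body to whichever client or HTTP library you already use.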
Inspired by @Aleck-Landgraf's answer, for me it worked by using the scan function directly in the standard elasticsearch python API:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()
for dobj in scan(es,
                 query={"query": {"match_all": {}}, "fields": []},
                 index="your-index-name",
                 doc_type="your-doc-type"):
    print(dobj["_id"], end=" ")  # Python 3 form of the original `print dobj["_id"],`
You can also do it in python, which gives you a proper list:
import elasticsearch

es = elasticsearch.Elasticsearch()
res = es.search(
    index=your_index,
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})
ids = [d['_id'] for d in res['hits']['hits']]
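Elaborating on the 2 answers by @Robert-Lujo and @Aleck-Landgraf (someone with the permissions is welcome to move this to a comment): if you don't want to print but want to collect everything from the returned generator into a list, here is what I use: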
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts=[YOUR_ES_HOST])
a = helpers.scan(es, query={"query": {"match_all": {}}}, scroll='1m', index=INDEX_NAME)  # like others so far
IDs = [aa['_id'] for aa in a]
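Url -> http://localhost:9200/<index>/<type>/_query
http method -> DELETE
Query -> {"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]}
Note: in Elasticsearch 1.x, a DELETE request to _query is the delete-by-query API, so this call deletes the matching documents instead of returning their ids.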