How to filter hits by sub-aggregated results in Elasticsearch

I've been implementing an event sourcing solution backed by Elasticsearch. Documents represent state change events, linked by an id field on the _source. There's a sequence field starting at 0, so the event with the highest sequence per id is the latest event for that id. In practice, additional data will be available only on the first event, and subsequent events will contain only the fields that have changed. The goal is an index I never have to send updates to, only inserts.
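To make the event model concrete, here is a minimal Python sketch of how one chain's events could be folded into its current state on the client side. The `current_state` helper and the `customer` field are hypothetical illustrations, not part of the actual index or mapping.

```python
from typing import Any


def current_state(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold one chain's events, lowest sequence first, into its latest state."""
    state: dict[str, Any] = {}
    for event in sorted(events, key=lambda e: int(e["sequence"])):
        # Later events carry only the fields that changed, so they
        # overwrite those fields and leave the rest untouched.
        state.update(event)
    return state


chain = [
    {"id": "event_chain-1", "status": "SENT", "sequence": 1,
     "export_batch_id": "103709fe-959f-4b4e-8255-ef59f18a3cf6"},
    # "customer" is a made-up field standing in for data that is
    # only present on the first event.
    {"id": "event_chain-1", "status": "READY", "sequence": 0,
     "customer": "acme"},
]

state = current_state(chain)
print(state["status"], state["sequence"], state["customer"])  # SENT 1 acme
```

Nothing is ever updated in place: the latest state is always derived from the inserted events.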

I'm trying to create a query that returns the first and last events grouped by their id, if and only if the latest event's status is READY.

Sample data:

    [
      {
        "_index": "events",
        "_type": "event",
        "_id": "AWcFf2N-IqNGd75vWMgc",
        "_score": 1,
        "_source": {
          "id": "event_chain-1",
          "status": "SENT",
          "sequence": 1,
          "timestamp": "1541985493824",
          "export_batch_id": "103709fe-959f-4b4e-8255-ef59f18a3cf6"
        }
      },
      {
        "_index": "events",
        "_type": "event",
        "_id": "AWbQomwoIqNGd75vWMf6",
        "_score": 1,
        "_source": {
          "id": "event_chain-1",
          "status": "READY",
          "sequence": "0",
          "timestamp": "2018-10-31T00:00:00Z"
        }
      },
      {
        "_index": "events",
        "_type": "event",
        "_id": "AWbQomwoIqNGd75vWabc",
        "_score": 1,
        "_source": {
          "id": "event_chain-2",
          "status": "READY",
          "sequence": "0",
          "timestamp": "2018-10-31T00:00:00Z"
        }
      }
    ]

I wrote a terms aggregation on the id.keyword field, and two top_hits sub aggregations to get the first and latest events by ordering on the sequence and grabbing the top and bottom result respectively.

The problem is that any matching I do on the status happens before the aggregations. I need a way to exclude from the terms aggregation results any bucket whose latest_event has a status other than READY.

What I have so far:

    POST /events/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "must": {
            "match": {
              "status": "READY"
            }
          }
        }
      },
      "aggs": {
        "group_by_id": {
          "terms": {
            "field": "id.keyword",
            "order": {
              "_term": "asc"
            },
            "size": 100
          },
          "aggs": {
            "latest_event": {
              "top_hits": {
                "sort": [
                  {
                    "sequence": {
                      "order": "desc"
                    }
                  }
                ],
                "from": 0,
                "size": 1
              }
            },
            "first_event": {
              "top_hits": {
                "sort": [
                  {
                    "sequence": {
                      "order": "asc"
                    }
                  }
                ],
                "from": 0,
                "size": 1
              }
            }
          }
        },
        "num_ready": {
          "cardinality": {
            "field": "id.keyword"
          }
        }
      }
    }

This returns two terms, one for event_chain-1 and one for event_chain-2, when I only want the one for event_chain-2.
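The effect can be reproduced without a cluster. The following Python sketch mimics the order of operations, query first and bucketing second, on a simplified copy of the sample data (sequence values normalized to integers, other fields dropped). The `filter_then_group` function is a hypothetical stand-in for the search, not an Elasticsearch API.

```python
from collections import defaultdict

EVENTS = [
    {"id": "event_chain-1", "status": "SENT", "sequence": 1},
    {"id": "event_chain-1", "status": "READY", "sequence": 0},
    {"id": "event_chain-2", "status": "READY", "sequence": 0},
]


def filter_then_group(events, status):
    """Mimic a query clause followed by a terms aggregation with top_hits."""
    buckets = defaultdict(list)
    for event in events:
        if event["status"] == status:  # the query runs first ...
            buckets[event["id"]].append(event)  # ... then hits are bucketed
    return {
        chain_id: {
            "first_event": min(hits, key=lambda e: e["sequence"]),
            "latest_event": max(hits, key=lambda e: e["sequence"]),
        }
        for chain_id, hits in sorted(buckets.items())
    }


result = filter_then_group(EVENTS, "READY")
print(list(result))  # ['event_chain-1', 'event_chain-2']
print(result["event_chain-1"]["latest_event"]["sequence"])  # 0
```

Because the SENT event is removed before bucketing, the READY event at sequence 0 looks like the latest event of event_chain-1, which is why that chain still shows up.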

The terms aggregation size is set so that this query can be run in scheduled batches, always scraping the top of the results and updating the chains so they don't come up in the next query.

asked Nov 12 '18 at 3:03

Justin Reeves


Not related to Elasticsearch, but in our apps the number of events per aggregate is usually less than 100, so we just load them all and filter afterwards.

– Roman Eremin
Nov 12 '18 at 8:26

This one's write heavy, read on schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard-to-impossible, since there's no guarantee you'll get any expected hits per batch if the sum of filtered and not filtered is greater than the batch size).

– Justin Reeves
Nov 12 '18 at 19:00

elasticsearch event-sourcing


1 Answer

I dug into this, and I think it came down to the limitations of the individual aggregations. top_hits can't have a sub-aggregation, so I needed some other way to filter the results that came back.

I eventually found someone doing something similar: https://rahulsinghai.blogspot.com/2016/07/elasticsearch-pipeline-bucket-selector.html

The approach combines several aggregations at the same level inside each terms bucket: top_hits for the first and latest events, a max aggregation for the highest sequence per id, and a filter aggregation restricted to READY events with its own max sub-aggregation for the highest READY sequence per id. A bucket_selector aggregation then keeps only the buckets where the two maxima are equal, meaning the latest event is the one in READY status. This assumes all events sharing an id have at least one event in READY status.
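The selection rule can be sanity-checked client-side before running it against an index. This Python sketch applies the same comparison (highest sequence overall versus highest sequence among READY events) to a simplified copy of the sample data with integer sequence values. The `select_chains` function is a hypothetical illustration, and dropping chains that have no READY event at all is an assumption of this sketch.

```python
from collections import defaultdict

EVENTS = [
    {"id": "event_chain-1", "status": "SENT", "sequence": 1},
    {"id": "event_chain-1", "status": "READY", "sequence": 0},
    {"id": "event_chain-2", "status": "READY", "sequence": 0},
]


def select_chains(events, status="READY"):
    """Keep chains whose highest sequence belongs to an event in `status`."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event["id"]].append(event)

    selected = {}
    for chain_id, hits in sorted(buckets.items()):
        max_seq = max(e["sequence"] for e in hits)
        ready_seqs = [e["sequence"] for e in hits if e["status"] == status]
        if not ready_seqs:
            continue  # assumption: chains with no matching event are dropped
        if max(ready_seqs) == max_seq:  # the bucket_selector comparison
            selected[chain_id] = {
                "first_event": min(hits, key=lambda e: e["sequence"]),
                "latest_event": max(hits, key=lambda e: e["sequence"]),
            }
    return selected


selected = select_chains(EVENTS)
print(list(selected))  # ['event_chain-2']
```

For event_chain-1 the highest sequence overall is 1 while the highest READY sequence is 0, so the chain is excluded. Unlike the pre-filtered query, all events stay in the bucket, so first_event and latest_event are taken from the full chain.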

Potential Solution:

    POST /events/_search
    {
      "size": 0,
      "aggs": {
        "grouped_by_id": {
          "terms": {
            "field": "id.keyword",
            "size": 100,
            "order": {"max_seq": "desc"}
          },
          "aggs": {
            "max_seq": {"max": {"field": "sequence"}},
            "latest_event": {
              "top_hits": {
                "sort": [{"sequence": {"order": "desc"}}],
                "from": 0,
                "size": 1
              }
            },
            "first_event": {
              "top_hits": {
                "sort": [{"sequence": {"order": "asc"}}],
                "from": 0,
                "size": 1
              }
            },
            "filters": {
              "filter": {"bool": {"must": [{"match": {"status": "READY"}}]}},
              "aggs": {
                "latest_ready_seq": {"max": {"field": "sequence"}}
              }
            },
            "should_we_consider": {
              "bucket_selector": {
                "buckets_path": {
                  "latest_seq": "max_seq",
                  "latest_ready_seq": "filters>latest_ready_seq"
                },
                "script": "params.latest_seq == params.latest_ready_seq"
              }
            }
          }
        }
      }
    }

answered Nov 12 '18 at 19:17

Justin Reeves
