How to filter hits by sub-aggregated results in Elasticsearch
I've been implementing an event sourcing solution backed in elasticsearch. Documents represent state change events, linked by id field on the _source. There's a sequence field starting at 0, so that the highest sequence per id is the latest event for that id. In practice additional data will be available only on the first event and subsequent events will contain only the fields that have changed. The goal was to have an index I never have to send updates to, only inserts.
Trying to create a query that will returns the first and last events grouped by their id, if an only if their latest event's status matches READY.
Sample data:
[
"_index":"events",
"_type":"event",
"_id":"AWcFf2N-IqNGd75vWMgc",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"SENT",
"sequence":1,
"timestamp":"1541985493824",
"export_batch_id":"103709fe-959f-4b4e-8255-ef59f18a3cf6"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWMf6",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWabc",
"_score":1,
"_source":
"id":"event_chain-2",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
]
I wrote a terms aggregation on the id.keyword field, and two top_hits sub aggregations to get the first and latest events by ordering on the sequence and grabbing the top and bottom result respectively.
Problem is any matching I do on the status happens before the aggregations, and I need a way to exclude from the terms aggregation results any hits where the latest_event's status is the one that doesn't match READY.
What I have so far:
POST /events/_search
"size": 0,
"query":
"bool":
"must":
"match":
"status": "READY"
,
"aggs":
"group_by_id":
"terms":
"field": "id.keyword",
"order":
"_term": "asc"
,
"size": 100
,
"aggs":
"latest_event":
"top_hits":
"sort": [
"sequence":
"order": "desc"
],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": [
"sequence":
"order": "asc"
],
"from": 0,
"size": 1
,
"num_ready":
"cardinality":
"field": "id.keyword"
This would return two terms, one for event_chain-1 and one for event_chain-2 when I only want the one for event_chain-2
Terms agg size is so this query can be run in scheduled batches, always scraping the top of the results and updating the chains so they don't come up in the next query.
elasticsearch event-sourcing
add a comment |
I've been implementing an event sourcing solution backed in elasticsearch. Documents represent state change events, linked by id field on the _source. There's a sequence field starting at 0, so that the highest sequence per id is the latest event for that id. In practice additional data will be available only on the first event and subsequent events will contain only the fields that have changed. The goal was to have an index I never have to send updates to, only inserts.
Trying to create a query that will returns the first and last events grouped by their id, if an only if their latest event's status matches READY.
Sample data:
[
"_index":"events",
"_type":"event",
"_id":"AWcFf2N-IqNGd75vWMgc",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"SENT",
"sequence":1,
"timestamp":"1541985493824",
"export_batch_id":"103709fe-959f-4b4e-8255-ef59f18a3cf6"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWMf6",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWabc",
"_score":1,
"_source":
"id":"event_chain-2",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
]
I wrote a terms aggregation on the id.keyword field, and two top_hits sub aggregations to get the first and latest events by ordering on the sequence and grabbing the top and bottom result respectively.
Problem is any matching I do on the status happens before the aggregations, and I need a way to exclude from the terms aggregation results any hits where the latest_event's status is the one that doesn't match READY.
What I have so far:
POST /events/_search
"size": 0,
"query":
"bool":
"must":
"match":
"status": "READY"
,
"aggs":
"group_by_id":
"terms":
"field": "id.keyword",
"order":
"_term": "asc"
,
"size": 100
,
"aggs":
"latest_event":
"top_hits":
"sort": [
"sequence":
"order": "desc"
],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": [
"sequence":
"order": "asc"
],
"from": 0,
"size": 1
,
"num_ready":
"cardinality":
"field": "id.keyword"
This would return two terms, one for event_chain-1 and one for event_chain-2 when I only want the one for event_chain-2
Terms agg size is so this query can be run in scheduled batches, always scraping the top of the results and updating the chains so they don't come up in the next query.
elasticsearch event-sourcing
1
Not related to Elasticsearch, but I our apps number of events per aggregate usually is less than 100, so we just load them all and filter afterwards.
– Roman Eremin
Nov 12 '18 at 8:26
This one's write heavy, read on schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard-to-impossible since there's no garauntee you'll get any expected hits per batch if the sum of filtered and not filtered is greater than the batch size).
– Justin Reeves
Nov 12 '18 at 19:00
add a comment |
I've been implementing an event sourcing solution backed in elasticsearch. Documents represent state change events, linked by id field on the _source. There's a sequence field starting at 0, so that the highest sequence per id is the latest event for that id. In practice additional data will be available only on the first event and subsequent events will contain only the fields that have changed. The goal was to have an index I never have to send updates to, only inserts.
Trying to create a query that will returns the first and last events grouped by their id, if an only if their latest event's status matches READY.
Sample data:
[
"_index":"events",
"_type":"event",
"_id":"AWcFf2N-IqNGd75vWMgc",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"SENT",
"sequence":1,
"timestamp":"1541985493824",
"export_batch_id":"103709fe-959f-4b4e-8255-ef59f18a3cf6"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWMf6",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWabc",
"_score":1,
"_source":
"id":"event_chain-2",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
]
I wrote a terms aggregation on the id.keyword field, and two top_hits sub aggregations to get the first and latest events by ordering on the sequence and grabbing the top and bottom result respectively.
Problem is any matching I do on the status happens before the aggregations, and I need a way to exclude from the terms aggregation results any hits where the latest_event's status is the one that doesn't match READY.
What I have so far:
POST /events/_search
"size": 0,
"query":
"bool":
"must":
"match":
"status": "READY"
,
"aggs":
"group_by_id":
"terms":
"field": "id.keyword",
"order":
"_term": "asc"
,
"size": 100
,
"aggs":
"latest_event":
"top_hits":
"sort": [
"sequence":
"order": "desc"
],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": [
"sequence":
"order": "asc"
],
"from": 0,
"size": 1
,
"num_ready":
"cardinality":
"field": "id.keyword"
This would return two terms, one for event_chain-1 and one for event_chain-2 when I only want the one for event_chain-2
Terms agg size is so this query can be run in scheduled batches, always scraping the top of the results and updating the chains so they don't come up in the next query.
elasticsearch event-sourcing
I've been implementing an event sourcing solution backed in elasticsearch. Documents represent state change events, linked by id field on the _source. There's a sequence field starting at 0, so that the highest sequence per id is the latest event for that id. In practice additional data will be available only on the first event and subsequent events will contain only the fields that have changed. The goal was to have an index I never have to send updates to, only inserts.
Trying to create a query that will returns the first and last events grouped by their id, if an only if their latest event's status matches READY.
Sample data:
[
"_index":"events",
"_type":"event",
"_id":"AWcFf2N-IqNGd75vWMgc",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"SENT",
"sequence":1,
"timestamp":"1541985493824",
"export_batch_id":"103709fe-959f-4b4e-8255-ef59f18a3cf6"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWMf6",
"_score":1,
"_source":
"id":"event_chain-1",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
,
"_index":"events",
"_type":"event",
"_id":"AWbQomwoIqNGd75vWabc",
"_score":1,
"_source":
"id":"event_chain-2",
"status":"READY",
"sequence":"0",
"timestamp":"2018-10-31T00:00:00Z"
]
I wrote a terms aggregation on the id.keyword field, and two top_hits sub aggregations to get the first and latest events by ordering on the sequence and grabbing the top and bottom result respectively.
Problem is any matching I do on the status happens before the aggregations, and I need a way to exclude from the terms aggregation results any hits where the latest_event's status is the one that doesn't match READY.
What I have so far:
POST /events/_search
"size": 0,
"query":
"bool":
"must":
"match":
"status": "READY"
,
"aggs":
"group_by_id":
"terms":
"field": "id.keyword",
"order":
"_term": "asc"
,
"size": 100
,
"aggs":
"latest_event":
"top_hits":
"sort": [
"sequence":
"order": "desc"
],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": [
"sequence":
"order": "asc"
],
"from": 0,
"size": 1
,
"num_ready":
"cardinality":
"field": "id.keyword"
This would return two terms, one for event_chain-1 and one for event_chain-2 when I only want the one for event_chain-2
Terms agg size is so this query can be run in scheduled batches, always scraping the top of the results and updating the chains so they don't come up in the next query.
elasticsearch event-sourcing
elasticsearch event-sourcing
asked Nov 12 '18 at 3:03
Justin ReevesJustin Reeves
543318
543318
1
Not related to Elasticsearch, but I our apps number of events per aggregate usually is less than 100, so we just load them all and filter afterwards.
– Roman Eremin
Nov 12 '18 at 8:26
This one's write heavy, read on schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard-to-impossible since there's no garauntee you'll get any expected hits per batch if the sum of filtered and not filtered is greater than the batch size).
– Justin Reeves
Nov 12 '18 at 19:00
add a comment |
1
Not related to Elasticsearch, but I our apps number of events per aggregate usually is less than 100, so we just load them all and filter afterwards.
– Roman Eremin
Nov 12 '18 at 8:26
This one's write heavy, read on schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard-to-impossible since there's no garauntee you'll get any expected hits per batch if the sum of filtered and not filtered is greater than the batch size).
– Justin Reeves
Nov 12 '18 at 19:00
1
1
Not related to Elasticsearch, but I our apps number of events per aggregate usually is less than 100, so we just load them all and filter afterwards.
– Roman Eremin
Nov 12 '18 at 8:26
Not related to Elasticsearch, but I our apps number of events per aggregate usually is less than 100, so we just load them all and filter afterwards.
– Roman Eremin
Nov 12 '18 at 8:26
This one's write heavy, read on schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard-to-impossible since there's no garauntee you'll get any expected hits per batch if the sum of filtered and not filtered is greater than the batch size).
– Justin Reeves
Nov 12 '18 at 19:00
This one's write heavy, read on schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard-to-impossible since there's no garauntee you'll get any expected hits per batch if the sum of filtered and not filtered is greater than the batch size).
– Justin Reeves
Nov 12 '18 at 19:00
add a comment |
1 Answer
1
active
oldest
votes
I dug deep on this and tried to look at it. I think it came down to the limitations of the individual aggregations. Can't do a sub-agg on top_hits, so I needed some other way to filter the results that came back.
I eventually found someone doing something similar: https://rahulsinghai.blogspot.com/2016/07/elasticsearch-pipeline-bucket-selector.html
Enter combining top_hits, max to find the max sequence per id, and filter aggregations at the same level, then another max aggregation on the filter aggregations to find the max sequence per id only for each result that is in status READY, assuming all events sharing an id have at least one event in READY status, then using bucket_selector aggregation to select the relevant set based max and filter results.
Potential Solution:
POST /events/_search
"size": 0,
"aggs":
"grouped_by_id":
"terms":
"field": "id.keyword",
"size": 100,
"order": "max_seq":"desc"
,
"aggs":
"max_seq": "max":"field":"sequence",
"latest_event":
"top_hits":
"sort": ["sequence":"order":"desc"],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": ["sequence":"order":"asc"],
"from": 0,
"size": 1
,
"filters":
"filter": "bool":"must":["match":"status":"READY"],
"aggs":
"latest_ready_seq": "max":"field":"sequence"
,
"should_we_consider":
"bucket_selector":
"buckets_path":
"latest_seq": "max_seq",
"latest_ready_seq": "filters>latest_ready_seq"
,
"script": "params.latest_seq == params.latest_ready_seq"
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53255462%2fhow-to-filter-hits-by-sub-aggregated-results-in-elasticsearch%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
I dug deep on this and tried to look at it. I think it came down to the limitations of the individual aggregations. Can't do a sub-agg on top_hits, so I needed some other way to filter the results that came back.
I eventually found someone doing something similar: https://rahulsinghai.blogspot.com/2016/07/elasticsearch-pipeline-bucket-selector.html
Enter combining top_hits, max to find the max sequence per id, and filter aggregations at the same level, then another max aggregation on the filter aggregations to find the max sequence per id only for each result that is in status READY, assuming all events sharing an id have at least one event in READY status, then using bucket_selector aggregation to select the relevant set based max and filter results.
Potential Solution:
POST /events/_search
"size": 0,
"aggs":
"grouped_by_id":
"terms":
"field": "id.keyword",
"size": 100,
"order": "max_seq":"desc"
,
"aggs":
"max_seq": "max":"field":"sequence",
"latest_event":
"top_hits":
"sort": ["sequence":"order":"desc"],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": ["sequence":"order":"asc"],
"from": 0,
"size": 1
,
"filters":
"filter": "bool":"must":["match":"status":"READY"],
"aggs":
"latest_ready_seq": "max":"field":"sequence"
,
"should_we_consider":
"bucket_selector":
"buckets_path":
"latest_seq": "max_seq",
"latest_ready_seq": "filters>latest_ready_seq"
,
"script": "params.latest_seq == params.latest_ready_seq"
add a comment |
I dug deep on this and tried to look at it. I think it came down to the limitations of the individual aggregations. Can't do a sub-agg on top_hits, so I needed some other way to filter the results that came back.
I eventually found someone doing something similar: https://rahulsinghai.blogspot.com/2016/07/elasticsearch-pipeline-bucket-selector.html
Enter combining top_hits, max to find the max sequence per id, and filter aggregations at the same level, then another max aggregation on the filter aggregations to find the max sequence per id only for each result that is in status READY, assuming all events sharing an id have at least one event in READY status, then using bucket_selector aggregation to select the relevant set based max and filter results.
Potential Solution:
POST /events/_search
"size": 0,
"aggs":
"grouped_by_id":
"terms":
"field": "id.keyword",
"size": 100,
"order": "max_seq":"desc"
,
"aggs":
"max_seq": "max":"field":"sequence",
"latest_event":
"top_hits":
"sort": ["sequence":"order":"desc"],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": ["sequence":"order":"asc"],
"from": 0,
"size": 1
,
"filters":
"filter": "bool":"must":["match":"status":"READY"],
"aggs":
"latest_ready_seq": "max":"field":"sequence"
,
"should_we_consider":
"bucket_selector":
"buckets_path":
"latest_seq": "max_seq",
"latest_ready_seq": "filters>latest_ready_seq"
,
"script": "params.latest_seq == params.latest_ready_seq"
add a comment |
I dug deep on this and tried to look at it. I think it came down to the limitations of the individual aggregations. Can't do a sub-agg on top_hits, so I needed some other way to filter the results that came back.
I eventually found someone doing something similar: https://rahulsinghai.blogspot.com/2016/07/elasticsearch-pipeline-bucket-selector.html
Enter combining top_hits, max to find the max sequence per id, and filter aggregations at the same level, then another max aggregation on the filter aggregations to find the max sequence per id only for each result that is in status READY, assuming all events sharing an id have at least one event in READY status, then using bucket_selector aggregation to select the relevant set based max and filter results.
Potential Solution:
POST /events/_search
"size": 0,
"aggs":
"grouped_by_id":
"terms":
"field": "id.keyword",
"size": 100,
"order": "max_seq":"desc"
,
"aggs":
"max_seq": "max":"field":"sequence",
"latest_event":
"top_hits":
"sort": ["sequence":"order":"desc"],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": ["sequence":"order":"asc"],
"from": 0,
"size": 1
,
"filters":
"filter": "bool":"must":["match":"status":"READY"],
"aggs":
"latest_ready_seq": "max":"field":"sequence"
,
"should_we_consider":
"bucket_selector":
"buckets_path":
"latest_seq": "max_seq",
"latest_ready_seq": "filters>latest_ready_seq"
,
"script": "params.latest_seq == params.latest_ready_seq"
I dug deep on this and tried to look at it. I think it came down to the limitations of the individual aggregations. Can't do a sub-agg on top_hits, so I needed some other way to filter the results that came back.
I eventually found someone doing something similar: https://rahulsinghai.blogspot.com/2016/07/elasticsearch-pipeline-bucket-selector.html
Enter combining top_hits, max to find the max sequence per id, and filter aggregations at the same level, then another max aggregation on the filter aggregations to find the max sequence per id only for each result that is in status READY, assuming all events sharing an id have at least one event in READY status, then using bucket_selector aggregation to select the relevant set based max and filter results.
Potential Solution:
POST /events/_search
"size": 0,
"aggs":
"grouped_by_id":
"terms":
"field": "id.keyword",
"size": 100,
"order": "max_seq":"desc"
,
"aggs":
"max_seq": "max":"field":"sequence",
"latest_event":
"top_hits":
"sort": ["sequence":"order":"desc"],
"from": 0,
"size": 1
,
"first_event":
"top_hits":
"sort": ["sequence":"order":"asc"],
"from": 0,
"size": 1
,
"filters":
"filter": "bool":"must":["match":"status":"READY"],
"aggs":
"latest_ready_seq": "max":"field":"sequence"
,
"should_we_consider":
"bucket_selector":
"buckets_path":
"latest_seq": "max_seq",
"latest_ready_seq": "filters>latest_ready_seq"
,
"script": "params.latest_seq == params.latest_ready_seq"
answered Nov 12 '18 at 19:17
Justin ReevesJustin Reeves
543318
543318
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53255462%2fhow-to-filter-hits-by-sub-aggregated-results-in-elasticsearch%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Not related to Elasticsearch, but I our apps number of events per aggregate usually is less than 100, so we just load them all and filter afterwards.
– Roman Eremin
Nov 12 '18 at 8:26
This one's write heavy, read on schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard-to-impossible since there's no garauntee you'll get any expected hits per batch if the sum of filtered and not filtered is greater than the batch size).
– Justin Reeves
Nov 12 '18 at 19:00