How to filter hits by sub-aggregated results in Elasticsearch
I've been implementing an event sourcing solution backed by Elasticsearch. Documents represent state-change events, linked by an id field in the _source. There's a sequence field starting at 0, so the highest sequence per id is the latest event for that id. In practice, additional data will be available only on the first event, and subsequent events will contain only the fields that have changed. The goal was to have an index I never have to send updates to, only inserts.
I'm trying to create a query that returns the first and last events grouped by their id, if and only if the latest event's status matches READY.
Sample data:
[
  {
    "_index": "events",
    "_type": "event",
    "_id": "AWcFf2N-IqNGd75vWMgc",
    "_score": 1,
    "_source": {
      "id": "event_chain-1",
      "status": "SENT",
      "sequence": 1,
      "timestamp": "1541985493824",
      "export_batch_id": "103709fe-959f-4b4e-8255-ef59f18a3cf6"
    }
  },
  {
    "_index": "events",
    "_type": "event",
    "_id": "AWbQomwoIqNGd75vWMf6",
    "_score": 1,
    "_source": {
      "id": "event_chain-1",
      "status": "READY",
      "sequence": "0",
      "timestamp": "2018-10-31T00:00:00Z"
    }
  },
  {
    "_index": "events",
    "_type": "event",
    "_id": "AWbQomwoIqNGd75vWabc",
    "_score": 1,
    "_source": {
      "id": "event_chain-2",
      "status": "READY",
      "sequence": "0",
      "timestamp": "2018-10-31T00:00:00Z"
    }
  }
]
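(For reference, a mapping along these lines would be consistent with the sample documents and with the id.keyword and sequence usage in the queries below. This is a sketch of an assumed mapping, not necessarily the index's actual one; in particular, sequence has to be a numeric type for the sort on sequence, and the max aggregations used later, to behave numerically, even though one of the sample documents carries it as the string "0".)

PUT /events
{
  "mappings": {
    "event": {
      "properties": {
        "id":              { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
        "status":          { "type": "keyword" },
        "sequence":        { "type": "long" },
        "timestamp":       { "type": "date" },
        "export_batch_id": { "type": "keyword" }
      }
    }
  }
}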
I wrote a terms aggregation on the id.keyword field and two top_hits sub-aggregations to get the first and latest events, by ordering on the sequence and grabbing the top and bottom result respectively.
The problem is that any matching I do on the status happens before the aggregations run, and I need a way to exclude from the terms aggregation results any buckets whose latest_event status doesn't match READY.
What I have so far:
POST /events/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": {
        "match": { "status": "READY" }
      }
    }
  },
  "aggs": {
    "group_by_id": {
      "terms": {
        "field": "id.keyword",
        "order": { "_term": "asc" },
        "size": 100
      },
      "aggs": {
        "latest_event": {
          "top_hits": {
            "sort": [ { "sequence": { "order": "desc" } } ],
            "from": 0,
            "size": 1
          }
        },
        "first_event": {
          "top_hits": {
            "sort": [ { "sequence": { "order": "asc" } } ],
            "from": 0,
            "size": 1
          }
        }
      }
    },
    "num_ready": {
      "cardinality": { "field": "id.keyword" }
    }
  }
}
This returns two buckets, one for event_chain-1 and one for event_chain-2, when I only want the one for event_chain-2.
The terms agg size is there so this query can be run in scheduled batches, always scraping the top of the results and updating the chains so they don't come up in the next query.
elasticsearch event-sourcing
asked Nov 12 '18 at 3:03
Justin Reeves
Not related to Elasticsearch, but in our apps the number of events per aggregate is usually less than 100, so we just load them all and filter afterwards.
– Roman Eremin
Nov 12 '18 at 8:26
This one's write-heavy and read on a schedule, so our batches have to be tunable and reasonably small data-wise (which makes filtering after the query potentially hard to impossible, since there's no guarantee you'll get any expected hits per batch if the sum of filtered and not-filtered results is greater than the batch size).
– Justin Reeves
Nov 12 '18 at 19:00
1 Answer
I dug deep on this, and I think it came down to the limitations of the individual aggregations: you can't put a sub-aggregation on top_hits, so I needed some other way to filter the results that came back.
I eventually found someone doing something similar: https://rahulsinghai.blogspot.com/2016/07/elasticsearch-pipeline-bucket-selector.html
The combination is: the top_hits aggregations as before; a max aggregation to find the highest sequence per id; a filter aggregation at the same level, with another max aggregation inside it to find the highest sequence per id counting only events whose status is READY (assuming all events sharing an id have at least one event in READY status); and finally a bucket_selector aggregation that keeps a bucket only when the overall max and the filtered max agree, i.e. when the latest event is the READY one.
Potential Solution:
POST /events/_search
{
  "size": 0,
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id.keyword",
        "size": 100,
        "order": { "max_seq": "desc" }
      },
      "aggs": {
        "max_seq": { "max": { "field": "sequence" } },
        "latest_event": {
          "top_hits": {
            "sort": [ { "sequence": { "order": "desc" } } ],
            "from": 0,
            "size": 1
          }
        },
        "first_event": {
          "top_hits": {
            "sort": [ { "sequence": { "order": "asc" } } ],
            "from": 0,
            "size": 1
          }
        },
        "filters": {
          "filter": { "bool": { "must": [ { "match": { "status": "READY" } } ] } },
          "aggs": {
            "latest_ready_seq": { "max": { "field": "sequence" } }
          }
        },
        "should_we_consider": {
          "bucket_selector": {
            "buckets_path": {
              "latest_seq": "max_seq",
              "latest_ready_seq": "filters>latest_ready_seq"
            },
            "script": "params.latest_seq == params.latest_ready_seq"
          }
        }
      }
    }
  }
}
answered Nov 12 '18 at 19:17
Justin Reeves