Merging multiple average values without calculating total




I currently have multiple [Average, Count] pairs from serialized data. The user wants to be able to merge (group) some of these pairs and get the aggregated result.





I thought it would be easy: I would just compute Sum(Average * Count) / Sum(Count).





But the problem is that some of the values are very large, and summing all of them causes arithmetic overflow.



Is there a way to merge the average part without calculating the total? The Count part is pretty obvious.




2 Answers



Assuming that Count and Average are indexed values, you can compute your aggregate average this way:




TotalCount = Sum(Count)
TotalAverage = Sum(Average * (Count/TotalCount))
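The two formulas above can be sketched in Python as follows. This is a minimal sketch, assuming Python 3 division and `(average, count)` tuples with float averages and non-negative integer counts; the name `merge_two_pass` is made up for illustration. Dividing each count by the total before multiplying keeps every term no larger in magnitude than the biggest average, so the large product `Average * Count` is never formed.

```python
def merge_two_pass(pairs):
    """Merge (average, count) pairs without forming sum(average * count)."""
    pairs = list(pairs)  # two passes are needed, so materialize the input
    total_count = sum(count for _, count in pairs)
    if total_count == 0:
        raise ValueError("cannot merge pairs with a total count of zero")
    # Each weight count / total_count lies in [0, 1], so no single term
    # exceeds the largest average in magnitude.
    total_average = sum(avg * (count / total_count) for avg, count in pairs)
    return total_average, total_count


print(merge_two_pass([(2.0, 1), (4.0, 3)]))
```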



If you want to calculate the values in a single iteration over your serialized data, you can accumulate successive weighted averages, in a manner that resembles an exponential moving average.


TotalCount = 0
TotalAverage = 0
for each index in data-set of [Average, Count]
    TotalCount = TotalCount + Count[index]
    Weight = Count[index]/TotalCount
    TotalAverage = TotalAverage * (1 - Weight)
                 + Average[index] * Weight
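As a rough Python rendering of the loop above (a sketch, assuming Python 3 division and an iterable of `(average, count)` tuples; `merge_streaming` is a hypothetical name), the function below consumes pairs one at a time, so it also works on a generator whose length is unknown. Pairs with a zero count are skipped to avoid dividing by zero, which is an assumption about how empty groups should be treated.

```python
def merge_streaming(pairs):
    """Single-pass merge of (average, count) pairs."""
    total_count = 0
    total_average = 0.0
    for average, count in pairs:
        if count == 0:
            continue  # assumption: an empty group contributes nothing
        total_count += count
        weight = count / total_count  # share of the newest pair, in (0, 1]
        total_average = total_average * (1 - weight) + average * weight
    return total_average, total_count


# A generator stands in for serialized data that is read incrementally.
stream = (pair for pair in [(2.0, 1), (4.0, 1), (6.0, 2)])
print(merge_streaming(stream))
```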



You can derive the right approach by considering the first two pairs.



If there was only the first pair:


TotalCount = Count[1]
TotalAverage = Average[1]



But, if there are two pairs:


TotalCount = Count[1] + Count[2]
TotalAverage = Average[1] * (Count[1]/TotalCount)
             + Average[2] * (Count[2]/TotalCount)



If we were iterating from the first pair into the second pair, then the two pair calculation could look like:


TotalCount = TotalCount + Count[2]
TotalAverage = TotalAverage * (TotalCount - Count[2])/TotalCount
             + Average[2] * (Count[2]/TotalCount)



If we let Weight represent Count[2]/TotalCount, the above simplifies to:




TotalCount = TotalCount + Count[2]
Weight = Count[2]/TotalCount
TotalAverage = TotalAverage * (1 - Weight)
             + Average[2] * Weight



Since TotalCount and TotalAverage are correct after each step that takes in a new pair of the serialized data, the [2] can be replaced with an iteration index.
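The same argument can be written out for a general step. The notation here is introduced only for this sketch and does not appear above: $a_k$ and $c_k$ are the $k$-th Average and Count, $N_k$ is TotalCount after $k$ pairs, $A_k$ is TotalAverage after $k$ pairs, and $w_k$ is Weight.

```latex
\begin{align*}
N_k &= N_{k-1} + c_k, \qquad w_k = \frac{c_k}{N_k}, \\
A_k &= \frac{1}{N_k} \sum_{i=1}^{k} c_i a_i
     = \frac{N_{k-1} A_{k-1} + c_k a_k}{N_k} \\
    &= A_{k-1} \cdot \frac{N_k - c_k}{N_k} + a_k \cdot \frac{c_k}{N_k}
     = A_{k-1} (1 - w_k) + a_k w_k .
\end{align*}
```

The second equality uses the fact that $N_{k-1} A_{k-1}$ equals the weighted sum of the first $k-1$ pairs, which is exactly the statement that the previous step was correct.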








How could I forget this simple math... my elementary school teacher would be so disappointed in me

– Steve
Sep 7 '18 at 21:16




While the answer by @jxh is good and solves your problem, that approach and your original one make two passes over the pairs (first for the total count, then for the average), which could harm performance. You can do it in one pass by keeping a rolling average. This works even if the pairs arrive as a stream and you don't know in advance how many there are.



Some Python code:


data = [(3.1, 12), (5.2, 17), (9.7, 11)]

total_count = 0
total_avg = 0.0
for avg, count in data:
    n0 = total_count  # count accumulated before this pair
    total_count += count

    # p weights the running average; (1 - p) weights the new pair
    p = float(n0) / float(total_count)
    total_avg = p*total_avg + (1.0 - p)*avg

print(total_count)
print(total_avg)
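Since the goal is to merge arbitrary groups, it may help to package the same update as a function that combines two [Average, Count] pairs into one. The sketch below is an illustration rather than part of the code above; `merge_pair` is a made-up name, and Python 3 division is assumed. The pairs are folded together with `functools.reduce`, and the last lines use averages near the top of the float range, where Sum(Average * Count) overflows to infinity but the merged average stays finite.

```python
from functools import reduce


def merge_pair(left, right):
    """Combine two (average, count) pairs into one merged pair."""
    avg_l, n_l = left
    avg_r, n_r = right
    total = n_l + n_r
    if total == 0:
        return 0.0, 0  # assumption: merging two empty groups stays empty
    p = n_l / total  # share contributed by the left pair
    return p * avg_l + (1.0 - p) * avg_r, total


data = [(3.1, 12), (5.2, 17), (9.7, 11)]
print(reduce(merge_pair, data))

# Averages near the float maximum: the weighted sum overflows to inf,
# while the pairwise merge never forms that sum.
huge = [(1e308, 10), (1.5e308, 10)]
print(sum(avg * n for avg, n in huge))
print(reduce(merge_pair, huge))
```

A merged pair has the same shape as its inputs, so previously merged groups can be merged again without going back to the original data.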


