Find Number of Nodes not in a particular pattern [Neo4j]

Find Number of Nodes not in a particular pattern [Neo4j]



I want to find all user nodes that don't have intermediate connections to other user nodes.



Basic graph is like this :


(User)-->(id)



where a User node can have multiple IDs and some Ids have multiple user nodes. There is a particular pattern of interest


Match (u:User)-->(id)<--(u2:User)
where u <> u2 AND id(u) < id(u2)
return u, id, u2



I want to find all users NOT in this pattern. That is, Find the distinct number of Users that don't have a connection to another User. I am on a big 244 RAM machine but everything I tried just kills the connection over the web interface. The graph contains 755MM Users and 2B Nodes+ overall.



Here is they query that breaks


Match (u:User)
Where not ((u)-->()<--(:User)
RETURN count(distinct User)



I will take any solution to this including APOC if it works. enter image description here




1 Answer
1



The main problem with your current query, is that the expansion is killing you. According to the explain, on average, every user is connected to about 2 ids. So Neo4j needs to quadruple the memory it is using for saved rows (X2 #OfRows and X2 #OfColumnsPerRow in the intermediate table before it can filter on connected to another user) (Disclaimer: Cypher does not guarantee this behavior, this is just a possible plan that can occur)



So you want to design your query in a way to minimize the number of nodes Neo4j needs to touch to get the correct results. In your case, your query is the equivalent of asking how many users have no id, or are connected to an id that is exclusive to that user? (As these 2 conditions are mutually exclusive, we don't need to worry about overlap)



Assuming id nodes have the label :ID...


MATCH (id:ID)
WHERE SIZE((:User)-->(id))=1
WITH COUNT(id) as count
MATCH (u:User)
WHERE NOT (u)-->(:ID)
RETURN count+COUNT(u) as count



Note that edges and labels are indexed internally, so doing it this way, Neo4j doesn't need to actually expand anything. It only needs to scan-by-label to get the nodes, and scan the edge table for the number of edges (and check the labels on the connected nodes if need be. This behavior is not guaranteed by Cypher)



One other way that might work depending on how often the pattern occurs, is to find the nodes in the pattern, and subtract that from the total count. Which way is more efficent depends on how often the pattern occurs vs doesn't occur.


MATCH (id:ID)
WHERE SIZE((id)<--(:User)) > 1
MATCH (id)<--(u:User)
RETURN COUNT((:User))-COUNT(DISTINCT u) as count



Your other problem is the shear volume of nodes you need to possibly count, and depending on what all is running on the server, and what is allocated to Neo4j, this could be a bit beyond Cypher (apoc may have something, but I didn't find anything for pattern counting). One way way to get around this is to page the results (Using LIMIT and OFFSET), and then programaticly query the count of each page and sum it together (this has a small chance of error based on frequency of updates, but is better than nothing)



Thanks for contributing an answer to Stack Overflow!



But avoid



To learn more, see our tips on writing great answers.



Required, but never shown



Required, but never shown




By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

ữḛḳṊẴ ẋ,Ẩṙ,ỹḛẪẠứụỿṞṦ,Ṉẍừ,ứ Ị,Ḵ,ṏ ṇỪḎḰṰọửḊ ṾḨḮữẑỶṑỗḮṣṉẃ Ữẩụ,ṓ,ḹẕḪḫỞṿḭ ỒṱṨẁṋṜ ḅẈ ṉ ứṀḱṑỒḵ,ḏ,ḊḖỹẊ Ẻḷổ,ṥ ẔḲẪụḣể Ṱ ḭỏựẶ Ồ Ṩ,ẂḿṡḾồ ỗṗṡịṞẤḵṽẃ ṸḒẄẘ,ủẞẵṦṟầṓế

⃀⃉⃄⃅⃍,⃂₼₡₰⃉₡₿₢⃉₣⃄₯⃊₮₼₹₱₦₷⃄₪₼₶₳₫⃍₽ ₫₪₦⃆₠₥⃁₸₴₷⃊₹⃅⃈₰⃁₫ ⃎⃍₩₣₷ ₻₮⃊⃀⃄⃉₯,⃏⃊,₦⃅₪,₼⃀₾₧₷₾ ₻ ₸₡ ₾,₭⃈₴⃋,€⃁,₩ ₺⃌⃍⃁₱⃋⃋₨⃊⃁⃃₼,⃎,₱⃍₲₶₡ ⃍⃅₶₨₭,⃉₭₾₡₻⃀ ₼₹⃅₹,₻₭ ⃌