Python script uses all RAM

I have a Python script that extracts email addresses from large documents. It uses all the RAM on my machine and locks it up to the point where I have to restart. Is there a way to limit its memory use, or to have it pause after it finishes reading each file and print some output? Any help would be appreciated, thank you.


#!/usr/bin/env python

# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
# Twitter @Critical24 - DefensiveThinking.io


from optparse import OptionParser
import os.path
import re

# Raw strings keep the backslash escapes (\. and \s) intact.
regex = re.compile((r"([a-z0-9!#$%&'*+/=?^_`~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`"
                    r"~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

import os
not_parseble_files = ['.txt', '.csv']
for root, dirs, files in os.walk('.'):  # Recursively searches all subdirectories for files
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the extension of the file
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # Skip extensions in the banned list 'not_parseble_files'
            print("File %s is not parseble" % file_path)
            continue  # Move on to the next file
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)






How large are those files? Unless your pattern can span multiple lines, you could try reading the files line-by-line and applying it to each line, i.e. use file f as a generator instead of using read or readlines.

– tobias_k
Sep 14 '18 at 15:07









Also, I just noticed that your comment says that the script extracts from "plain text files", but .txt is in your list of non-parseable files. Should that be a list of parseable files instead?

– tobias_k
Sep 14 '18 at 15:10








I found your problem: "([a-z0-9!#$%&'*+/=?^_`~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`" "~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|" "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"

– Nathan McCoy
Sep 14 '18 at 15:13






Some of the indentation in the code in the question is broken.

– Bryan Oakley
Sep 14 '18 at 15:20






In a worst case, if you have an 8 gig file and read it into memory, you're using 8 gigs of memory (plus a bit of overhead). If you then try to parse that and return the parsed data, that could easily result in another 8 gigs of memory being used.

– Bryan Oakley
Sep 14 '18 at 15:23





2 Answers



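I think you should try the resource module (Unix only):


import resource
megs = 2048  # assumed example cap in MiB; choose a value that suits your machine
resource.setrlimit(resource.RLIMIT_AS, (megs * 1048576, resource.RLIM_INFINITY))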
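
As a rough sketch of how such a cap could be wired into the script, assuming Python 3 on a Unix-like system (the resource module is not available on Windows): the helper names address_space_limit and limit_address_space are made up for illustration and are not part of the original answer. Unlike the one-liner, this version keeps the existing hard limit and only lowers the soft limit, which an unprivileged process is always allowed to do.

```python
import resource


def address_space_limit(megs, hard):
    """Return the (soft, hard) tuple for a soft cap of `megs` MiB.

    The soft limit may not exceed a finite hard limit, so it is clamped.
    """
    soft = megs * 1024 * 1024
    if hard != resource.RLIM_INFINITY:
        soft = min(soft, hard)
    return (soft, hard)


def limit_address_space(megs):
    """Cap this process's virtual address space at roughly `megs` MiB."""
    # Hypothetical helper: reads the current hard limit and leaves it untouched.
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, address_space_limit(megs, hard))
```

Calling limit_address_space(2048) near the top of the script should make an oversized f.read() fail with MemoryError, which can be caught per file, instead of letting the machine swap until it locks up. The cap does not reduce how much memory the script asks for, so it is best combined with reading the files incrementally.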






Thank you. I will give this a try.

– Alex
Sep 14 '18 at 15:13






Also see stackoverflow.com/questions/1760025/limit-python-vm-memory

– Umer
Sep 14 '18 at 15:17



It seems like you are reading files of up to 8 GB into memory with f.read(). Instead, you could apply the regex to each line of the file, so that the entire file is never in memory at once.


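def get_emails(filename):
    """Yields the emails found in filename, reading it one line at a time."""
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        for line in f:
            for email in re.findall(regex, line.lower()):
                if not email[0].startswith('//'):
                    # yield (not return) keeps the file open while the caller iterates
                    yield email[0]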



This can still take a very long time, though. Also, I did not check your regex for possible problems.
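
Putting the line-by-line idea together with the per-file output asked for in the question, a sketch might look like the following. It assumes Python 3 and that no address spans a line break. The names iter_emails and scan_tree, the simplified EMAIL_RE pattern, and the errors="replace" choice are illustrative assumptions, not the original script's regex or behavior.

```python
import os
import re
import sys

# Simplified stand-in pattern for illustration only; swap in the real regex.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def iter_emails(path):
    """Yield addresses from `path`, reading the file one line at a time."""
    # errors="replace" keeps a stray non-UTF-8 byte from aborting the scan.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for match in EMAIL_RE.finditer(line.lower()):
                yield match.group(0)


def scan_tree(top, out=sys.stdout):
    """Walk `top`, print each address, and report progress after every file."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            count = 0
            for email in iter_emails(path):
                print(email, file=out)
                count += 1
            # Progress goes to stderr so piped output contains only addresses.
            print(f"{path}: {count} address(es)", file=sys.stderr)
            out.flush()
```

Calling scan_tree(".") walks the current directory the way the original loop does. Because iter_emails is a generator function, the file stays open only while its lines are being consumed, and memory use is bounded by the longest line rather than by the file size. A file with no line breaks at all would still be read in one piece, so very long lines remain a caveat.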


