Python script uses all RAM

I have a Python script that extracts email addresses from large documents. The script uses all the RAM on my machine and locks it up to the point where I have to restart. Is there a way to limit its memory use, or to have it pause after it finishes reading one file and printing some output? Any help would be great, thank you.

    #!/usr/bin/env python
    # Extracts email addresses from one or more plain text files.
    #
    # Notes:
    # - Does not save to file (pipe the output to a file if you want it saved).
    # - Does not check for duplicates (which can easily be done in the terminal).
    # Twitter @Critical24 - DefensiveThinking.io

    from optparse import OptionParser
    import os.path
    import re

    regex = re.compile((r"([a-z0-9!#$%&'*+/=?^_`~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`"
                        r"~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                        r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

    def file_to_str(filename):
        """Returns the contents of filename as a string."""
        with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
            return f.read().lower()  # Case is lowered to prevent regex mismatches.

    def get_emails(s):
        """Returns an iterator of matched emails found in string s."""
        # Removing lines that start with '//' because the regular expression
        # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
        return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

    import os

    not_parseble_files = ['.txt', '.csv']
    for root, dirs, files in os.walk('.'):  # This recursively searches all sub directories for files
        for file in files:
            _, file_ext = os.path.splitext(file)  # Here we get the extension of the file
            file_path = os.path.join(root, file)
            if file_ext in not_parseble_files:  # We make sure the extension is not in the banned list 'not_parseble_files'
                print("File %s is not parseble" % file_path)
                continue  # This one continues the loop to the next file
            if os.path.isfile(file_path):
                for email in get_emails(file_to_str(file_path)):
                    print(email)

How large are those files? Unless your pattern can span multiple lines, you could try reading the files line-by-line and applying it to each line, i.e. use file f as a generator instead of using read or readlines.

– tobias_k
Sep 14 '18 at 15:07

Also, I just noticed that your comment says that the script extracts from "plain text files", but .txt is in your list of non-parseable files. Should that be a list of parseable files instead?

– tobias_k
Sep 14 '18 at 15:10

I found your problem: "([a-z0-9!#$%&'*+/=?^_`~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`" "~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|" "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"

– Nathan McCoy
Sep 14 '18 at 15:13

Some of the indentation in the code in the question is broken.

– Bryan Oakley
Sep 14 '18 at 15:20

In the worst case, if you have an 8 GB file and read it into memory, you're using 8 GB of memory (plus a bit of overhead). If you then parse that and return the parsed data, that could easily use another 8 GB.

– Bryan Oakley
Sep 14 '18 at 15:23
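
To make the point in the comment above concrete, here is a rough sketch that compares peak allocations for `f.read().lower()` against iterating over the file line by line, using the standard `tracemalloc` module. The helper names (`peak_bytes`, `read_whole`, `read_by_line`, `make_sample`) and the sample file contents are made up for illustration; actual numbers will vary by Python version and platform.

```python
import os
import tempfile
import tracemalloc

def peak_bytes(func, path):
    """Return the peak traced allocation (bytes) while func(path) runs."""
    tracemalloc.start()
    try:
        func(path)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def read_whole(path):
    # Mirrors the question: the whole file plus a lowered copy are in memory.
    with open(path, encoding="utf-8") as f:
        return len(f.read().lower())

def read_by_line(path):
    # Only one line (plus the read buffer) is held at a time.
    with open(path, encoding="utf-8") as f:
        return sum(len(line.lower()) for line in f)

def make_sample(n_lines=50_000):
    """Write a throwaway text file (hypothetical test data) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for i in range(n_lines):
            f.write(f"user{i}@example.com some filler text\n")
    return path

if __name__ == "__main__":
    sample = make_sample()
    try:
        print("read():      ", peak_bytes(read_whole, sample))
        print("line by line:", peak_bytes(read_by_line, sample))
    finally:
        os.remove(sample)
```

The whole-file variant should report a peak at least as large as the file itself, while the line-by-line variant should stay roughly constant regardless of file size.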

2 Answers

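I think you should try the resource module (Unix only):

    import resource

    megs = 1024  # example cap in MiB; choose a value below your physical RAM
    resource.setrlimit(resource.RLIMIT_AS, (megs * 1048576, resource.RLIM_INFINITY))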
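
A minimal sketch of how such a limit might be used, assuming Python 3 on a Unix-like system (the `resource` module is not available on Windows). The helper names `limit_memory` and `try_allocate` and the 1 GiB cap are assumptions for illustration, not part of the answer above. With the cap in place, an oversized allocation should raise `MemoryError` instead of freezing the machine.

```python
import resource

def limit_memory(max_bytes):
    """Cap this process's address space (hypothetical helper, Unix only)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # The soft limit may not exceed an existing finite hard limit.
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    # Lower only the soft limit; leave the hard limit untouched.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

def try_allocate(n_bytes):
    """Return True if n_bytes could be allocated, False on MemoryError."""
    try:
        data = bytearray(n_bytes)
    except MemoryError:
        return False
    del data
    return True

if __name__ == "__main__":
    limit_memory(1024 ** 3)                # assumed cap: 1 GiB
    print(try_allocate(10 * 1024 ** 2))    # 10 MiB, well under the cap
    print(try_allocate(2 * 1024 ** 3))     # 2 GiB, above the cap
```

Note that this only turns a lock-up into an exception; the script would still need to read the files in smaller pieces to finish its work.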

Thank you. I will give this a try.

– Alex
Sep 14 '18 at 15:13

Also see stackoverflow.com/questions/1760025/limit-python-vm-memory

– Umer
Sep 14 '18 at 15:17

It seems like you are reading files of up to 8 GB into memory using f.read(). Instead, you could try applying the regex to each line of the file, without ever having the entire file in memory.

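    def get_emails_from_file(filename):  # hypothetical name; a generator, so the file stays open while iterating
        with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
            for line in f:  # only one line is held in memory at a time
                yield from (email[0] for email in re.findall(regex, line.lower())
                            if not email[0].startswith('//'))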

This can still take a very long time, though. Also, I did not check your regex for possible problems.
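
Combining the line-by-line idea with the directory walk from the question might look like the sketch below. The names `iter_emails` and `scan_tree`, the simplified `EMAIL_RE` pattern, the `errors="ignore"` setting, and the choice to treat `.txt` and `.csv` as the files to parse are all assumptions for illustration; swap in the question's own regex and extension list as needed.

```python
import os
import re

# Simplified pattern for illustration only; not the regex from the question.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")

def iter_emails(path):
    """Yield addresses from one file, holding only one line in memory."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            yield from EMAIL_RE.findall(line.lower())

def scan_tree(top, extensions=(".txt", ".csv")):
    """Walk top recursively and yield (path, email) for matching files."""
    for root, _dirs, files in os.walk(top):
        for name in files:
            if os.path.splitext(name)[1].lower() in extensions:
                path = os.path.join(root, name)
                for email in iter_emails(path):
                    yield path, email
```

Looping over `scan_tree('.')` and printing each email reproduces the question's output one match at a time. A file that consists of a single enormous line would still be read in one piece, so this helps only when the input contains line breaks.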
