Removing Special Characters / Punctuation from the End of a Python List of URLs

I am writing Python code to extract all the URLs from an input file that contains text from Twitter (tweets). While doing so, I realized that several of the URLs extracted into the Python list had special characters or punctuation at the end, so I could not parse them further to get the base URL. My question is: how do I identify and remove special characters from the end of every URL in my list?

Current Output:

['https://twitter.com/GVNyqWEu5u', "https://twitter.com/GVNyqWEu5u'", 'https://twitter.com/GVNyqWEu5u@#', 'https://twitter.com/GVNyqWEu5u"']

Desired Output:

['https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u']

Note that not all elements in the 'Current Output' list have special characters or punctuation at the end. The task is to identify and remove them only from the list elements that have them.

I am using the following regex to extract Twitter URLs from the tweet text. Can I remove the special characters or punctuation at the end of the URL in this step itself?

    lst = re.findall(r'(http.?://[^\s]+)', text)

Full Code:

    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup
    from socket import timeout
    import ssl
    import re
    import csv

    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    count = 0
    file = "Test.CSV"
    with open(file, 'r', encoding='utf-8') as f, open('output_themes_1.csv', 'w', newline='', encoding='utf-8') as ofile:
        next(f)  # skip the header row
        reader = csv.reader(f)
        writer = csv.writer(ofile)
        fir = 'S.No.', 'Article_Id', 'Validity', 'Content', 'Geography', 'URL'
        writer.writerow(fir)
        for line in reader:
            count = count + 1
            text = line[5]
            # Extract every URL from the tweet text
            lst = re.findall(r'(http.?://[^\s]+)', text)
            if not lst:
                x = count, line[0], 'Empty List', text, line[8], line[6]
                print(x)
                writer.writerow(x)
            else:
                try:
                    for url in lst:
                        try:
                            html = urllib.request.urlopen(url, context=ctx, timeout=60).read()
                            #html = urllib.request.urlopen(urllib.parse.quote(url, errors='ignore'), context=ctx).read()
                            soup = BeautifulSoup(html, 'html.parser')
                            title = soup.title.string
                            str_title = str(title)
                            if 'Twitter' in str_title:
                                if len(lst) > 1:
                                    break
                                else:
                                    continue
                            else:
                                y = count, line[0], 'Parsed', str_title, line[8], url
                                print(y)
                                writer.writerow(y)
                        except UnicodeEncodeError as e:
                            # Retry with the non-ASCII characters dropped from the URL
                            b_url = url.encode('ascii', errors='ignore')
                            n_url = b_url.decode("utf-8")
                            try:
                                html = urllib.request.urlopen(n_url, context=ctx, timeout=90).read()
                                soup = BeautifulSoup(html, 'html.parser')
                                title = soup.title.string
                                str_title = str(title)
                                if 'Twitter' in str_title:
                                    if len(lst) > 1:
                                        break
                                    else:
                                        continue
                                else:
                                    z = count, line[0], 'Parsed_2', str_title, line[8], url
                                    print(z)
                                    writer.writerow(z)
                            except Exception as e:
                                a = count, line[0], str(e), text, line[8], url
                                print(a)
                                writer.writerow(a)
                except Exception as e:
                    b = count, line[0], str(e), text, line[8], url
                    print(b)
                    writer.writerow(b)
    print('Total Rows Analyzed:', count)

Define a list of special characters then remove the last character if it is in your list and start again
– E.Serra
Aug 31 at 15:14

Are the special characters always at the end?
– UnbearableLightness
Aug 31 at 15:18

3 Answers

Assuming the special characters occur at the end of the string you may use:

    import re

    mydata = ['https://twitter.com/GVNyqWEu5u',
              "https://twitter.com/GVNyqWEu5u'",
              'https://twitter.com/GVNyqWEu5u@#',
              'https://twitter.com/GVNyqWEu5u"']
    # Strip any run of non-alphanumeric characters at the end of each string
    mydata = [re.sub('[^a-zA-Z0-9]+$', '', item) for item in mydata]
    print(mydata)

Prints:

['https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u']
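To show how this could fit after the extraction step, here is a minimal sketch that first extracts the URLs and then applies the same substitution. The sample text is made up for illustration, and the extraction pattern is simplified to `https?://\S+`, which is an assumption rather than the exact pattern from the question. Note that this substitution also removes a trailing `/`, since that is not alphanumeric either.

```python
import re

# Hypothetical tweet text, made up for illustration.
text = ("Look https://twitter.com/GVNyqWEu5u' and "
        'https://twitter.com/GVNyqWEu5u@# or "https://twitter.com/GVNyqWEu5u"')

# Step 1: grab every run of non-whitespace characters that starts with http:// or https://
raw_urls = re.findall(r'https?://\S+', text)

# Step 2: drop any run of non-alphanumeric characters at the very end of each URL
clean_urls = [re.sub(r'[^a-zA-Z0-9]+$', '', url) for url in raw_urls]

print(clean_urls)
```

If step 2 appears to have no effect in a real script, it may be worth printing `raw_urls` first to check what step 1 actually captured.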

Sorry for asking a naive question, but how does this work with my code scenario below:
– Ayush Saxena
Aug 31 at 16:27

1. lst = re.findall(r'(http.?://[^\s]+)', text) 2. lst = [re.sub('[^a-zA-Z0-9]+$','',item) for item in lst]. This doesn't seem to work. Once I have the list from step 1, I want to remove the special characters / punctuation in step 2.
– Ayush Saxena
Aug 31 at 16:28

Add all your code to your question please.
– UnbearableLightness
Aug 31 at 16:34

Added the full code, please check now.
– Ayush Saxena
Aug 31 at 17:17

In one of the rows I am getting the URL twitter.com/va0JsrIavm', which of course can't be parsed further because of the ' at the end of the URL.
– Ayush Saxena
Aug 31 at 17:25

Assuming your list is called urls:

    def remove_special_chars(url, char_list=None):
        if char_list is None:
            # Build your own default list here
            char_list = ['#', '%']
        for character in char_list:
            if url.endswith(character):
                return remove_special_chars(url[:-1], char_list)
        return url

    urls = [remove_special_chars(url) for url in urls]

If you want to strip a different set of characters, either change the default value or pass your own list as an argument.
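As a possible alternative to the recursive function, Python's built-in `str.rstrip` also trims from the right until it reaches a character that is not in the given set. The sketch below assumes a small set of unwanted characters; `TRAILING_JUNK` and `strip_trailing` are hypothetical names chosen for this example, and the set should be adjusted to whatever shows up in your data.

```python
# Characters to trim from the right end; an assumed set, extend it as needed.
TRAILING_JUNK = '\'"@#%'


def strip_trailing(url, chars=TRAILING_JUNK):
    """Return url without any trailing characters that appear in chars."""
    return url.rstrip(chars)


urls = [
    'https://twitter.com/GVNyqWEu5u',
    "https://twitter.com/GVNyqWEu5u'",
    'https://twitter.com/GVNyqWEu5u@#',
    'https://twitter.com/GVNyqWEu5u"',
]
urls = [strip_trailing(url) for url in urls]
print(urls)
```

Because the set is explicit, a trailing `/` is left alone unless you add it yourself.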

Added the full code, please check the problem now. Also note that I am facing most of my problems with " and '. How do you handle them precisely?
– Ayush Saxena
Aug 31 at 17:18

In one of the rows I am getting the URL twitter.com/va0JsrIavm', which of course can't be parsed further because of the ' at the end of the URL.
– Ayush Saxena
Aug 31 at 17:25

Add those characters to char_list; for ', add it as "'" (with double quotes). The function will keep trimming from the right until there is no special character left at the end of the URL.
– E.Serra
Sep 3 at 12:47

You could try this -

    lst = [re.sub('[=" ]$', '', i) for i in re.findall(r'(http.?://[^\s]+)', text)]

You can add more characters to the character class in re.sub according to your requirements.
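Note that the class `[=" ]$` has no quantifier and does not include `'`, so the one-liner removes at most one trailing character and leaves a trailing apostrophe in place. One way to do the cleanup in the extraction step itself is to require the match to end on a letter or digit. This is only a sketch: the sample text and the name `URL_PATTERN` are assumptions for illustration, and URLs that legitimately end in `/` or another symbol would be shortened by it.

```python
import re

# Hypothetical sample text, not taken from the question's CSV file.
text = "See https://twitter.com/va0JsrIavm' and https://twitter.com/GVNyqWEu5u@#"

# \S* is greedy, so it first consumes the whole non-whitespace run and then
# backtracks until the final matched character is a letter or digit.
URL_PATTERN = re.compile(r'https?://\S*[A-Za-z0-9]')

lst = URL_PATTERN.findall(text)
print(lst)
```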

Sadly, this doesn't seem to work. Would you need more info to debug?
– Ayush Saxena
Aug 31 at 16:31

Added the full code, please check the problem now.
– Ayush Saxena
Aug 31 at 17:18

In one of the rows I am getting the URL twitter.com/va0JsrIavm', which of course can't be parsed further because of the ' at the end of the URL.
– Ayush Saxena
Aug 31 at 17:25
