Removing Special Characters / Punctuation from the End of a Python List of URLs




I am writing Python code to extract all the URLs from an input file containing tweet text from Twitter. While doing so, I realized that several of the URLs extracted into the Python list have special characters or punctuation at the end, which prevents me from parsing them further to get the base URL. My question is: how do I identify and remove special characters from the end of every URL in my list?



Current Output:


['https://twitter.com/GVNyqWEu5u', "https://twitter.com/GVNyqWEu5u'", 'https://twitter.com/GVNyqWEu5u@#', 'https://twitter.com/GVNyqWEu5u"']



Desired Output:


['https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u']



Note that not all elements in the 'Current Output' list have special characters or punctuation at the end. The task is to identify and remove those characters only from the list elements that have them.



I am using the following regex to extract Twitter URLs from the tweet text. Can I remove the special characters / punctuation at the end of the URL in this step itself?


lst = re.findall(r'(http.?://[^\s]+)', text)



Full Code:


import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
from socket import timeout
import ssl
import re
import csv

# Disable certificate verification for the requests below
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

count = 0
file = "Test.CSV"
with open(file,'r', encoding='utf-8') as f, open('output_themes_1.csv', 'w', newline='', encoding='utf-8') as ofile:
    next(f)  # skip the header row
    reader = csv.reader(f)
    writer = csv.writer(ofile)
    fir = 'S.No.', 'Article_Id', 'Validity', 'Content', 'Geography', 'URL'
    writer.writerow(fir)
    for line in reader:
        count = count+1
        text = line[5]
        # Extract every http/https URL from the tweet text
        lst = re.findall(r'(http.?://[^\s]+)', text)
        if not lst:
            x = count, line[0], 'Empty List', text, line[8], line[6]
            print (x)
            writer.writerow(x)
        else:
            try:
                for url in lst:
                    try:
                        html = urllib.request.urlopen(url, context=ctx, timeout=60).read()
                        #html = urllib.request.urlopen(urllib.parse.quote(url, errors='ignore'), context=ctx).read()
                        soup = BeautifulSoup(html, 'html.parser')
                        title = soup.title.string
                        str_title = str (title)
                        if 'Twitter' in str_title:
                            if len(lst) > 1: break
                            else: continue
                        else:
                            y = count, line[0], 'Parsed', str_title, line[8], url
                            print (y)
                            writer.writerow(y)
                    except UnicodeEncodeError as e:
                        # Retry with non-ASCII characters dropped from the URL
                        b_url = url.encode('ascii', errors='ignore')
                        n_url = b_url.decode("utf-8")
                        try:
                            html = urllib.request.urlopen(n_url, context=ctx, timeout=90).read()
                            soup = BeautifulSoup(html, 'html.parser')
                            title = soup.title.string
                            str_title = str (title)
                            if 'Twitter' in str_title:
                                if len(lst) > 1: break
                                else: continue
                            else:
                                z = count, line[0], 'Parsed_2', str_title, line[8], url
                                print (z)
                                writer.writerow(z)
                        except Exception as e:
                            a = count, line[0], str(e), text, line[8], url
                            print (a)
                            writer.writerow(a)
            except Exception as e:
                b = count, line[0], str(e), text, line[8], url
                print (b)
                writer.writerow(b)
print ('Total Rows Analyzed:', count)





Define a list of special characters, then remove the last character if it is in your list, and repeat until it is not.
– E.Serra
Aug 31 at 15:14





Are the special characters always at the end?
– UnbearableLightness
Aug 31 at 15:18




3 Answers



Assuming the special characters occur at the end of the string you may use:


import re

mydata = ['https://twitter.com/GVNyqWEu5u', "https://twitter.com/GVNyqWEu5u'", 'https://twitter.com/GVNyqWEu5u@#', 'https://twitter.com/GVNyqWEu5u"']
# Strip any run of non-alphanumeric characters at the end of each string
mydata = [re.sub(r'[^a-zA-Z0-9]+$', '', item) for item in mydata]
print(mydata)



Prints:


['https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u', 'https://twitter.com/GVNyqWEu5u']
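
To show how this cleanup might be chained onto the extraction step from the question, here is a minimal sketch. The sample `text` and the helper name `extract_urls` are made up for illustration, and the extraction pattern is tightened to `https?://\S+`, which is an assumption rather than the asker's exact regex.

```python
import re

# Hypothetical tweet text, invented for this example.
text = (
    "Check https://twitter.com/GVNyqWEu5u' and "
    'https://twitter.com/GVNyqWEu5u" plus https://twitter.com/GVNyqWEu5u@# today'
)

def extract_urls(text):
    """Find http/https URLs, then trim trailing non-alphanumeric characters."""
    raw = re.findall(r'https?://\S+', text)
    return [re.sub(r'[^a-zA-Z0-9]+$', '', url) for url in raw]

urls = extract_urls(text)
print(urls)
```

Note that this character class also removes a trailing `/`, `_` or `-`, so it suits short links like the ones above better than URLs where those endings are meaningful.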





Sorry for asking a naive question, but how does this work with my code scenario below:
– Ayush Saxena
Aug 31 at 16:27





1. lst = re.findall(r'(http.?://[^\s]+)', text) 2. lst = [re.sub(r'[^a-zA-Z0-9]+$', '', item) for item in lst]. This doesn't seem to work. Once I have the list from step 1, I want to remove the special characters / punctuation in step 2.
– Ayush Saxena
Aug 31 at 16:28






Add all your code to your question please.
– UnbearableLightness
Aug 31 at 16:34






Added the full code, please check now.
– Ayush Saxena
Aug 31 at 17:17





In one of the rows I am getting the URL twitter.com/va0JsrIavm', which of course can't be parsed further because of the ' at the end of the URL.
– Ayush Saxena
Aug 31 at 17:25



Assuming your list is called urls:


def remove_special_chars(url, char_list=None):
    if char_list is None:
        # Build your own default list here
        char_list = ['#', '%']
    for character in char_list:
        if url.endswith(character):
            # Drop the last character and check the shortened URL again
            return remove_special_chars(url[:-1], char_list)
    return url

urls = [remove_special_chars(url) for url in urls]



If you want to strip a different set of characters, either change the default value or pass your own list as an argument.
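
The same trim-from-the-right idea can also be sketched without recursion using the built-in `str.rstrip`, which removes any run of the given characters from the end of a string. This is an alternative to the function above, not a rewrite of it; the helper name `strip_trailing` and the choice of `string.punctuation` as the default set are assumptions made for illustration.

```python
import string

def strip_trailing(url, chars=string.punctuation):
    """Remove any run of the given characters from the right end of url."""
    return url.rstrip(chars)

urls = [
    'https://twitter.com/GVNyqWEu5u',
    "https://twitter.com/GVNyqWEu5u'",
    'https://twitter.com/GVNyqWEu5u@#',
    'https://twitter.com/GVNyqWEu5u"',
]
cleaned = [strip_trailing(url) for url in urls]
print(cleaned)

# Pass a narrower set when some trailing punctuation is meaningful,
# e.g. strip only quotes and keep a trailing slash.
quotes_only = strip_trailing("https://example.com/path/'", chars="'\"")
print(quotes_only)
```

Because `string.punctuation` includes `/`, `_` and `-`, the default is aggressive; passing an explicit `chars` string is the safer choice when URLs may legitimately end in those characters.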





Added the full code; please check the problem now. Also note that most of my problems are with " and '. How exactly do you handle those?
– Ayush Saxena
Aug 31 at 17:18





In one of the rows I am getting the URL twitter.com/va0JsrIavm', which of course can't be parsed further because of the ' at the end of the URL.
– Ayush Saxena
Aug 31 at 17:25





Add those characters to char_list; in the case of ', add it as "'" (with double quotes). The function will keep trimming from the right until there is no special character left at the end of the URL.
– E.Serra
Sep 3 at 12:47



You could try this -


lst = [re.sub(r'[=" ]$', '', i) for i in re.findall(r'(http.?://[^\s]+)', text)]



You can add more characters to the character class in the sub call, according to your requirements.
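
Instead of listing every character to strip, another option is to make the extraction pattern itself stop at the last letter or digit, so the cleanup happens in the `findall` step. The sketch below assumes that a valid URL in this data always ends in an alphanumeric character; the sample `text` is made up, and the pattern is a variation, not the regex from the question.

```python
import re

# Hypothetical tweet text, invented for this example.
text = "see https://twitter.com/va0JsrIavm' and https://twitter.com/GVNyqWEu5u@# end"

# \S* is greedy, then backtracks until the match ends on a letter or digit,
# so trailing quotes and symbols are left out of the match.
URL_PATTERN = re.compile(r'https?://\S*[A-Za-z0-9]')

lst = URL_PATTERN.findall(text)
print(lst)
```

The trade-off is the same as with the other approaches: a URL that legitimately ends in `/` or another symbol will be returned without it.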





Sadly, this doesn't seem to work. Would you need more info to debug?
– Ayush Saxena
Aug 31 at 16:31





Added the full code, please check the problem now.
– Ayush Saxena
Aug 31 at 17:18





In one of the rows I am getting the URL twitter.com/va0JsrIavm', which of course can't be parsed further because of the ' at the end of the URL.
– Ayush Saxena
Aug 31 at 17:25


