What would be the correct approach to find and format an informally written date in a string?
What would be the correct approach to find and format an informally written date in a string?
I have a small project where I have dates written informally in a firebase database as Strings. I would need to turn these dates into a datetime format. This is assuming the dates are valid. The input in the website cannot be modified (that's the whole premise of solving this problem). It is also assumed the months are written by name, and the months in YYYY format.
I categorized some examples of these entries:
Almost explicit:
"13 july 2017", "3 december 2008", "december 2016", "21 december 2017", "2020", "24 november 2017", "14 march 2012", "22 july 2016", "20 february 2012", "17 january 2011", "obtained july 2013"...
These would be relatively easy. The year is always a 4-digit number, the date a 2-digit number, and the month a string. These can be checked/compared, and if they miss the exact date (1 or 2 digits), I can assume it can be put as the first (day of the month), same goes for no specified months (assumed january.)
Other easy strings:
"When the group is ready", "Not specified", "Canceled", "set."
These contain no date, so they are unimportant. Just checking for no numbers gets the job done. These could be set as 00-00-0000, doesn't matter.
Harder ones:
"Within 12 or 24 months", "Q4 2013", "Late autumn 2019"
I was thinking:
If I read "within", I calculate the current date, plus the number it follows of (days, months, years).
If I define Q (quarters) as a map
Q=1:(1,2,3), 2:(4,5,6), 3:(7,8,9), 4:(10,11,12)
Q=1:(1,2,3), 2:(4,5,6), 3:(7,8,9), 4:(10,11,12)
and the words
when = "Early" : 1, "Mid": 2, "Late": 3
Same would go for seasons, too.
I could access, for example the 4th element of Q for the fourth quarter and get its months, or late spring as seasons["spring"] and get the month of "early, mid, late".
Back to the question:
I'm not sure if I should approach this using regular expressions, maps and so many comparisons, which will introduce a lot of errors, or use some kind of AI, but I'm not too proficient in the subject yet.
So, back to the question, how would you approach this kind of problem?
Is there any library that may be of help?
I feel like there isn't a very simple implementation here, or the approaches I know of are inefficient and take lots of hard coding.
Yes, that’s essentially factorizing what I proposed. It’s less legible but probably could shorten some lines of code. What I’m looking for is: A more efficient approach if there is one, and if there isn’t, why. Also, other more experienced programmer’s opinion.
– JulianP
Sep 11 '18 at 3:09
The problem here is This is assuming the dates are valid. Some of the dates provided in the question are 'not valid' e.g. december 2016 is 'not a valid date'. If you can provide what the criteria for what is for a 'valid date' it may lead to a solution but as it stands, it just to broad.
– Jay
Sep 22 '18 at 12:48
1 Answer
1
Here's what I did. In case somebody needs a problem like this solved.
This code can be optimized but I just didn't have the time atm.
Assumes: date is in format [anything] DD [anything] month (written) [anything] YYYY
Assumes: If the day is not specified, number 1 is preferred. Same for months.
import re
def validate_date(d,m,y):
leap = False
valid = False
y = int(y)
d = int(d)
m = int(m)
if(y % 400 == 0):
leap=True
if(y % 100 == 0):
leap = False
if(y % 4 == 0):
leap = True
if (m<13):
if (m == 1 | m==3 | m== 5 | m==7 | m==8 | m==10 | m==12 ):
if (d <=31):
valid=True
if (m == 4 | m==6 | m==9 | m==11 ):
if (d <= 30):
valid = True
if (m==2):
if (leap == True):
if (d <= 29):
valid = True
if (leap == False):
if (d <= 28):
valid = True
return valid
months =
1: "januari",
2: "februari",
3: "mars",
4: "april",
5: "maj",
6: "juni",
7: "juli",
8: "augusti",
9: "september",
10: "oktober",
11: "november",
12: "december"
def validate(date):
month=0
day=0
year=0
raw_date = str(date)
#get the integers in the text
raw_date = raw_date.lower()
if (not ("q1" in raw_date or "q2" in raw_date or
"q3" in raw_date or "q4" in raw_date)):
if (len(re.findall('d+', raw_date))==2):
day, year = map(int, re.findall('d+', raw_date))
elif (len(re.findall('d+', raw_date))==3):
day, month, year = re.findall('d+', raw_date)
else:
if (len(re.findall('d+', raw_date))==2):
quarter, year = map(int, re.findall('d+', raw_date))
day = 1
#Set month to the string found, if any.
if (month==0):
if (months.get(1) in raw_date or ("q1" in raw_date)): month = 1
elif(months.get(2) in raw_date): month = 2
elif(months.get(3) in raw_date): month = 3
elif(months.get(4) in raw_date or ("q2" in raw_date)): month = 4
elif(months.get(5) in raw_date): month = 5
elif(months.get(6) in raw_date): month = 6
elif(months.get(7) in raw_date or ("q3" in raw_date)): month = 7
elif(months.get(8) in raw_date): month = 8
elif(months.get(9) in raw_date): month = 9
elif(months.get(10) in raw_date or ("q4" in raw_date)): month = 10
elif(months.get(11) in raw_date): month = 11
elif(months.get(12) in raw_date): month = 12
else: month = 1
clean_date = str(str(day)+"-"+str(month)+"-"+str(year))
# print (clean_date)
if (validate_date(day,month,year)==False): return "00-00-0000"
else: return str(clean_date)
Thanks for contributing an answer to Stack Overflow!
But avoid …
To learn more, see our tips on writing great answers.
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Could you use a multiple groups of regular expressions and then make each group a list of certain type and then perform modifications to the dates that need it (Q4, 12 or 24 months)?
– vash_the_stampede
Sep 11 '18 at 3:03