awk: extract string from a field [closed]




In the input, fields are separated by a pipe sign:


CCCC|Sess C1|s1 DA=yy07:@##;/u/t/we
DDDDD|Sess C2|s4 DB=yy8:@##;/u/ba



I want output where only the last field is changed: it should contain just what is between the first = and the first : in that field.



Expected output:


CCCC|Sess C1|yy07
DDDDD|Sess C2|yy8









What do you mean by "I want to get output where last field is changed"? What exactly defines the expected output? Is it the part before the second |, plus the part between = and :? Please edit your question to add this information.

– Sparhawk
Sep 10 '18 at 12:23









Output columns are also separated with | (pipe). Only the last column changes: there I need to print just what is between the first = and the first : of the original last column.

– Chris
Sep 10 '18 at 12:28




3 Answers



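Standard awk is not very good at extracting data out of fields based on patterns. Some options include:

split(), to split the field into an array on a separator

match(), which sets the RSTART and RLENGTH variables, combined with substr() to extract the matched portion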



So here:


awk -F'|' -v OFS='|' '
split($3, a, /[=:]/) >= 2 {print $1, $2, a[2]}' < file.txt



That returns the portion between the first and second occurrence of a = or : in $3.
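To try the split() approach on the sample data from the question, here is a minimal self-contained sketch. The function name extract_split is only an illustration, and the input is fed inline instead of being read from file.txt.

```shell
# Hypothetical helper name; reads pipe-separated lines on stdin.
extract_split() {
  awk -F'|' -v OFS='|' '
    # split() returns the number of pieces; 2 or more means a separator was found
    split($3, a, /[=:]/) >= 2 { print $1, $2, a[2] }'
}

# Sample lines copied from the question.
printf '%s\n' \
  'CCCC|Sess C1|s1 DA=yy07:@##;/u/t/we' \
  'DDDDD|Sess C2|s4 DB=yy8:@##;/u/ba' | extract_split
```

On that input it prints the two lines of the expected output. Note that split() treats = and : alike, so a third field such as a:b=c:d would give b (the text between the : and the =), which may or may not be what is wanted.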





Or:


awk -F'|' -v OFS='|' '
match($3, /=[^:]*/) {
  print $1, $2, substr($3, RSTART+1, RLENGTH-1)
}' < file.txt



GNU awk has a gensub() extension which brings the functionality of sed's s command into awk:




gawk -F'|' -v OFS='|' '
$3 ~ /=/ {
  print $1, $2, gensub(/^[^=]*=([^:]*).*/, "\\1", 1, $3)
}' < file.txt



Looks for = followed by any number of non-:s and extracts the part after =. The problem with gensub() is that you can't easily tell if the substitution was successful or not, hence the check that $3 contains = first.
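As a portable contrast to gensub(), the standard sub() function returns the number of substitutions it made, so its return value can serve as the success check. The sketch below is one possible way to use that, not part of the answer above; the function name extract_sub and the second sample line (which has no =) are made up for illustration.

```shell
# Hypothetical helper name; reads pipe-separated lines on stdin.
extract_sub() {
  awk -F'|' -v OFS='|' '
  {
    s = $3                           # work on a copy so $3 stays intact
    if (sub(/^[^=]*=/, "", s)) {     # returns 1 if an "=" was found and removed
      sub(/:.*/, "", s)              # drop the first ":" and everything after it
      print $1, $2, s
    }
  }'
}

# First line is from the question; the second is an invented non-matching line.
printf '%s\n' \
  'CCCC|Sess C1|s1 DA=yy07:@##;/u/t/we' \
  'EEEE|Sess C3|no marker here' | extract_sub
```

Only the first line is printed, with yy07 as its last field; the line without = is skipped, mirroring the $3 ~ /=/ condition used with gensub().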





With sed:




sed -n 's/^\([^|]*|[^|]*|\)[^=|]*=\([^:|]*\).*/\1\2/p' < file.txt



With perl:




perl -F'[|]' -lane 'print "$F[0]|$F[1]|$1" if $F[2] =~ /=([^:]*)/' < file.txt






Damn, you were faster. I tried with gawk: awk -F '|' -v OFS='|' '{print $1,$2,gensub(/^[^=]*=([^:]*).*$/, "\\1", "1", $3)}' < file.txt which is pretty much the same as your suggestion.

– rexkogitans
Sep 10 '18 at 14:04








@rexkogitans, thanks. That made me realise that my use of $3 = gensub(... as the condition was wrong.

– Stéphane Chazelas
Sep 10 '18 at 14:28








The OP does not mention a condition for the 3rd column at all. I assume all lines are formatted like this, so I suggest dropping the condition on the main block altogether.

– rexkogitans
Sep 10 '18 at 15:06






@rexkogitans, I've made them all print those 3 fields only if the 3rd field of the input is in the expected format. I'll leave it as is unless the OP clarifies what to do when the input is not in the expected format.

– Stéphane Chazelas
Sep 10 '18 at 15:33



I would try


awk -F\| 'BEGIN { OFS="|" }
{
  col=index($3,":");
  equ=index($3,"=");
  $3=substr($3,equ+1,col-equ-1);
  print ;
}' se



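where

-F\| sets the input field separator to | (the backslash stops the shell from treating it as a pipe)

equ=index($3,"="); stores the position of the first = in the third field (col does the same for the first :)

$3=substr($3,equ+1,col-equ-1); replaces the third field with the text between those two positions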
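The index()/substr() arithmetic assumes that the third field contains an = followed later by a :. A guarded variant might look like the sketch below; the function name extract_index is hypothetical, and passing non-matching lines through unchanged is an assumption, since the question does not say what should happen to them.

```shell
# Hypothetical helper name; reads pipe-separated lines on stdin.
extract_index() {
  awk -F'|' 'BEGIN { OFS = "|" }
  {
    equ = index($3, "=")          # position of the first "=" (0 if absent)
    col = index($3, ":")          # position of the first ":" (0 if absent)
    if (equ > 0 && col > equ)     # rewrite only when "=" comes before ":"
      $3 = substr($3, equ + 1, col - equ - 1)
    print                         # assumption: other lines pass through as-is
  }'
}

# Sample lines copied from the question.
printf '%s\n' \
  'CCCC|Sess C1|s1 DA=yy07:@##;/u/t/we' \
  'DDDDD|Sess C2|s4 DB=yy8:@##;/u/ba' | extract_index
```

For the first sample line, equ is 6 and col is 11, so substr($3, 7, 4) yields yy07. Assigning to $3 makes awk rebuild the line with OFS, which is why OFS is set to | in the BEGIN block.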



The first sub removes the first six characters of field 3, and the second sub removes everything from the colon onward (the colon included).


awk -F\| 'sub(/.{6}/,"",$3)sub(/:.*/,"")1' OFS=\| file
