Parsing URL correctly
Hi!
I'm currently trying to use the Sumo Logic SDK for Python with one of my queries, and after testing my script I noticed that it could not resolve the hostname for all of the sites.
Looking into the query, I get urls with executables, zips, and even characters that unicode doesn't properly decipher (examples: "down11246.yzzzn.com/index.html?%2F152926%2Fveryhuo1%2F���������������������������������������������������� (...)", "
13100.url.tudown.com/down/KMSpico(KMS????ForWin7%2FWin10%2FOffice2016)v11.2.1?????@240_101328. (...)", and "10553.url.7wkw.com/down/javascript????????%"). My question is: how can I remove everything after the top-level domain? I just want sitename.com and nothing else. Currently, I start the parsing as follows:
_sourceCategory=Threat/Malc0de
| parse "URL: *, IP Address: *, Country: *, ASN: *, MD5: *</description>" as url,ip,country,asn,md5
Any suggestions would be greatly appreciated!
Official comment
Hi Mark,
Great question! We can do this in Sumo Logic through the use of the parse regex operator like so:
| parse regex "(?<url_base>.*\.[a-zA-Z0-9]*\/)"
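This will create a new column on the fly called "url_base", which will contain the host name up to and including the top-level domain and its trailing slash (.com/ or .io/ or .org/, etc.), you get the picture :) One caveat: ".*" is greedy, so if a later part of the path also contains a dot followed by letters or digits and then a slash, the match will extend to that point.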
If you want to get rid of the "/" at the end after applying parse regex, you can add another line to your search statement so it looks like this:
| parse regex "(?<url_base>.*\.[a-zA-Z0-9]*\/)"
| parse field=url_base "*/" as url
The second line extracts the host name from "url_base" without the slash and saves it as another column called "url". This is a great example of applying regular expressions to match difficult patterns in your logs.
If you don't feel like seeing the "url_base" column after getting "url" in your results, you can use:
| fields - url_base
to strike that column from your schema after using it to get the "url" column.
Please let us know if this works for you!
Cheers,
Jason
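For anyone who wants to try the pattern outside of Sumo Logic first, here is a minimal Python sketch of how it behaves. Some assumptions: Python spells named groups as (?P<name>...), the FIRST_SLASH pattern and the url_base helper are hypothetical and not part of the reply above, and the third sample URL is made up to show a path that contains a dot followed by a slash.

```python
import re

# The pattern from the reply, with Python's named-group spelling.
GREEDY = re.compile(r"(?P<url_base>.*\.[a-zA-Z0-9]*/)")

# Hypothetical stricter variant: never cross a slash, so the match
# ends at the first "/" after the host name.
FIRST_SLASH = re.compile(r"(?P<url_base>[^/\s]+\.[a-zA-Z0-9]+)/")


def url_base(pattern, text):
    """Return the url_base capture, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group("url_base") if match else None


samples = [
    "down11246.yzzzn.com/index.html?%2F152926%2Fveryhuo1%2F",
    "10553.url.7wkw.com/down/javascript",
    "example.com/files/v1.2/setup.exe",  # made-up URL with ".2/" in the path
]

for sample in samples:
    print(url_base(GREEDY, sample), "|", url_base(FIRST_SLASH, sample))
```

On the first two samples both patterns stop at the host name. On the third, the greedy pattern runs on to "example.com/files/v1.2/", while the stricter variant returns "example.com".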
Hi Jason,
Thanks for your response! If I'm parsing the URL without regex, is there a way for me to extract the top level domain with parse field? Currently, I have the query starting as follows:
_sourceCategory=Threat/Malc0de
| parse "URL: *, IP Address: *, Country: *, ASN: *, MD5: *</description>" as url,ip,country,asn,md5
| parse field=url "*/" as url_base
This is not correctly parsing url.
EDIT: Jason, this works. I just hadn't updated the following lines to use url_base. Thanks for your help!
Best,
Mark
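If the trimming has to happen in the Python script instead of in the query, the same "everything before the first slash" idea can be sketched as below. This is only an illustration: host_only is a hypothetical helper, and it assumes the url values arrive as plain strings without a scheme such as http://, as in the examples in this thread.

```python
def host_only(url: str) -> str:
    """Return the part of url before the first '/'.

    Assumes the value has no scheme prefix. If there is no slash at all,
    the whole string is returned unchanged.
    """
    return url.partition("/")[0]


urls = [
    "13100.url.tudown.com/down/KMSpico(KMS????ForWin7%2FWin10%2FOffice2016)v11.2.1?????@240_101328.",
    "10553.url.7wkw.com/down/javascript????????%",
    "sitename.com",
]

for url in urls:
    print(host_only(url))
```

Because this only splits on the first slash, the odd characters later in the path never need to be decoded.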