Regex parsing problems
I tried to come up with a better title, but I had no luck.
I'm looking to do an analysis on our _sourceCategory, but I want to count by only the first 4 segments.
There are few constraints on that field, so all I can say is that it's split by a /.
I was able to build a POC online with a regex like this:
^((.+?)\/){0,3}(.+?[$\/])
This basically says: find between 0 and 3 groups that are "any char" up until a /, and then find one and only one group that is "any char up until a / or $ (end of line)".
That seems sound, but apparently Sumo handles the groupings differently than any online expression tool I found.
In fact, even if I try to simplify it to something like this:
^(?<cat>.+?\/){0,3}
it comes back picking only the second item in the list:
a/b/c/d/e/f -> b/
How it came up with that is terribly confusing to me. It does make a little sense, because the name is on the inner group, so the {0,3} part wouldn't apply to it (I don't think?). But if I move the quantifier inside the named group, I need to group it differently, and then Sumo chokes:
(?<cat>(.+?\/){0,3})
which oddly grabs the first character of each _sourceCategory.
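For comparison, here is a minimal sketch of how the same patterns behave in Python's re module. This is an assumption-laden stand-in: it says nothing definite about Sumo's engine, Python spells named groups (?P<name>...) rather than (?<name>...), and the sample strings are only illustrative. It shows two things that are easy to trip over: a quantified capture group keeps only its last repetition (Python returns c/ here, not the b/ reported above), and $ inside a character class is a literal dollar sign rather than end of line.

```python
import re

sample = "a/b/c/d/e/f"  # illustrative _sourceCategory value from the post

# Quantifier outside the named group: the group is overwritten on each
# repetition, so only the last one survives.
outer = re.match(r"^(?P<cat>.+?/){0,3}", sample)
print(outer.group("cat"))  # c/
print(outer.group(0))      # a/b/c/

# Quantifier inside the named group: the group spans all repetitions.
inner = re.match(r"^(?P<cat>(?:.+?/){0,3})", sample)
print(inner.group("cat"))  # a/b/c/

# Inside [...], $ is a literal character, so "a/b" cannot end on "b".
tail = re.match(r"^((.+?)/){0,3}(.+?[$/])", "a/b")
print(tail.group(0))       # a/
```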
-
Official comment
If your _sourceCategory has such a predictable pattern, I would avoid regex entirely and just use this:
| split _sourcecategory delim='/' extract 1 as part1, 2 as part2, 3 as part3, 4 as part4
If you opt to stick with regex, we will only extract the first matching pattern unless you end with the multi keyword or use two or more capture groups. Multi will duplicate the messages, so I don't recommend it for this use case. A regex that does the same as the split above would be something like this:
| parse regex field=_sourcecategory "/?(?<part1>[^/]+)/(?<part2>[^/]+)/(?<part3>[^/]+)/(?<part4>[^/]+)/?"
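To sanity-check the idea outside of Sumo, the split-and-count approach can be sketched in Python. This is only an illustration: category_prefix and the sample categories are hypothetical names and values, not anything from Sumo. One difference to keep in mind is that the regex above, as written, needs at least four segments to match, while this sketch simply keeps whatever segments exist, up to four.

```python
from collections import Counter


def category_prefix(source_category: str, depth: int = 4) -> str:
    """Return the first `depth` slash-separated segments (hypothetical helper)."""
    # Drop a leading or trailing slash so it does not produce an empty segment.
    segments = source_category.strip("/").split("/")
    return "/".join(segments[:depth])


# Made-up sample values standing in for real _sourceCategory data.
sample_categories = [
    "prod/web/nginx/access/us-east",
    "prod/web/nginx/access/eu-west",
    "/prod/web/nginx/error/us-east",
    "prod/db/mysql",
]

counts = Counter(category_prefix(c) for c in sample_categories)
for prefix, total in counts.most_common():
    print(prefix, total)
```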
Hope this helps.
Regards,
Matt