Query CSV upload
I’ve uploaded some CSV files, but the query below is taking forever to complete. Is there a better way to make it run faster?
_sourceName=/texto_file.csv | extract "(?<last>[^\"]*)\",\"(?<ip>[^\"]*)\",\"(?<asn>[^\"]*)\",(?<asn_code>[^,]*),\"(?<city>[^\"]*)\",\"(?<region>[^\"]*)\",\"(?<country>[^\"]*)\",\"(?<host>[^\"]*)\",\"(?<rdns_domain>[^\"]*)\",\"(?<trojan>[^\"]*)" | fields ip, asn, trojan, country, rdns_domain | count_distinct (ip) group by rdns_domain
-
In your case it looks like you are not too concerned about the exact format of the text within each field, so the "Parse Anchor" operator should work better for you and return more quickly. I can't see the actual text of one of your log messages, but based on your original expression the following should give you the same (or very similar) results and should come back a bit faster.
_sourceName=/texto_file.csv | parse "*\",\"*\",\"*\",*,\"*\",\"*\",\"*\",\"*\",\"*\",\"*" as last, ip, asn, asn_code, city, region, country, host, rdns_domain, trojan | fields ip, asn, trojan, country, rdns_domain | count_distinct (ip) group by rdns_domain
The extract operator can be a bit heavy in this case. "Parse Regex" (extract) is really better suited to pulling a specific text pattern out of the logs, such as an IP address or Social Security number. It requires additional processing because each message, and then each field, has to be matched against the expression, which can add time to the query.
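For example, if all you needed were the IP addresses, a targeted Parse Regex like this (just an illustrative sketch, not tied to your actual data) is the kind of case the operator is built for:
_sourceName=/texto_file.csv | parse regex "(?<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"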
Additionally, query time is affected by the length of the time range being queried and by how selective the query is, i.e. how many keywords you use to filter the messages.
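As a sketch, if you only cared about one trojan family (the keyword "zeus" below is just a placeholder), adding it to the scope narrows the set of messages before the parse runs:
_sourceName=/texto_file.csv "zeus" | parse "*\",\"*\",\"*\",*,\"*\",\"*\",\"*\",\"*\",\"*\",\"*" as last, ip, asn, asn_code, city, region, country, host, rdns_domain, trojan | fields ip, asn, trojan, country, rdns_domain | count_distinct(ip) group by rdns_domain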
-
Your query is over a 10-month period, so unfortunately it is expected that it would take quite a while to return all your results. You may want to try breaking it down into a few smaller time ranges (30-90 days) and running the queries in parallel.
This length of time range is actually not our standard use case. The availability of messages is typically bound by the retention period of the account (7 days in the case of Sumo Logic Free), and as a result most queries tend to fall within the contracted retention period.
-
My logs are only from July 1st. If you set that date as the start of the query, you will get the same results.
Maybe this product it's not for me. I'm also trying other options to base my future company. And for now your are losing for treasure-data, splunk storm, and others.
Why don't you treat csv files in fields? Instead of one field for everything.
-
Might the csv operator be faster for you, as in:
csv _raw extract 1 as last, 2 as ip, 3 as asn, 4 as asn_code, 5 as city etc.
Mind you, your parse statement is a way to run a field extraction on a CSV, since you can't use the csv operator in a field extraction rule, so your syntax is a lesson to me at least :-)
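For reference, here is a rough sketch of the full query using the csv operator, assuming the columns come out in the same order as in your original extract (the column numbers may need adjusting for your actual data):
_sourceName=/texto_file.csv | csv _raw extract 1 as last, 2 as ip, 3 as asn, 4 as asn_code, 5 as city, 6 as region, 7 as country, 8 as host, 9 as rdns_domain, 10 as trojan | fields ip, asn, trojan, country, rdns_domain | count_distinct(ip) group by rdns_domain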