GROK Pattern in Streamsets Log Parser

Streamsets Log Parser lets you parse and ingest log files from a server.
There are several predefined "Log Formats" to choose from, such as Common Log Format or Combined Log Format for Apache access logs.
However, if you have defined your own log format, "GROK" patterns are a great way to configure Log Parser to consume it.

The real challenge, however, is how to define your GROK pattern.
Test Grok Patterns (https://grokconstructor.appspot.com/do/match) is a great website where you can enter your GROK pattern and a log line and test whether they match.

It also provides an "Automatic" mode (https://grokconstructor.appspot.com/do/automatic).
This generates a GROK pattern for you based on the log line that you provide.

If your format is a customized version of the Apache access log, though, you can build the pattern yourself from the standard GROK patterns.

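For example, the GROK pattern for my access log line is given below. It is the standard Combined Log Format pattern with one extra trailing number, captured as responseTime.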

Log Line
103.107.92.250 - - [21/Apr/2019:17:34:35 +0530] "GET /form/track-shipment/ HTTP/1.1" 200 8324 "http://onlinexpress.co.in/form/track-shipment/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36" 400

Grok Pattern

%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{NUMBER:responseTime}
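
To see which fields the pattern pulls out without opening Streamsets, the same log line can be run through a plain regular expression. The sketch below is a hand-written approximation of the grok pattern as a POSIX extended regex, not what Streamsets runs internally: it captures only seven of the fields, requires the HTTP version to be present, and drops the rawrequest fallback. The names re and parse_access_line are made up for this illustration.

```shell
#!/bin/sh
# Sample line from this post.
line='103.107.92.250 - - [21/Apr/2019:17:34:35 +0530] "GET /form/track-shipment/ HTTP/1.1" 200 8324 "http://onlinexpress.co.in/form/track-shipment/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36" 400'

# Simplified stand-in for the grok pattern (an approximation, see above).
# Groups: 1 clientip, 2 timestamp, 3 verb, 4 request,
#         5 response, 6 bytes, 7 responseTime
re='^([^ ]+) [^ ]+ [^ ]+ \[([^]]+)\] "([A-Z]+) ([^ ]+) HTTP/[0-9.]+" ([0-9]+) ([0-9]+|-) "[^"]*" "[^"]*" ([0-9]+)$'

# Print the captured fields joined by "|"; print nothing if the line does not match.
parse_access_line() {
  printf '%s\n' "$1" | sed -nE "s#$re#\1|\2|\3|\4|\5|\6|\7#p"
}

parsed=$(parse_access_line "$line")

# Split the "|"-separated result into named variables.
IFS='|' read -r clientip timestamp verb request response bytes response_time <<EOF
$parsed
EOF

echo "clientip=$clientip"
echo "timestamp=$timestamp"
echo "verb=$verb request=$request"
echo "response=$response bytes=$bytes responseTime=$response_time"
```

If a line prints empty fields here, it will most likely fail in the grok tester as well, which makes this a quick first check before pasting the pattern into Streamsets.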


Streamsets Log Parser Configuration


In the screenshot, MYPATTERN is the custom name that I gave my pattern in the "GROK Pattern Definition" field.
The first word of the definition is always the pattern name, and that name is what you enter in the "GROK Pattern" field.

Setting up ELK and Streamsets in CWP (CentOS 7)

Getting started with ELK is pretty straightforward, but I ran into a few small hurdles lately, so I thought it would make sense to note them down and share them with all of you.

Step 1: Missing Java
The very first thing you will find missing is JDK 8, which all of these tools require.
Please do not install OpenJDK 1.8: in my experience it has missing packages and will land you in trouble later. Install Oracle Java 8 instead.
You can download the RPM from here -
https://www.oracle.com/technetwork/java/javaee/downloads/jdk8-downloads-2133151.html

File downloaded: jdk-8u211-linux-x64.rpm

Make it executable : chmod +x jdk-8u211-linux-x64.rpm

Install: sudo yum install jdk-8u211-linux-x64.rpm

Set JAVA_HOME: as the non-sudo user, run these commands
vi ~/.bash_profile
Add this line at the end: export JAVA_HOME=/usr/java/jdk1.8.0_211-amd64
(JAVA_HOME should name the JDK installation directory itself, not a bin directory inside it.)
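
After logging in again (or running source ~/.bash_profile), the value can be sanity-checked. Launch scripts for Java tools generally run $JAVA_HOME/bin/java, so a minimal check is whether that file exists and is executable. The helper name is_java_home is made up for this sketch.

```shell
#!/bin/sh
# Succeeds only if the given directory contains an executable bin/java.
is_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if is_java_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks usable: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or has no bin/java: ${JAVA_HOME:-<unset>}"
fi
```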



Step 2: Get Elasticsearch
Please do not download the ZIP package. There is also an RPM version, which installs Elasticsearch as a service. You can download it from here -
https://www.elastic.co/downloads/past-releases/elasticsearch-6-2-3

File downloaded: elasticsearch-6.2.3.rpm

Make it executable : chmod +x elasticsearch-6.2.3.rpm

Install: sudo yum install elasticsearch-6.2.3.rpm

Run: 
  sudo systemctl daemon-reload
  sudo systemctl enable elasticsearch.service
  sudo systemctl start elasticsearch.service
  systemctl status elasticsearch.service
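
Beyond the systemctl status output, it helps to confirm that Elasticsearch actually answers HTTP requests. The sketch below assumes curl is installed and that Elasticsearch listens on its default port 9200 on localhost. The helper name http_up is made up for this example.

```shell
#!/bin/sh
# Succeeds if the URL returns a successful HTTP response within 5 seconds.
http_up() {
  curl -fsS --max-time 5 "$1" >/dev/null 2>&1
}

if http_up "http://localhost:9200"; then
  echo "Elasticsearch is answering on port 9200"
else
  echo "No response on port 9200 yet; check: systemctl status elasticsearch.service"
fi
```

Elasticsearch can take some time after the service starts before it accepts connections, so a failed first attempt is worth retrying before digging into the logs.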


Step 3: Get Kibana
Please do not download the ZIP package. There is also an RPM version, which installs Kibana as a service. You can download it from here -
https://www.elastic.co/downloads/past-releases/kibana-6-2-3

(NOTE: Elasticsearch and Kibana should be the same version.)

File downloaded: kibana-6.2.3-x86_64.rpm

Make it executable : chmod +x kibana-6.2.3-x86_64.rpm

Install: sudo yum install kibana-6.2.3-x86_64.rpm

Run:
  sudo systemctl daemon-reload
  sudo systemctl enable kibana.service
  sudo systemctl start kibana.service
  systemctl status kibana.service


Step 4: Configure Kibana Host in config file
Location of Config file:  /etc/kibana/kibana.yml
Update these configurations -
server.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"
elasticsearch.username: "admin"
elasticsearch.password: "password"
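
The shipped kibana.yml is mostly commented-out examples, so it is easy to edit a line and leave it commented. A small sketch for listing only the settings that are really active follows. The helper name active_settings is made up for this example.

```shell
#!/bin/sh
# Print only the active (non-blank, non-comment) lines of a config file.
active_settings() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

conf=/etc/kibana/kibana.yml
if [ -r "$conf" ]; then
  active_settings "$conf"
else
  echo "cannot read $conf"
fi
```

Kibana reads this file at startup, so restart it after editing: sudo systemctl restart kibana.service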


Step 5: Get Streamsets
You can download the tarball from here:

wget https://archives.streamsets.com/datacollector/3.8.1/tarball/streamsets-datacollector-core-3.8.1.tgz

Untar the .tgz file: tar -zxf streamsets-datacollector-core-3.8.1.tgz
Increase the open-file limit (ulimit):
sudo vi /etc/security/limits.conf
Add these lines at the end of the file:
*    soft    nofile 65000
*    hard    nofile 65000
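
Changes in limits.conf apply only to new login sessions, so log out and back in before starting Streamsets. The sketch below compares the limit of the current shell with the 65000 configured above. The helper name limit_ok is made up for this example.

```shell
#!/bin/sh
# Succeeds if the current limit ($1) is at least the wanted value ($2).
limit_ok() {
  # "unlimited" always satisfies the requirement
  [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

want=65000
have=$(ulimit -n)

if limit_ok "$have" "$want"; then
  echo "open-file limit is $have: ok"
else
  echo "open-file limit is $have, expected at least $want; log out and in again"
fi
```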

Run:   nohup streamsets-datacollector-3.8.1/bin/streamsets dc &

Step 6: Enable Ports in Firewall
The following ports need to be opened: 5601 (Kibana) and 9200 (Elasticsearch).
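
How you open the ports depends on which firewall the server runs. As a sketch, assuming firewalld (the CentOS 7 default; a CWP install may use a different firewall, in which case these commands do not apply), the helper below only prints the commands so that they can be reviewed first. The name firewalld_commands is made up for this example.

```shell
#!/bin/sh
# Print (not run) the firewalld commands that open the given TCP ports.
firewalld_commands() {
  for port in "$@"; do
    printf 'firewall-cmd --permanent --add-port=%s/tcp\n' "$port"
  done
  # Reload so the permanent rules become active.
  printf 'firewall-cmd --reload\n'
}

firewalld_commands 5601 9200
```

Once the output looks right, it can be applied with: firewalld_commands 5601 9200 | sudo sh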