Hello everyone, today I will be talking about the core part of my work during my summer internship at Samsung Semiconductor India R&D Center, Bengaluru. Our project team consisted of four members and we were given the task to parse log files generated by a testing framework called as Jenkins (now a very popular tool for continuous integration and deployment of source code) and present the parsed data in a HTML, XLSX and JSON format. We were also required to develop plotting and charting utilities for the parsed data in the Excel files.
Log files are simple text file containing huge amount of information. These files are however unstructured and the data they contain is not in a presentable format and it cannot be grasped quickly. There is no organisation of data and no scope for visualization of the data. Therefore, parsing is a very important step if you want to work upon, present, analyze the data in your log files. We initially thought of using some existing popular tools like the MS log parser, the ELK (Elastic Search, Logstash, Kibana) stack, etc. But these tools require the log files to retain a fixed format throughout the file and proved to be unsuitable for our use. We then moved on to developing Python scripts (parsers) for the log files. Python gets its immense power through its humungous number of modules and a simple import of these modules can get any of your work done!
The python re module (regular expression) contains most of the functions that a parsing utility of any language would provide you with. This coupled with the string handling functions in Python will get your parsing work done. By parsing, I mean that the you will be able to extract the data from the log files and put them in Python data structures (list, dictionaries, strings and their combinations)
Making the JSON file is perhaps the most important step after parsing. The parsed data present in the Python data structures will be dumped to a JSON file. The JSON file format provides us with a very easy way to retrieve information just like data retrieval in arrays and dictionaries. The python JSON module can write data from a data structure to a JSON file and can load the data from a JSON file to a data structure
There are many Python modules for achieving this task. They include xlwt, openpyxl, xlrd, xlsxwriter, etc. The most popular is the xlsxwriter, perhaps because of its excellent online documentation and its extensive and diverse functionalities. Also, this is the only module that supports plotting on an Excel sheet and hence we opted for this module. The python data structures containing the parsed data are written in a tabular format in an Excel worksheet.
The xlsxwriter module contains utilities for graphing and charting. You can decide the type of chart (line, bar, column, pie), set its axes (units, scale, labels, linear/logarithmic), set chart legends, add data labels, formatting (chart color, plot color, trend lines, gridlines, border) and many more features.
The xlsx files are suitable when you want to store and export the parsed data. However, for presentation purposes, the HTML format is more suitable. There are command-line based tools like the sswriter and unoconv which help us with this. These are generic tools which can help in conversion of files from one format to another.
To conclude, I wanted to share my learnings during my internship tenure at Samsung and provide you with the knowledge of how data should be parsed and presented and the python modules/tools that can be used to achieve the same.