Linker IT Software
menubar-top-links menubar-top-rechts
Home Help Search Login
Welcome, Guest. Please Login.
SQL*XL: Database to Excel bridge litLIB: Excel power functions pack ExcelLock: Locking and securing your valuable Excel spreadsheets encOffice: Protect your Excel file easy and safe encOffice: Protect your Excel file easy and safe
Pages: 1
Analysing a huge text file in minutes (Read 2189 times)
Gerrit-Jan Linker
YaBB Administrator

Posts: 75
Analysing a huge text file in minutes
12.04.16 at 22:14:42
Analysing a huge text file in minutes
In this topic I will show that analysis of large text files is easy using SQL*XL.  
It is based on a real life problem with a php web server that ran over its disk quota. Analysing the problem, the php.log file had grown to 3.8 GB. Warning messages are logged in this file. The task at hand is to find out what the problems are on this php web site.
I will show that within minutes, with the help of SQL*XL, 20 problems could be identified that covered 90% of the contents of the php.log file. I will show you step by step how this was done.  
Step 1: Obtaining a copy of the file.  
I used an ftp program to download the file file. I renamed the file to php.txt. The file is too large to open with notepad. I also tried notepad++, ultraedit and PFE, all with not too good results.
Step 2: Access the file through SQL
Using the SQL*XL software, Excel can be connected to this text file. Once connected, SQL can be used to analyse its contents. SQL*XL is going to 'connect' to the directory that contains the text file.  Files in the directory can be accessed as tables.
To set up a new connection in SQL*XL, using the connect button in the SQL*XL ribbon in MS Excel. Select to setup a new connection and start the Database Connection Wizard. Select text/csv as the database type and specify the directory.  
Now, SQL*XL will try out several connections. On my machine it managed to connect through two drivers: an ODBC driver and an OLEDB connection through the Microsoft Jet driver. I chose the Jet driver.  
Step 3: Counting the number of lines in the text file:
In SQL*XL's SQL dialog the number of lines in the file can be retrieved with a 'count(*) query':
select count(*) from [php.txt];
The query result is put in Excel cells: 23287499. So there are over 23 million rows to be analysed. Note that the table name should always be put in between square brackets [].

Step 4: Retrieve a selection of the file data:
In SQL dialog a 'top x query' was typed and run:
select top 100 * from [php.txt];
It retireved 100 rows from the file. The data was like this:
'[18-Jan-2016 11:55:30 Europe/Amsterdam] PHP Notice:  Undefined index: action in /home/1234/sites/abc/ on line 12
All warning messages seem to start with a date/time location statement after of which the actual warning is displayed. We need to skip the first part of the line so we can work only with the warning messages.
Step 5: Define the layout of the table
It helps to let SQL*XL view the table as having 2 columns. The first can contain all the data we don't need, the date, time and location, and a second column that we can analyse further.
We can specify this by creating a small text file in the same directory as the php.txt file. Name the filie schema.ini. I entered the following information in the schema.ini file to indicate that the php.txt file contains fixed length data, that it contains no column headers and that it contains two columns. I specified that the first 40 characters to form part of column a and the next 100 characters are regarded as column b:
Col1=a Text Width 40
Col2=b Text Width 100

Step 6: Retrieve the unique messages and count their occurrences
A 'group by query' can be used for this. In the SQL dialog type:
select count(b), b from [php.txt] group by b order by count(b) desc;
Explanation of this query:
I wish to retrieve the warning message of column b but also count how often that message occurs in the file. Hence the select count(b), b. Group by b specifies that the data should be grouped by column b. Using the order by clause I specified that the messages with the highest occurrence should be listed first.
The result:
No      per.      number      warning
1      12%      2694801      PHP Notice:  Undefined index: escape_output in /home/1234/sites/abcd/
2      11%      2610967      PHP Notice:  Undefined property: stdClass::$edittime in /home/1234/sites/abcd/
3      11%      2598029      PHP Notice:  Undefined index: target in /home/1234/sites/abcd/
4      9%      2206085      PHP Notice:  Use of undefined constant MODULE_NAME - assumed 'MODULE_NAME'  
5      9%      2031366      PHP Notice:  Undefined index: function in /home/1234/sites/abcd/
6      7%      1515632      PHP Notice:  Undefined index: width in /home/1234/sites/abcd/
7      5%      1261983      PHP Deprecated:  Function ereg() is deprecated in /home/1234/sites/abcd/
8      4%      1021392      PHP Notice:  Undefined index: prijs in /home/1234/sites/abcd/
9      4%      1021028      PHP Notice:  Undefined index: prijs_uitleg in /home/1234/sites/abcd/
10      3%      739672      PHP Notice:  Constant QDOM_NEWLINE_STR already defined in /home/1234/sites/abcd/
To be fair, I added the first two columns by hand with Excel.  
Back to top
« Last Edit: 13.04.16 at 08:31:47 by Gerrit-Jan Linker »  

Gerrit-Jan Linker
Linker IT Software
Email WWW Gerrit-Jan Linker   IP Logged
Pages: 1