Extraction
String extraction is one of the most common tasks a programmer faces. It gets harder when the useful data is buried in text that has no clean, predictable format. Here are some helpful cases.
1. Extracting Network Strings
1.1. Extracting MAC address from string
We need to extract all MAC addresses from an arbitrary string
mac = "ads fs:ad fa:fs:fe: Wind00-0C-29-38-1D-61ows 1100:50:7F:E6:96:20dsfsad fas fa1 3c:77:e6:68:66:e9 f2"
Using Regular Expressions
The regular expression should support both the Windows (dash-separated) and Linux (colon-separated) MAC address formats.
Let's find our MAC addresses
mac_regex = /(?:[0-9A-F][0-9A-F][:\-]){5}[0-9A-F][0-9A-F]/i
mac.scan mac_regex
Returns
["00-0C-29-38-1D-61", "00:50:7F:E6:96:20", "3c:77:e6:68:66:e9"]
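The pattern above accepts mixed separators inside one address (for example 00-0C:29-38-1D-61). If that matters, one possible refinement is to capture the first separator and require it again with a backreference. This is only a sketch: MAC_STRICT, extract_macs and normalize_mac are hypothetical names that are not part of the original example, and the lowercase colon-separated form is an assumed convention.

```ruby
# Hypothetical stricter pattern: \h matches one hex digit, ([:-]) captures the
# first separator and \1 requires the same separator between the other pairs.
MAC_STRICT = /\h{2}([:-])(?:\h{2}\1){4}\h{2}/

# Because the pattern has a capture group, scan without a block would return
# only the separators, so the whole match is read from Regexp.last_match.
def extract_macs(text)
  macs = []
  text.scan(MAC_STRICT) { macs << Regexp.last_match(0) }
  macs
end

# Assumed normal form: lowercase, colon-separated.
def normalize_mac(mac)
  mac.downcase.tr('-', ':')
end

mac = "ads fs:ad fa:fs:fe: Wind00-0C-29-38-1D-61ows 1100:50:7F:E6:96:20dsfsad fas fa1 3c:77:e6:68:66:e9 f2"
p extract_macs(mac).map { |m| normalize_mac(m) }
# => ["00:0c:29:38:1d:61", "00:50:7f:e6:96:20", "3c:77:e6:68:66:e9"]
```

Normalizing the results makes it easier to compare or deduplicate addresses collected from different operating systems.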
1.2. Extracting IPv4 address from string
We need to extract all IPv4 addresses from an arbitrary string
ip = "ads fs:ad fa:fs:fe: Wind10.0.4.5ows 11192.168.0.15dsfsad fas fa1 20.555.1.700 f2"
ipv4_regex = /(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/
Let's find our IPs
ip.scan ipv4_regex
Returns
[["10", "0", "4", "5"], ["192", "168", "0", "15"]]
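scan returns one array of octets per match here because the pattern contains capture groups. If whole address strings are preferred, one option is to use non-capturing groups instead. A minimal sketch, where OCTET and IPV4_WHOLE are names chosen for this illustration:

```ruby
# One octet (0-255), written as a non-capturing group.
OCTET = /(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/
# Four octets separated by dots; with no capture groups, scan returns whole matches.
IPV4_WHOLE = /#{OCTET}(?:\.#{OCTET}){3}/

ip = "ads fs:ad fa:fs:fe: Wind10.0.4.5ows 11192.168.0.15dsfsad fas fa1 20.555.1.700 f2"
p ip.scan(IPV4_WHOLE)
# => ["10.0.4.5", "192.168.0.15"]
```

Like the original pattern, this sketch does not anchor on digit boundaries, which is why it still picks 192.168.0.15 out of 11192.168.0.15.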
1.3. Extracting IPv6 address from string
https://gist.github.com/cpetschnig/294476
http://snipplr.com/view/43003/regex--match-ipv6-address/
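Besides the regular expressions linked above, another possible approach is to let Ruby's standard IPAddr class do the validation. The sketch below assumes the candidate addresses are separated by whitespace, which is a simplification compared to the arbitrary strings used earlier. extract_ipv6 is a hypothetical helper name.

```ruby
require 'ipaddr'

# Assumption: candidates are whitespace-separated tokens.
# Every token that contains a colon is handed to IPAddr for validation.
def extract_ipv6(text)
  text.split.select do |token|
    next false unless token.include?(':')
    begin
      IPAddr.new(token).ipv6?
    rescue IPAddr::Error
      false # not a parsable address
    end
  end
end

p extract_ipv6("log ::1 and 2001:db8::ff00:42:8329 then fa:fs:fe: end")
# => ["::1", "2001:db8::ff00:42:8329"]
```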
2. Extracting Web Strings
2.1. Extracting URLs from file
Assume we have the following string
string = "text here http://foo1.example.org/bla1 and http://foo2.example.org/bla2 and here mailto:[email protected] and here also."
Using Regular Expressions
string.scan(/https?:\/\/[\S]+/)
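Using the standard URI module (note that URI.extract is marked obsolete in recent Ruby versions and prints a warning in verbose mode). This returns an array of URLs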
require 'uri'
URI.extract(string, ["http" , "https"])
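Once the URLs are extracted, a common next step is to reduce them to host names. The following sketch combines the regular expression above with URI.parse. The extract_hosts name and the trimming of trailing punctuation are assumptions made for this example.

```ruby
require 'uri'

# Hypothetical helper: returns the unique host names of all http(s) URLs in text.
def extract_hosts(text)
  hosts = text.scan(%r{https?://\S+}).map do |candidate|
    # \S+ also swallows punctuation that directly follows a URL, so trim it.
    url = candidate.sub(/[.,;:!?)\]]+\z/, '')
    begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil # skip anything URI cannot parse
    end
  end
  hosts.compact.uniq
end

text = "see http://foo1.example.org/bla1, and https://foo2.example.org/bla2."
p extract_hosts(text)
# => ["foo1.example.org", "foo2.example.org"]
```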
2.2. Extracting URLs from web page
Using the tricks above
require 'net/http'
URI.extract(Net::HTTP.get(URI.parse("http://rubyfu.net")), ["http", "https"])
or using a regular expression
require 'net/http'
Net::HTTP.get(URI.parse("http://rubyfu.net")).scan(/https?:\/\/[\S]+/)
2.3. Extracting Email Addresses from Web Page
require 'net/http'
Net::HTTP.get(URI.parse("http://pastebin.com/khAmnhsZ")).scan(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i).uniq
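To try the same idea without a network request, the sketch below runs a similar pattern on a local string. It makes two changes that are assumptions of this example and not part of the original: the top-level domain part is relaxed to {2,} so longer TLDs such as .museum match, and addresses are lowercased before uniq so that case variants collapse into one entry.

```ruby
# Same pattern as above, except the TLD length is {2,} instead of {2,4}.
EMAIL_REGEX = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/i

# Hypothetical helper: unique, lowercased addresses found in text.
def extract_emails(text)
  text.scan(EMAIL_REGEX).map(&:downcase).uniq
end

page = "Contact: Admin@Example.com, admin@example.com; sales@shop.museum."
p extract_emails(page)
# => ["admin@example.com", "sales@shop.museum"]
```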
2.4. Extracting Strings from HTML tags
Assume we have the following HTML content and we need to keep only the text, removing all HTML tags.
string = "<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is another <strong>contents</strong>.</p>
</body>
</html>"
puts string.gsub(/<.*?>/,'').strip
Returns
Page Title
This is a Heading
This is another contents.
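Removing the tags leaves empty lines where tag-only lines used to be. If a clean list of text lines is wanted, one possible follow-up is to strip each line and drop the empty ones. This is a sketch for simple markup only; visible_text is a hypothetical helper name, and a regular expression will not cope with script or style blocks or with comments that contain a > character, where a real HTML parser such as Nokogiri is the safer choice.

```ruby
# Hypothetical helper: returns the non-empty text lines of a simple HTML string.
def visible_text(html)
  html.gsub(/<.*?>/m, '') # /m lets a single tag span several lines
      .lines
      .map(&:strip)
      .reject(&:empty?)
end

html = <<~HTML
  <!DOCTYPE html>
  <html>
  <head>
  <title>Page Title</title>
  </head>
  <body>
  <h1>This is a Heading</h1>
  <p>This is another <strong>contents</strong>.</p>
  </body>
  </html>
HTML

p visible_text(html)
# => ["Page Title", "This is a Heading", "This is another contents."]
```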