Extraction

String extraction is one of the most common tasks programmers face. It gets harder when the useful data is buried in a string with no convenient structure. Here are some examples.

1. Extracting Network Strings

1.1. Extracting MAC address from string

We need to extract all MAC addresses from an arbitrary string

mac = "ads fs:ad fa:fs:fe: Wind00-0C-29-38-1D-61ows 1100:50:7F:E6:96:20dsfsad fas fa1 3c:77:e6:68:66:e9 f2"

Using Regular Expressions

The regular expression should support both the Windows (hyphen-separated) and Linux (colon-separated) MAC address formats.

Let's find our MAC addresses

mac_regex = /(?:[0-9A-F][0-9A-F][:\-]){5}[0-9A-F][0-9A-F]/i
mac.scan mac_regex

Returns

["00-0C-29-38-1D-61", "00:50:7F:E6:96:20", "3c:77:e6:68:66:e9"]
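As a minimal sketch, the same pattern can be wrapped in a small helper that also normalizes the results, so that the Windows and Linux notations compare equal. The `extract_macs` method and the `MAC_REGEX` constant are illustrative names that are not part of the original example; `\h` is Ruby's shorthand for a hexadecimal digit.

```ruby
# Same idea as the pattern above, written with \h (any hex digit).
MAC_REGEX = /(?:\h{2}[:-]){5}\h{2}/

# Hypothetical helper: extract MAC addresses and normalize them to
# lowercase, colon-separated form, dropping duplicates.
def extract_macs(text)
  text.scan(MAC_REGEX).map { |m| m.tr('-', ':').downcase }.uniq
end

sample = "ads fs:ad fa:fs:fe: Wind00-0C-29-38-1D-61ows 1100:50:7F:E6:96:20dsfsad fas fa1 3c:77:e6:68:66:e9 f2"
p extract_macs(sample)
# => ["00:0c:29:38:1d:61", "00:50:7f:e6:96:20", "3c:77:e6:68:66:e9"]
```

Like the original pattern, this accepts mixed separators within one address; a stricter variant would need a backreference to the first separator.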

1.2. Extracting IPv4 address from string

We need to extract all IPv4 addresses from an arbitrary string

ip = "ads fs:ad fa:fs:fe: Wind10.0.4.5ows 11192.168.0.15dsfsad fas fa1 20.555.1.700 f2"
ipv4_regex = /(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/

Let's find our IPs

ip.scan ipv4_regex

Returns (one array of octets per match, because the pattern uses capture groups)

[["10", "0", "4", "5"], ["192", "168", "0", "15"]]
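Because the pattern uses capturing groups, `scan` returns the octets of each match rather than the full address. A possible variation, assuming whole addresses are wanted, switches to non-capturing groups. The `OCTET` and `IPV4_REGEX` names are illustrative.

```ruby
# One octet (0-255), wrapped in a non-capturing group.
OCTET = /(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/

# Four octets separated by dots; no capture groups, so scan
# returns each match as a single string.
IPV4_REGEX = /#{OCTET}(?:\.#{OCTET}){3}/

ip = "ads fs:ad fa:fs:fe: Wind10.0.4.5ows 11192.168.0.15dsfsad fas fa1 20.555.1.700 f2"
addresses = ip.scan(IPV4_REGEX)
p addresses
# => ["10.0.4.5", "192.168.0.15"]
```

The pattern is deliberately unanchored so it can find addresses glued to other text, as in the sample string. The trade-off is that it can also match inside a longer digit run: `"999.1.1.1"` yields `"99.1.1.1"`.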

1.3. Extracting IPv6 address from string

See the following references:

https://gist.github.com/cpetschnig/294476

http://snipplr.com/view/43003/regex--match-ipv6-address/
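As an alternative to a long hand-written pattern, one possible sketch collects candidate tokens with a loose regular expression and lets the standard `IPAddr` class decide which ones are valid IPv6 addresses. The `extract_ipv6` helper and the sample text are assumptions made for illustration, not taken from the references above.

```ruby
require 'ipaddr'

# Hypothetical helper: collect runs of hex digits and colons, then keep
# only the ones that IPAddr accepts as IPv6 addresses.
def extract_ipv6(text)
  text.scan(/[0-9A-Fa-f:]+/).select do |candidate|
    # An IPv6 address has at least two colons; skip everything else early.
    next false if candidate.count(':') < 2

    begin
      IPAddr.new(candidate).ipv6?
    rescue IPAddr::Error
      false # not a valid address (for example, a MAC address)
    end
  end
end

text = "gw fe80::1ff:fe23:4567:890a dns 2001:db8::1 mac 00:50:7F:E6:96:20"
p extract_ipv6(text)
# => ["fe80::1ff:fe23:4567:890a", "2001:db8::1"]
```

This is a heuristic: it ignores IPv4-mapped forms and zone identifiers, and very short compressed forms can appear by accident in ordinary text, so the results may need further filtering.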

2. Extracting Web Strings

2.1. Extracting URLs from file

Assume we have the following string

string = "text here http://foo1.example.org/bla1 and http://foo2.example.org/bla2 and here mailto:[email protected] and here also."


Using Regular Expressions

string.scan(/https?:\/\/[\S]+/)

Using the standard URI module. This returns an array of URLs. Note that recent Ruby versions mark URI.extract as obsolete.

require 'uri'
URI.extract(string, ["http" , "https"])
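Once the URLs are extracted, they can be handed to `URI.parse` to pull out individual components. The following is a small sketch under that assumption; the `text`, `urls`, and `hosts` variables are illustrative, and the sample string is a shortened version of the one above.

```ruby
require 'uri'

text = "text here http://foo1.example.org/bla1 and http://foo2.example.org/bla2 and here also."

# Regex-based extraction, as in the example above.
urls = text.scan(%r{https?://\S+})

# Parse each match and keep its host; skip anything URI rejects.
hosts = urls.map do |url|
  begin
    URI.parse(url).host
  rescue URI::InvalidURIError
    nil
  end
end.compact

p urls
# => ["http://foo1.example.org/bla1", "http://foo2.example.org/bla2"]
p hosts
# => ["foo1.example.org", "foo2.example.org"]
```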

2.2. Extracting URLs from web page

Using the tricks above

require 'net/http'
URI.extract(Net::HTTP.get(URI.parse("http://rubyfu.net")), ["http", "https"])

or using a regular expression

require 'net/http'
Net::HTTP.get(URI.parse("http://rubyfu.net")).scan(/https?:\/\/[\S]+/)
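In HTML, a URL is usually followed directly by a quote or an angle bracket, so `\S+` tends to swallow markup along with the link. A possible refinement, sketched here with the hypothetical `extract_urls` and `urls_from_page` helpers, stops the match at quotes and angle brackets. The sample HTML is made up so the example runs without network access.

```ruby
require 'net/http'
require 'uri'

# Stop at whitespace, quotes, and angle brackets so markup is not captured.
URL_REGEX = %r{https?://[^\s"'<>]+}

# Hypothetical helper: unique URLs found in an HTML string.
def extract_urls(html)
  html.scan(URL_REGEX).uniq
end

# Hypothetical helper: fetch a page and extract its URLs (needs network).
def urls_from_page(url)
  extract_urls(Net::HTTP.get(URI.parse(url)))
end

html = '<a href="http://rubyfu.net/a">A</a> <img src="https://example.org/logo.png"> <a href="http://rubyfu.net/a">again</a>'
p extract_urls(html)
# => ["http://rubyfu.net/a", "https://example.org/logo.png"]
```

Calling `urls_from_page("http://rubyfu.net")` would apply the same extraction to a live page.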

2.3. Extracting Email Addresses from Web Page

require 'net/http'
Net::HTTP.get(URI.parse("http://pastebin.com/khAmnhsZ")).scan(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i).uniq
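The `{2,4}` quantifier in the pattern above limits the top-level domain to four letters, so addresses under longer domains are skipped. A hedged variation, using the illustrative `EMAIL_REGEX` and `extract_emails` names and a made-up sample string instead of a fetched page, relaxes that limit and normalizes case before removing duplicates.

```ruby
# Same pattern as above, but the top-level domain may be 2 or more letters.
EMAIL_REGEX = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/i

# Hypothetical helper: unique, lowercased email addresses found in text.
def extract_emails(text)
  text.scan(EMAIL_REGEX).map(&:downcase).uniq
end

sample = "Contact alice@example.com, Bob@Example.org or alice@example.com; sales@shop.museum"
p extract_emails(sample)
# => ["alice@example.com", "bob@example.org", "sales@shop.museum"]
```

This remains a pragmatic pattern rather than a full validator; it is meant for harvesting likely addresses, not for verifying them.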

2.4. Extracting Strings from HTML tags

Assume we have the following HTML content and we need to keep only the text, removing all HTML tags.


string = "<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p>This is another <strong>contents</strong>.</p>

</body>
</html>"


puts string.gsub(/<.*?>/,'').strip

Returns

Page Title



This is a Heading
This is another contents.
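The output above keeps the blank lines left behind by the removed tags, and the pattern does not match a tag that spans several lines. One possible refinement, sketched here as a hypothetical `strip_tags` helper with a made-up HTML sample, adds the `/m` flag so `.` also matches newlines and then drops the empty lines.

```ruby
# Hypothetical helper: remove tags, then drop empty lines.
def strip_tags(html)
  # /m lets "." match newlines, so tags split across lines are removed too.
  text = html.gsub(/<.*?>/m, '')
  text.lines.map(&:strip).reject(&:empty?).join("\n")
end

html = <<~HTML
  <html>
  <body>

  <h1>This is a Heading</h1>
  <p>This is another <strong>contents</strong>.</p>
  <a
    href="http://rubyfu.net">Rubyfu</a>

  </body>
  </html>
HTML

puts strip_tags(html)
# This is a Heading
# This is another contents.
# Rubyfu
```

Regular expressions are fine for quick extraction like this, but they do not handle comments, scripts, or HTML entities; for arbitrary real-world pages an HTML parser such as Nokogiri is generally more robust.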
