[regex] Pattern matching with Java regular expressions (removing HTML)

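Java also supports pattern matching with regular expressions.

The following source code uses regular expressions to remove spaces, digits, and English letters from a string.

Source code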

import java.util.regex.*;

....

private String removeChar(String inp) {
    // Remove spaces
    String tmp = inp.replaceAll(" ", "");
    // Remove digits
    tmp = this.removeRex("[0-9]", tmp);
    // Remove English letters
    tmp = this.removeRex("[a-zA-Z]", tmp);
    return tmp;
}

// Remove every match of the given pattern
private String removeRex(String rex, String inp) {
    Pattern numP = Pattern.compile(rex);
    Matcher mat = numP.matcher(inp);
    // Replace each match with the empty string
    return mat.replaceAll("");
}
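For comparison, the same cleanup can be sketched with patterns that are compiled once and reused, since `Pattern` objects are immutable and safe to share. This is only a minimal sketch, not the original author's code: the class name `CharStripper` and the constant names are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name CharStripper is not from the original post.
class CharStripper {
    // Compile each pattern once instead of on every call.
    private static final Pattern DIGITS = Pattern.compile("[0-9]");
    private static final Pattern LATIN_LETTERS = Pattern.compile("[a-zA-Z]");

    static String removeChar(String input) {
        // String.replace does a literal replacement, so no regex is needed for spaces.
        String tmp = input.replace(" ", "");
        tmp = DIGITS.matcher(tmp).replaceAll("");
        return LATIN_LETTERS.matcher(tmp).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeChar("abc 123 #! x9")); // prints #!
    }
}
```

Compiling inside the method, as the listing above does, also works; hoisting the patterns into constants mainly helps when the method is called many times.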

The same kind of regular expression can be used to remove HTML tags.

The following function removes HTML tags.

Source code

import java.util.regex.Matcher;

import java.util.regex.Pattern;

....

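// bf and pw are declared in the elided code; presumably a reader and a writer.
while (true) {
    String str = bf.readLine();
    if (str == null) break;               // end of input
    if (str.length() == 0) pw.println();  // pass blank lines through to pw
    str = this.removeTag(str);
    System.out.println(str);
}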
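Because the loop above hands `removeTag` one line at a time, its `Pattern.DOTALL` patterns can only match a script or style block that opens and closes on the same line. One possible workaround is to read the whole input first and strip it in a single pass. The following is a hedged sketch under that assumption; `WholeInputReader`, `readAll`, and `stripScripts` are hypothetical names, and only the script pattern is shown to keep it short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original post.
class WholeInputReader {
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*>.*?</script>", Pattern.DOTALL);

    // Reads every line and joins them with '\n' so multi-line blocks stay intact.
    static String readAll(BufferedReader reader) {
        StringBuilder sb = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "before\n<script>\nvar a = 1;\n</script>\nafter\n";
        String text = stripScripts(readAll(new BufferedReader(new StringReader(html))));
        System.out.print(text); // the three script lines are gone; "before" and "after" remain
    }
}
```

The trade-off is memory: the whole document is held in one string, which is usually acceptable for a single HTML page.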

....

public String removeTag(String str) {
    Matcher mat;

    // Remove <script> and <noscript> blocks together with their contents
    Pattern script = Pattern.compile("<(no)?script[^>]*>.*?</(no)?script>", Pattern.DOTALL);
    mat = script.matcher(str);
    str = mat.replaceAll("");

    // Remove <style> blocks; the lazy .*? stops at the first closing </style>
    Pattern style = Pattern.compile("<style[^>]*>.*?</style>", Pattern.DOTALL);
    mat = style.matcher(str);
    str = mat.replaceAll("");

    // Remove ordinary tags; quoted attribute values may contain '>'
    Pattern tag = Pattern.compile("<(\"[^\"]*\"|\'[^\']*\'|[^\'\">])*>");
    mat = tag.matcher(str);
    str = mat.replaceAll("");

    // Remove leftover tags (ntag) the previous pattern missed, such as a tag with an unbalanced quote
    Pattern ntag = Pattern.compile("<\\w+\\s+[^<]*\\s*>");
    mat = ntag.matcher(str);
    str = mat.replaceAll("");

    // Remove entity references such as &nbsp;
    Pattern Eentity = Pattern.compile("&[^;]+;");
    mat = Eentity.matcher(str);
    str = mat.replaceAll("");

    // Remove runs of two or more whitespace characters (single spaces are kept)
    Pattern wspace = Pattern.compile("\\s\\s+");
    mat = wspace.matcher(str);
    str = mat.replaceAll("");

    return str;
}
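The steps of `removeTag` can also be sketched as a small self-contained class with precompiled patterns. This is an illustrative variant rather than the author's method, and it deviates in three labeled ways: the block patterns are case-insensitive, the entity pattern is narrowed to `&#?\w+;`, and whitespace runs collapse to a single space instead of being deleted. The class name `HtmlStripper` and all constant names are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical class; only the overall sequence of steps follows the post.
class HtmlStripper {
    private static final int BLOCK_FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
    private static final Pattern SCRIPT =
            Pattern.compile("<(no)?script[^>]*>.*?</(no)?script>", BLOCK_FLAGS);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*>.*?</style>", BLOCK_FLAGS);
    private static final Pattern TAG =
            Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
    private static final Pattern NTAG = Pattern.compile("<\\w+\\s+[^<]*\\s*>");
    // Assumption: narrower than the post's "&[^;]+;" so a stray '&' in text is left alone.
    private static final Pattern ENTITY = Pattern.compile("&#?\\w+;");
    private static final Pattern SPACES = Pattern.compile("\\s\\s+");

    static String removeTag(String html) {
        String s = SCRIPT.matcher(html).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        s = TAG.matcher(s).replaceAll("");
        s = NTAG.matcher(s).replaceAll("");
        s = ENTITY.matcher(s).replaceAll("");
        // Assumption: collapse to one space so neighbouring words do not merge.
        return SPACES.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<STYLE>p{color:red}</STYLE><p class=\"a>b\">Hello,&nbsp;  "
                + "<b>world</b></p><script>var x = 1 < 2;</script>";
        System.out.println(removeTag(html)); // prints Hello, world
    }
}
```

Regular expressions like these are adequate for rough text extraction; malformed or deeply nested markup is generally better handled by an HTML parser.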

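Printing the matched parts

// Matches a bracketed section at the end of the string
Pattern script = Pattern.compile("\\[.*\\]$");
Matcher mat = script.matcher(str);
// find() advances to the next match; group() returns the matched text
while (mat.find()) {
    System.out.println(mat.group());
}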
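The `find()` and `group()` loop above extends naturally to capture groups, where `group(1)` returns only the part inside the first pair of parentheses. Below is a minimal sketch assuming a lazy pattern, `\[(.*?)\]`, that treats each bracketed chunk as its own match; the class name `MatchPrinter` and the method `bracketContents` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original post.
class MatchPrinter {
    // Lazy quantifier, so "[a] x [b]" yields two matches rather than one.
    private static final Pattern BRACKETED = Pattern.compile("\\[(.*?)\\]");

    static List<String> bracketContents(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = BRACKETED.matcher(text);
        while (m.find()) {
            // group() or group(0) is the whole match; group(1) is the text inside the brackets.
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(bracketContents("[regex] Java [HTML] notes")); // prints [regex, HTML]
    }
}
```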

Source: http://ra2kstar.tistory.com/119

fmd1225's One day
