How to Fetch Web Page Content with a Crawler in Java

Front-End Technology 2023/09/04 Java

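This article shows, by example, how to fetch the content of a web page with a crawler written in Java. It is shared here for reference. Details follow:

I have recently been looking into web crawling with Java and have just gotten started, so I would like to share what I have learned.
Two methods are given below: one uses a package provided by Apache, the other uses classes built into Java.

The code is as follows: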

// Method 1
// Uses the package provided by Apache; simple and convenient.
// It requires the following jars: commons-codec-1.4.jar
// commons-httpclient-3.1.jar
// commons-logging-1.0.4.jar
// Imports needed at the top of the file:
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public static String createhttpClient(String url, String param) {
  HttpClient client = new HttpClient();
  String response = null;
  String keyword = null;
  PostMethod postMethod = new PostMethod(url);
//  try {
//   if (param != null)
//    keyword = new String(param.getBytes("gb2312"), "ISO-8859-1");
//  } catch (UnsupportedEncodingException e1) {
//   e1.printStackTrace();
//  }
  // NameValuePair[] data = { new NameValuePair("keyword", keyword) };
  // // Put the form values into postMethod
  // postMethod.setRequestBody(data);
  // The block above fetches the page with a parameter. I commented it out;
  // uncomment it if you want to experiment with it.
  try {
   int statusCode = client.executeMethod(postMethod);
   // Decode the raw response bytes directly. Note that gb2312 must match
   // the encoding of the page you are fetching.
   response = new String(postMethod.getResponseBody(), "gb2312");
   // Remove named entities such as &nbsp; and then the HTML tags
   String p = response.replaceAll("&[a-zA-Z]{1,10};", "")
     .replaceAll("<[^>]*>", "");
   System.out.println(p);
  } catch (Exception e) {
   e.printStackTrace();
  } finally {
   // Return the connection to the client when done
   postMethod.releaseConnection();
  }
  return response;
}
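// Method 2
// Uses the URL class built into Java to fetch the page content.
// Imports needed at the top of the file:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public String getPageContent(String strUrl, String strPostRequest,
   int maxLength) {
  // Buffer that collects the fetched page
  StringBuilder buffer = new StringBuilder();
  try {
   URL newUrl = new URL(strUrl);
   HttpURLConnection hConnect = (HttpURLConnection) newUrl
     .openConnection();
   // 5-second connect and read timeouts, set per connection
   hConnect.setConnectTimeout(5000);
   hConnect.setReadTimeout(5000);
   // Extra data to send as a POST body
   if (strPostRequest != null && strPostRequest.length() > 0) {
    hConnect.setDoOutput(true);
    OutputStreamWriter out = new OutputStreamWriter(hConnect
      .getOutputStream());
    out.write(strPostRequest);
    out.flush();
    out.close();
   }
   // Read the content. The reader uses the platform default charset; pass a
   // charset name as the second InputStreamReader argument if the page differs.
   BufferedReader rd = new BufferedReader(new InputStreamReader(
     hConnect.getInputStream()));
   int ch;
   // A maxLength of 0 or less means no limit
   for (int length = 0; (ch = rd.read()) > -1
     && (maxLength <= 0 || length < maxLength); length++)
    buffer.append((char) ch);
   // Strings are immutable, so the result of replaceAll must be assigned
   String s = buffer.toString()
     .replaceAll("&[a-zA-Z]{1,10};", "").replaceAll("<[^>]*>", "");
   System.out.println(s);
   rd.close();
   hConnect.disconnect();
   // The raw (unstripped) page is returned; only the printed copy is stripped
   return buffer.toString().trim();
  } catch (Exception e) {
   // return "Error: failed to read the page!";
   return null;
  }
}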
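
The first method notes that the charset used for decoding has to match the encoding of the page being fetched. One possible way to avoid hard-coding it is to read the charset from the Content-Type header of the response. The sketch below is an addition for illustration, not part of the original code: the class name CharsetSniffer and the method charsetFrom are hypothetical, and it assumes the server actually declares a charset in that header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetSniffer {

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the given fallback when no charset is declared or the name is not
    // supported by this JVM.
    public static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // A Content-Type value looks like "text/html; charset=gb2312"
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length())
                        .replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetFrom("text/html; charset=UTF-8", fallback));
        System.out.println(charsetFrom("text/html", fallback));
    }
}
```

With the second method, the header value would come from hConnect.getContentType(), and the resulting Charset could be passed as the second argument of InputStreamReader. Pages that declare their encoding only in a meta tag are not covered by this sketch.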

Then write a test class:

// Assumes both methods above are members of a class named createhttpClient
public static void main(String[] args) {
  String url = "http://www.phpstudy.net";
  String keyword = "phpstudy";
  createhttpClient p = new createhttpClient();
  // Method 1
  String response = p.createhttpClient(url, keyword);
  // Method 2
  // p.getPageContent(url, "post", 100500);
}

Now check the console: the content of the page should have been printed there.
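
The text printed to the console is what remains after two regular-expression replacements: one removes named entities, the other removes anything that looks like a tag. As a minimal sketch, that step can be isolated and tried on a small string without any network access. The class name HtmlText and the method stripTags are hypothetical names introduced here for illustration.

```java
public class HtmlText {

    // Hypothetical helper that applies the same two replacements as the
    // article: named entities first, then tags.
    public static String stripTags(String html) {
        return html
                .replaceAll("&[a-zA-Z]{1,10};", "") // named entities such as &nbsp;
                .replaceAll("<[^>]*>", "")          // anything between < and >
                .trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello, <b>world</b>&nbsp;</p>";
        System.out.println(stripTags(html)); // prints: Hello, world
    }
}
```

This approach is deliberately crude. It leaves the contents of script and style elements in place and ignores numeric entities such as &#160;, so a real HTML parser is likely a better fit for anything beyond a quick experiment.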

I hope this article is helpful for your Java programming.

Article URL: https://www.stayed.cn/item/11244

Please credit the source when reposting.
