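Tutorial: Scraping Dynamic Web Pages and Decoding JSON
This tutorial uses Flutter / Dart to show how to scrape data from a dynamic web page and decode the JSON into a Map or List. The same principle applies to Java and Python.
When web scraping comes up, Beautiful Soup is usually the first tool that comes to mind. But most web pages today are dynamic, and Beautiful Soup only parses the HTML it is given; it does not run the page's JavaScript, so on its own it only works for static pages.
Today I will explain how to scrape data from a dynamic page, using TheWeatherNetwork.com, one of the best weather websites, as the example.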
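1. Determine whether the page is dynamic or static
Look at the screenshot of The Weather Network website, which shows the current weather for Vancouver, BC, Canada.
This tutorial scrapes the three values circled in red in the screenshot: the current weather description, the current temperature, and the current "feels like" temperature.
Beautiful Soup can only analyze what is in the page source. Let's check whether the page source contains the current weather description, "A few clouds".
The browser used here is Google Chrome; Firefox has similar features.
Right-click an empty area of the page and choose "View page source".
A new tab opens. Press Ctrl+F and search for "A few clouds": the search reports 0/0 results.
This means TheWeatherNetwork is a dynamic website. "A few clouds" is not written in the page source, so it must be fetched from somewhere else. Our job is to find that "somewhere else".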
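The manual Ctrl+F check above can also be sketched in code. This is a minimal sketch, not part of the original app: `appearsInSource` is a hypothetical helper, and the HTML string is a hand-written stand-in for a page whose content is filled in by JavaScript. In a real check, the HTML would be the body of an HTTP response for the page.

```dart
// Returns true when [needle] occurs in the raw HTML (case-insensitive).
// Hypothetical helper for illustration only.
bool appearsInSource(String html, String needle) {
  return html.toLowerCase().contains(needle.toLowerCase());
}

void main() {
  // Hand-written stand-in for a dynamic page: the markup has an empty
  // placeholder and a script tag, but no weather text.
  const rawHtml =
      '<html><body><div id="obs"></div><script src="app.js"></script></body></html>';

  print(appearsInSource(rawHtml, 'A few clouds')); // false
  print(appearsInSource('<p>A few clouds</p>', 'a few clouds')); // true
}
```

If the text you see in the browser is missing from the raw HTML, as in the first call, the page is most likely loading it through a separate request.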
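Now go back to TheWeatherNetwork.com, right-click an empty area, and choose "Inspect".
If "Network" is not visible in the tab bar, click ">>" and then choose "Network".
The "Network" panel should now be empty, because the page has finished loading and there is no network activity. If entries are still listed, click "Clear" to remove them. Then press Ctrl+R to reload the page and click XHR to filter the list down to "XMLHttpRequest" entries only.
Many XHR entries now appear on the left. Click each one and then click "Preview" to take a quick look at its content. The entry below, for example, is clearly related to Amazon ads.
After a quick pass we find that most entries are ad-related requests. Only the first two or three carry weather data. The "cabc0308" entry is the one we are looking for. Note that "cabc0308" is the place code for Vancouver; other cities have different codes.
Now click "Headers", select the request URL, then right-click and choose to open it in a new tab.
A new page opens containing all the data we need. After a little reformatting, the data looks like this: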
{
"observation":
{
"time":{"local":"2020-08-28T17:45","utc":"2020-08-29T00:45"},
"weatherCode":
{"value":"SCT","icon":2,"text":"A few clouds","bgimage":"clearday","overlay":"sunny"},
"temperature":21,"dewPoint":15,"feelsLike":23,
"wind":{"direction":"W","speed":11,"gust":17},
"relativeHumidity":69,"pressure":{"value":101.5,"trendKey":1},
"visibility":32,"ceiling":10000
},
"display":
{
"imageUrl":"//s1.twnmm.com/images/en_ca/",
"unit":
{"temperature":"C","dewPoint":"C","wind":"km/h","relativeHumidity":"%",
"pressure":"kPa","visibility":"km","ceiling":"m"
}
}
}
Because the data begins with "{", it is a Map with two top-level keys: "observation" and "display". The value we are looking for, "A few clouds", is here:
"observation" - "weatherCode" - "text" - "A few clouds"
The current temperature is here:
"observation" - "temperature" - 21, which is an integer.
Now we can request this URL ourselves:
https://weatherapi.pelmorex.com/api/v1/observation/placecode/cabc0308
It takes only a few lines of code.
import 'package:http/http.dart' as http;
...
var placeCode = "cabc0308";
var _searchURL =
'https://weatherapi.pelmorex.com/api/v1/observation/placecode/' +
placeCode;
var response = await http.Client().get(Uri.parse(_searchURL));
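As an alternative to string concatenation, the same URL can be assembled with Dart's `Uri.https` constructor, which keeps the host and path separate. This is only a sketch: `observationUri` is a hypothetical helper name, and the host and path are copied from the request URL above.

```dart
// Builds the observation URL for a place code using Uri.https.
// Hypothetical helper; host and path come from the request URL in the text.
Uri observationUri(String placeCode) {
  return Uri.https(
    'weatherapi.pelmorex.com',
    '/api/v1/observation/placecode/$placeCode',
  );
}

void main() {
  final uri = observationUri('cabc0308');
  print(uri); // https://weatherapi.pelmorex.com/api/v1/observation/placecode/cabc0308
}
```

The resulting `Uri` can be passed directly to `get` in place of `Uri.parse(_searchURL)`.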
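2. Flutter: easy JSON decoding
Flutter's official guide to JSON decoding is the "JSON and serialization" page.
The approach used here is simple manual decoding, which is suitable for most small projects.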
First, convert the "response" obtained above into a Dart Map.
import 'dart:convert';
...
Map<String, dynamic> mResponse = json.decode(response.body);
There are several ways to read the value "A few clouds":
print(mResponse['observation']['weatherCode']['text']);
// or
var observation = Map<String, dynamic>.from(mResponse['observation'] ?? {});
print(observation['weatherCode']['text']);
// or
var weatherCode = Map<String, dynamic>.from(observation['weatherCode'] ?? {});
print(weatherCode['text']);
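Chained lookups such as `a['b']['c']` throw if an intermediate key is missing. One possible way around that, sketched here under the assumption that a missing value should simply yield null, is a small path-walking helper. `dig` is a hypothetical name, the JSON string is a trimmed copy of the response shown earlier, and "uvIndex" is a made-up key used only to show the missing-key case.

```dart
import 'dart:convert';

// Follows [path] through nested maps and returns the value found there,
// or null as soon as a key is missing. Hypothetical helper.
dynamic dig(dynamic root, List<String> path) {
  dynamic current = root;
  for (final key in path) {
    if (current is Map && current.containsKey(key)) {
      current = current[key];
    } else {
      return null;
    }
  }
  return current;
}

void main() {
  // Trimmed copy of the observation response shown earlier.
  const body =
      '{"observation":{"weatherCode":{"text":"A few clouds"},"temperature":21}}';
  final data = json.decode(body);

  print(dig(data, ['observation', 'weatherCode', 'text'])); // A few clouds
  print(dig(data, ['observation', 'temperature'])); // 21
  // "uvIndex" is a made-up key that is not in the data.
  print(dig(data, ['observation', 'uvIndex', 'value'])); // null
}
```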
In the same way, we can read the current temperature and the current "feels like" temperature.
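To make that concrete, here is a minimal sketch that pulls all three values out of a decoded response and formats them on one line. `summarize` is a hypothetical helper, and the JSON string is a trimmed copy of the response shown earlier; falling back to an empty map for a missing section is an assumption of this sketch.

```dart
import 'dart:convert';

// Formats the description, temperature and "feels like" temperature from a
// decoded observation response. Missing sections fall back to empty maps.
String summarize(Map<String, dynamic> mResponse) {
  final observation =
      Map<String, dynamic>.from(mResponse['observation'] ?? {});
  final weatherCode =
      Map<String, dynamic>.from(observation['weatherCode'] ?? {});
  final display = Map<String, dynamic>.from(mResponse['display'] ?? {});
  final unit = Map<String, dynamic>.from(display['unit'] ?? {});

  final text = weatherCode['text'];
  final temperature = observation['temperature'];
  final feelsLike = observation['feelsLike'];
  final symbol = unit['temperature'] ?? '';

  return '$text $temperature°$symbol Feels Like $feelsLike°$symbol';
}

void main() {
  // Trimmed copy of the observation response shown earlier.
  const body = '{"observation":{"weatherCode":{"text":"A few clouds"},'
      '"temperature":21,"feelsLike":23},'
      '"display":{"unit":{"temperature":"C"}}}';
  final mResponse = json.decode(body) as Map<String, dynamic>;

  print(summarize(mResponse)); // A few clouds 21°C Feels Like 23°C
}
```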
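One problem remains. The request URL above,
https://weatherapi.pelmorex.com/api/v1/observation/placecode/cabc0308
uses Vancouver's place code, "cabc0308". What if we do not know in advance which city is wanted? How can we obtain a city's code programmatically?
Back on the TheWeatherNetwork page, notice the search bar at the top. Type a city name and, without pressing Enter, the page lists many search results.
Return to the "Network" panel and type "toronto" into the search bar. An XHR entry appears right away. Click "Headers" to find the request URL we need.
The request URL obtained there, slightly simplified, becomes:
https://www.theweathernetwork.com/api/location/search?searchText=toronto&lat=&long=
Copy this URL and paste it into the browser to see the raw data it returns.
The data begins with "[", so it is a List whose elements are Maps. Next, decode the JSON into a Dart List: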
var _searchURL =
    'https://www.theweathernetwork.com/api/location/search?searchText=' +
        _cityInputValue +
        '&lat=&long=';
final response = await http.Client().get(Uri.parse(_searchURL));
if (response.statusCode == 200) { // connection successful
  setState(() {
    _saving = false;
  });
  cityList = <Map<String, dynamic>>[];
  var jList = json.decode(response.body) as List;
  jList.forEach((element) {
    // Fall back to an empty map so Map.from never receives a non-map value.
    var mElement = Map<String, dynamic>.from(element ?? {});
    if (mElement['type'] == 'city') {
      cityList.add(mElement);
    }
  });
}
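The snippet above concatenates the user's input straight into the URL, so a name containing a space or "&" could produce a malformed query. A minimal sketch of one way to handle that, together with the city filter as a standalone function, follows. `locationSearchUrl` and `cityEntries` are hypothetical names, the sample JSON is hand-written (only the Toronto code comes from this tutorial; the second entry is made up), and whether the server accepts "+" for spaces is an assumption.

```dart
import 'dart:convert';

// Builds the location-search URL with the city name encoded for a query string.
String locationSearchUrl(String city) {
  return 'https://www.theweathernetwork.com/api/location/search?searchText=' +
      Uri.encodeQueryComponent(city.trim()) +
      '&lat=&long=';
}

// Decodes a search response and keeps only the entries whose type is 'city'.
List<Map<String, dynamic>> cityEntries(String body) {
  final decoded = json.decode(body);
  final cities = <Map<String, dynamic>>[];
  if (decoded is List) {
    for (final element in decoded) {
      if (element is Map && element['type'] == 'city') {
        cities.add(Map<String, dynamic>.from(element));
      }
    }
  }
  return cities;
}

void main() {
  print(locationSearchUrl(' new york '));
  // https://www.theweathernetwork.com/api/location/search?searchText=new+york&lat=&long=

  // Hand-written sample; the second entry is made up for illustration.
  const sample = '[{"type":"city","name":"Toronto","code":"caon0696"},'
      '{"type":"park","name":"Example Park","code":"xxxx0000"}]';
  final cities = cityEntries(sample);
  print(cities.length); // 1
  print(cities.first['code']); // caon0696
}
```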
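These conditions identify Toronto, ON, Canada in the results:
"type" == "city" and "province" == "Ontario".
That gives us the place code for Toronto:
"code":"caon0696".
Below is the complete code, along with a live demo of this tutorial:
pubspec.yaml file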
name: flutter_web_scraping_dynamic
description: A new Flutter application.
publish_to: 'none' # Remove this line if you wish to publish to pub.dev
version: 1.0.0+1
environment:
  sdk: ">=2.7.0 <3.0.0"
dependencies:
  flutter:
    sdk: flutter
  cupertino_icons: ^0.1.3
  # TODO: add these dependencies
  modal_progress_hud: ^0.1.3
  http: ^0.12.2
dev_dependencies:
  flutter_test:
    sdk: flutter
flutter:
  uses-material-design: true
main.dart file
import 'package:flutter/material.dart';
import 'package:modal_progress_hud/modal_progress_hud.dart';
// This package shows a spinner while a slow page is loading
import 'package:http/http.dart' as http;
import 'dart:convert';
List<Map> cityList;
var cityIndex;
void main() {
runApp(MyApp());
}
class MyApp extends StatelessWidget {
@override
Widget build(BuildContext context) {
return MaterialApp(
title: 'Flutter Web Scraping Dynamic Demo',
theme: ThemeData(
primarySwatch: Colors.blue,
visualDensity: VisualDensity.adaptivePlatformDensity,
),
home: MyHomePage(title: 'Flutter Web Scraping Dynamic Demo'),
);
}
}
class MyHomePage extends StatefulWidget {
MyHomePage({Key key, this.title}) : super(key: key);
final String title;
@override
_MyHomePageState createState() => _MyHomePageState();
}
class _MyHomePageState extends State<MyHomePage> {
String _cityInputValue;
String _strSearchTips = '';
bool _saving = false; // controls the modal_progress_hud spinner
ListView _searchPage() {
return ListView(
padding: const EdgeInsets.all(8),
children: <Widget>[
ListTile(
title: Text('Enter your city name:'),
subtitle: Text('(e.g. Vancouver)'),
),
TextField(
onChanged: (value) {
_cityInputValue = value;
},
// add a decorating border
decoration: InputDecoration(
contentPadding: EdgeInsets.all(10.0),
border: OutlineInputBorder(
borderRadius: BorderRadius.circular(15.0),
)),
),
Container(
margin:
const EdgeInsets.only(left: 120, right: 120, top: 30, bottom: 20),
child: RaisedButton(
onPressed: () {
_locationSearch();
},
child: const Text(
'SEARCH',
style: TextStyle(fontSize: 16),
),
),
),
Text(_strSearchTips, // shows warning messages
style: TextStyle(
color: Colors.brown,
)),
],
);
}
@override
Widget build(BuildContext context) {
return Scaffold(
appBar: AppBar(
title: Text(widget.title),
),
body: ModalProgressHUD(child: _searchPage(), inAsyncCall: _saving),
// show the spinner while loading
);
} //widget build end
_locationSearch() async {
setState(() {
_strSearchTips = '';
_saving = true;
});
if (_cityInputValue != null) {
_cityInputValue = _cityInputValue.trim();
}
if (_cityInputValue == null || _cityInputValue == '') {
FocusScope.of(context).unfocus(); // dismiss the keyboard
setState(() {
_strSearchTips = '!!!Please enter a valid location name.';
_saving = false;
});
return;
}
var _searchURL =
'https://www.theweathernetwork.com/api/location/search?searchText=' +
_cityInputValue +
'&lat=&long=';
final response = await http.Client().get(Uri.parse(_searchURL));
if (response.statusCode == 200) { // request succeeded
setState(() {
_saving = false;
});
cityList = <Map<String, dynamic>>[];
var jList = json.decode(response.body) as List;
jList.forEach((element) {
// Fall back to an empty map so Map.from never receives a non-map value.
var mElement = Map<String, dynamic>.from(element ?? {});
if (mElement['type'] == 'city') {
cityList.add(mElement);
}
});
if (cityList.length == 0) {
FocusScope.of(context).unfocus();
setState(() {
_strSearchTips = '!!!No matching location found. Try another name.';
});
} else {
Navigator.push(
context,
MaterialPageRoute(builder: (context) => SelectResultRoute()),
);
}
} else {
// status code != 200: request failed
setState(() {
_saving = false;
});
throw Exception('Server busy. Try again later.');
}
} //_location search end
} //MyHomePage end
class SelectResultRoute extends StatelessWidget {
@override
Widget build(BuildContext context) {
return Scaffold(
appBar: AppBar(
title: Text("Flutter Web Scraping Dynamic Demo"),
),
body: Column(
children: <Widget>[
ListTile(
title: Text(
'Select your location below:',
style: TextStyle(color: Colors.brown),
),
),
Expanded(
child: ListView.builder(
padding: const EdgeInsets.all(8),
itemCount: cityList.length,
itemBuilder: (BuildContext context, int index) {
var city = cityList[index]['name'] +
' ' +
cityList[index]['provcode'] +
' ' +
cityList[index]['country'];
var colorIndex = ((index + 1) % 9) * 100; // background color shade
return Card(
color: Colors.blue[(colorIndex == 0) ? 50 : colorIndex],
child: ListTile(
title: Text(city),
onTap: () {
cityIndex = index;
Navigator.push(
context,
MaterialPageRoute(
builder: (context) => CurrentWeatherRoute()),
);
},
));
}, //item builder
),
)
],
),
);
} //widget build
}
class CurrentWeatherRoute extends StatefulWidget {
@override
_CurrentWeatherState createState() => _CurrentWeatherState();
}
class _CurrentWeatherState extends State<CurrentWeatherRoute> {
var _weather1, _weather2, _weather3; // the 3 values to display
bool _saving = false;
@override
void initState() {
super.initState();
_getCurrentWeather();
}
Future _getCurrentWeather() async {
var placeCode = cityList[cityIndex]['code'] ?? '';
var _searchURL =
'https://weatherapi.pelmorex.com/api/v1/observation/placecode/' +
placeCode;
var response = await http.Client().get(Uri.parse(_searchURL));
if (response.statusCode != 200) {
throw Exception('Server busy. Try again later.');
} else {
Map<String, dynamic> mResponse = json.decode(response.body);
var observation =
Map<String, dynamic>.from(mResponse['observation'] ?? {});
var weatherCode =
Map<String, dynamic>.from(observation['weatherCode'] ?? {});
_weather1 = weatherCode['text'];
_weather2 = observation['temperature'].toString() +
'°' +
mResponse['display']['unit']['temperature'];
_weather3 = observation['feelsLike'].toString() +
'°' +
mResponse['display']['unit']['temperature'];
}
setState(() { // refresh the page once all the data has arrived
_saving = false;
});
}
Widget _result() {
return ListView(
children: <Widget>[
ListTile(
title: Text(cityList[cityIndex]['name'] +
' ' +
cityList[cityIndex]['provcode'] +
' ' +
cityList[cityIndex]['country']),
),
ListTile(title: Text('Current Weather:')),
ListTile(
title: Row(
children: <Widget>[
Text(_weather1 ?? ''), // _weather1 is null until the data arrives,
// so show an empty string for now
Text(' '),
Text(_weather2 ?? ''),
Text(' Feels Like '),
Text(_weather3 ?? ''),
],
),
),
],
);
}
@override
Widget build(BuildContext context) {
return Scaffold(
appBar: AppBar(
title: Text("Flutter Web Scraping Dynamic Demo"),
),
body: ModalProgressHUD(child: _result(), inAsyncCall: _saving),
);
}
}





