WEB Data Mining (16): Aperture Data Extraction (9): Data Sources
Published: 2019-06-27


One of the central concepts of Aperture is the notion of a DataSource. A DataSource contains all information necessary to locate the individual information resources in a physical source. For example, a FileSystemDataSource holds a root directory, a set of patterns that describe what files to include or exclude, a maximum depth, etc., thereby effectively describing a set of files.
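To make the idea of "describing a set of files" concrete, here is a minimal sketch in plain Java. It does not use Aperture: the class FileSetDescription, its methods, the use of regular expressions as patterns, and the reading of "maximum depth" as the number of path elements below the root are all assumptions made for illustration, and Aperture's own semantics may differ.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; this is NOT Aperture's FileSystemDataSource.
final class FileSetDescription {
    private final Path root;
    private final int maxDepth; // assumed meaning: max number of path elements below the root
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    FileSetDescription(String rootFolder, int maxDepth) {
        this.root = Paths.get(rootFolder).normalize();
        this.maxDepth = maxDepth;
    }

    FileSetDescription include(String regex) {
        includes.add(Pattern.compile(regex));
        return this;
    }

    FileSetDescription exclude(String regex) {
        excludes.add(Pattern.compile(regex));
        return this;
    }

    // Decides membership from the description alone; no file is opened or needs to exist.
    boolean describes(String file) {
        Path path = Paths.get(file).normalize();
        if (path.equals(root) || !path.startsWith(root)) {
            return false;
        }
        Path relative = root.relativize(path);
        if (relative.getNameCount() > maxDepth) {
            return false;
        }
        String name = relative.toString().replace('\\', '/');
        boolean included = includes.isEmpty()
                || includes.stream().anyMatch(p -> p.matcher(name).matches());
        boolean excluded = excludes.stream().anyMatch(p -> p.matcher(name).matches());
        return included && !excluded;
    }

    public static void main(String[] args) {
        FileSetDescription docs = new FileSetDescription("/data/docs", 2)
                .include(".*\\.txt")
                .exclude("tmp/.*");
        System.out.println(docs.describes("/data/docs/a.txt"));
        System.out.println(docs.describes("/data/docs/tmp/b.txt"));
    }
}
```

The description is pure configuration: it can answer whether a path belongs to the set without touching the file system, which is the sense in which a data source "describes" its contents.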

One of the main purposes of a DataSource is to hold all the data a Crawler needs to crawl the physical source and retrieve the individual resources in it. Aperture has quite a few DataSource subclasses. The following diagram contains a selection of them.

The DataSource implementations currently available provide specific 'get' and 'set' methods for the configuration properties the data source accepts, which gives a convenient interface and abstracts from the underlying RDF properties. All configuration data is stored in an RDFContainer. Each data source type comes with its own specific properties, and there is also a set of generic properties used by many data source types (username, password, etc.). To see which properties are used, look at the source code of the DataSource implementation class of your choice. Note that the data source classes are not stored in SVN. They are generated automatically from an RDF file that describes the class; the generation is set up by adding the appropriate entries to the datasource module's pom.xml file. If you would like to develop your own data source implementation, try to mimic the existing implementations or ask on aperture-devel for help.
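The pattern described above, typed accessors layered over a generic property store, can be sketched as follows. This is plain Java and not Aperture code: PropertyStore is a hypothetical stand-in for the RDFContainer, SketchDataSource is a hypothetical data source class, and the urn:example: property identifiers are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a generic configuration store (Aperture uses an RDFContainer).
final class PropertyStore {
    private final Map<String, String> values = new HashMap<>();

    void put(String property, String value) {
        values.put(property, value);
    }

    String get(String property) {
        return values.get(property);
    }
}

// Hypothetical data source: typed getters and setters hide the property identifiers.
final class SketchDataSource {
    // Made-up property identifiers, used only in this sketch.
    static final String ROOT_FOLDER = "urn:example:rootFolder";
    static final String MAXIMUM_DEPTH = "urn:example:maximumDepth";

    private PropertyStore configuration;

    void setConfiguration(PropertyStore configuration) {
        this.configuration = configuration;
    }

    PropertyStore getConfiguration() {
        return configuration;
    }

    void setRootFolder(String rootFolder) {
        configuration.put(ROOT_FOLDER, rootFolder);
    }

    String getRootFolder() {
        return configuration.get(ROOT_FOLDER);
    }

    void setMaximumDepth(int depth) {
        configuration.put(MAXIMUM_DEPTH, Integer.toString(depth));
    }

    // Returns null when the property has not been set.
    Integer getMaximumDepth() {
        String raw = configuration.get(MAXIMUM_DEPTH);
        return raw == null ? null : Integer.valueOf(raw);
    }

    public static void main(String[] args) {
        SketchDataSource source = new SketchDataSource();
        source.setConfiguration(new PropertyStore());
        source.setRootFolder("/data/docs");
        source.setMaximumDepth(3);
        System.out.println(source.getRootFolder() + " " + source.getMaximumDepth());
    }
}
```

The data source object keeps no fields of its own besides the store, so everything it knows can be read back from the configuration. That mirrors the design the text describes, with the caveat that the real classes are generated and work on RDF properties.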

It is worth mentioning that DataSource classes only describe a data source. They do not contain anything that would give direct access to the source, such as InputStreams or Readers (at least, that was not the designers' intention). Any such resource is obtained by the crawler at the start of the crawl and may be encapsulated in a DataObject returned by an Accessor or Crawler. The following code demonstrates how to create and configure a FileSystemDataSource:

import java.io.File;
import org.ontoware.rdf2go.RDF2Go;
import org.ontoware.rdf2go.model.Model;
import org.ontoware.rdf2go.model.node.URI;
import org.semanticdesktop.aperture.datasource.filesystem.FileSystemDataSource;
import org.semanticdesktop.aperture.rdf.RDFContainer;
import org.semanticdesktop.aperture.rdf.impl.RDFContainerImpl;

// determine the root folder of the source
File rootFolder = new File("D:\\path\\to\\the\\root\\folder");

// create the model that will store the data source configuration
Model model = RDF2Go.getModelFactory().createModel();

// don't forget to open it before use
model.open();

// determine a URI to identify the DataSource
URI id = model.createURI("urn:test:testsource");

// wrap the model in an RDFContainer
RDFContainer configuration = new RDFContainerImpl(model, id);

// create the DataSource instance
FileSystemDataSource source = new FileSystemDataSource();

// set the configuration (it is empty at the moment)
source.setConfiguration(configuration);

// and set the rootFolder (you can do it now)
source.setRootFolder(rootFolder.getAbsolutePath());
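The separation between describing a source and accessing it can be sketched without Aperture as well. In the plain-Java example below, FolderDescription and SketchCrawler are hypothetical names, not Aperture classes. The description holds only a path string, and the crawler opens each stream when it crawls and closes it again right away.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical description: plain configuration values, no open handles.
final class FolderDescription {
    final String rootFolder;

    FolderDescription(String rootFolder) {
        this.rootFolder = rootFolder;
    }
}

// Hypothetical crawler: the only place where streams are opened.
final class SketchCrawler {
    // Returns one "name:byteCount" entry per regular file directly inside the folder.
    static List<String> crawl(FolderDescription source) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(Paths.get(source.rootFolder))) {
            files = listing.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> report = new ArrayList<>();
        byte[] buffer = new byte[4096];
        for (Path file : files) {
            long bytes = 0;
            // The stream exists only for the duration of this crawl step.
            try (InputStream in = Files.newInputStream(file)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    bytes += n;
                }
            }
            report.add(file.getFileName() + ":" + bytes);
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sketch");
        Files.write(dir.resolve("a.txt"), "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(crawl(new FolderDescription(dir.toString())));
    }
}
```

Because the description carries no open resources, it can be created, stored, and passed around long before any crawl takes place.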

Reposted from: http://mxycl.baihongyu.com/
