Importing Data into Neo4j with the Official ETL Tool LOAD CSV

Using Neo4j LOAD CSV

A Quick First Example

The example below imports the CSV file at http://data.neo4j.com/examples/person.csv into Neo4j. The field delimiter is ','; only rows whose first field is not null are imported; id is used as the unique key for the MERGE; name is set from the second field; and a commit is issued every 10,000 rows.

USING PERIODIC COMMIT 10000
LOAD CSV FROM "http://data.neo4j.com/examples/person.csv" AS line
FIELDTERMINATOR ','
WITH line
WHERE line[0] IS NOT NULL
MERGE (n:Person {id: toInt(line[0])})
SET n.name = line[1]
RETURN n


1. LOAD CSV FROM "http://data.neo4j.com/examples/person.csv" reads the file; the path can be a URL, a relative local path, or an absolute local path.
2. FIELDTERMINATOR sets the field delimiter.
3. MERGE creates the node if it does not exist and matches it if it does, so the SET that follows updates existing nodes.


Import Steps

1. Prepare the data: make sure it is UTF-8 without BOM, free of character-escaping problems, and free of dirty records.
2. Count the rows. If the count differs from what you expect, re-check the data or the Cypher statement.
3. Spot-check the data: return the first 10 rows and verify the format, the field alignment, and that nothing is garbled.
4. Import the data.

If you are already very familiar with the process you can jump straight to step 4; otherwise it is best to run steps 1-4 in order.


(1) Things to Check Before Importing

1. The CSV must be UTF-8 without BOM, otherwise the imported text may be garbled.
2. ',' is the default delimiter; a custom one can be set with FIELDTERMINATOR.
3. Always validate the data before importing — this is hard-won experience.
4. Absolute paths are recommended; to use them, comment out dbms.directories.import=import in neo4j.conf (prefix the line with #).


(2) Count the Rows in the CSV

Mind the file path.

(2.1) Windows, relative path

Test.csv is placed at ${neo4j_home}/import/Test.csv

LOAD CSV FROM "file:/Test.csv" 
AS line
RETURN COUNT(*);

(2.2) Windows, absolute path

The file is placed at C:/User/wdb/2017-04-06_test.csv

LOAD CSV FROM  "file:///C:/User/wdb/2017-04-06_test.csv" 
AS line
RETURN COUNT(*);

(2.3) Linux, relative path

2017-04-06_test.csv is placed at ${neo4j_home}/import/2017-04-06_test.csv

LOAD CSV FROM "file:/2017-04-06_test.csv" 
AS line
RETURN COUNT(*);

(2.4) Linux, absolute path

LOAD CSV FROM "file:/home/wkq/databases/data/2017-04-06_test.csv" 
AS line
RETURN COUNT(*);


(3) Validate the Data

Mind the file path.
Use UTF-8 encoding, otherwise the text may come out garbled.

(3.1) Check the data (without header)

LOAD CSV FROM "file:///C:/User/wdb/Test.csv" 
AS line
WITH line
RETURN line
LIMIT 5

(3.2) Check the data (with header)

LOAD CSV WITH HEADERS FROM "file:///C:/User/wdb/Test.csv" 
AS line
WITH line
RETURN line
LIMIT 5


(4) Import the Data

(4.1) Import data: CSV with a header row

Note the difference between the WITH HEADERS form and the headerless form: one references fields as line.name, the other as line[0].
Conversion functions such as toInt('1') and toFloat('1.0') can be used during import; newer Neo4j versions use toInteger(), toFloat(), and split().

LOAD CSV WITH HEADERS FROM "file:///C:/User/wdb/Test.csv" 
AS line
WITH line
CREATE (:Person {name:line.name, number:toInt(line.number), tel:toInt(line.tel), sex:line.sex, class:line.classmate}) ;

(4.2) Import data: CSV without a header row

LOAD CSV FROM "file:///C:/User/wdb/Test.csv" 
AS line
WITH line
CREATE (:Person {name:line[0], number:toInt(line[1]), tel:toInt(line[2]), sex:line[3], class:line[4]}) ;


(5) Notes

(5.1) Batch commits

Once the data volume grows, use batched commits.

USING PERIODIC COMMIT 10000 commits the transaction every 10,000 rows.

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (pet:Pet {petId: row.PetId})
MERGE (owner:Owner {ownerId: row.OwnerId})
ON CREATE SET owner.name = row.OwnerName
MERGE (pet)-[r:OWNED_BY]->(owner)

(5.2) Handling null values

//skip null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row WHERE row.Company IS NOT NULL
MERGE (c:Company {companyId: row.Id})

//set default for null values
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id, hqLocation: coalesce(row.Location, "Unknown")})

//change empty strings to null values (not stored)
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.Id})
SET c.emailAddress = CASE trim(row.Email) WHEN "" THEN null ELSE row.Email END

(5.3) Using split()

//split string of employee skills into separate nodes
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.Id})
UNWIND split(row.skills, ',') AS skill
MERGE (s:Skill {name: skill})
MERGE (e)-[r:HAS_EXPERIENCE]->(s);

(5.4) Using CASE

//set businessType property based on shortened value in CSV
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
WITH row,
(CASE row.BusinessType
WHEN 'P' THEN 'Public'
WHEN 'R' THEN 'Private'
WHEN 'G' THEN 'Government'
ELSE 'Other' END) AS type
MERGE (c:Company {companyId: row.CompanyId})
SET c.businessType = type
RETURN *

(5.5) MERGE relationships

For best performance, always MATCH and MERGE on a single label with an indexed primary-key property.
You should also separate node creation and relationship creation into separate statements.

Both of the following approaches work; the first uses more memory, and the second performs somewhat better when memory is limited.

MERGE (e:Employee {employeeId: row.employeeId})
MERGE (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)

The official guide recommends creating indexes on the specific label properties used with MATCH and MERGE, and writing node creation and relationship creation as separate statements.

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (e:Employee {employeeId: row.employeeId})
RETURN count(e);

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MERGE (c:Company {companyId: row.companyId})
RETURN count(c);

LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId})
MATCH (c:Company {companyId: row.companyId})
MERGE (e)-[r:WORKS_FOR]->(c)
RETURN count(*);

The official wording:

To improve inserting or updating unique entities into your graph (using MERGE or MATCH with updates), you can create indexes and constraints declared for each of the labels and properties you plan to merge or match on.

For best performance, always MATCH and MERGE on a single label with the indexed primary-key property.
You should also separate node and relationship creation into separate statements.

This way, the load is only doing one piece of the import at a time and can move through large amounts of data quickly and efficiently, reducing heavy processing.

When the amount of data being loaded is too much to fit into memory, there are a couple of different approaches you can use to combat running out of memory during the data load.


(6) Performance Tests

Test 1

neo4j-sh (?)$ using periodic commit 10000 load csv with headers from "file:/data/stale/data01/neo4j/node_uuid_10w.csv" as line with line create (:Test {uuid:line.uuid, name:line.name});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 100000
Properties set: 200000
Labels added: 100000
3412 ms

Test 2

The file /data/stale/data01/neo4j/node_uuid_1kw.csv has 10,000,010 lines including the header, i.e. 10,000,009 data rows.

neo4j-sh (?)$ load csv from "file:/data/stale/data01/neo4j/node_uuid_1kw.csv" as line return count(*);
+----------+
| count(*) |
+----------+
| 10000010 |
+----------+
1 row
7434 ms
neo4j-sh (?)$ using periodic commit 10000 load csv with headers from "file:/data/stale/data01/neo4j/node_uuid_1kw.csv" as line with line create (:Test {uuid:line.uuid, name:line.name});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 10000009
Properties set: 20000018
Labels added: 10000009
151498 ms

Server: 32 cores, 256 GB RAM, 1 TB spinning disk.
During the import, CPU usage stayed around 150%, resident memory around 5 GB, and virtual memory around 70 GB.

(7) Problems You May Run Into

(7.1) Neo.DatabaseError.General.UnknownError

Neo.DatabaseError.General.UnknownError: At /home/usdp/databases/data/news.csv:172393 - there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: '最后一公里"'
This is a quote-escaping problem: switch to a different delimiter or drop the offending row.

The statement after changing the delimiter: load csv from "file:/home/usdp/databases/data/news.csv" as line fieldterminator ';' return count(*);

This is a data-format problem introduced when the data was generated or prepared. There is no shortcut; the broken rows have to be cleaned up.

(7.2) Couldn’t load the external resource

WARNING: Couldn't load the external resource at: file:/home/usdp/databases/data/news.csv
The path is wrong, or the file is not readable (permissions).

(7.3) WARNING: Invalid input

WARNING: Invalid input ')': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or '}'
There is a syntax error in the Cypher statement.

(7.4) WARNING: Variable n not defined

There is an error in the Cypher statement: the variable n is not defined.

(7.5) WARNING: Expected 0 to be a java.lang.String, but it was a java.lang.Long

If you LOAD CSV WITH HEADERS, reference fields by header name (line.name); if you load without headers, reference them by index (line[0], line[1]).
Mixing the two styles causes the error above.

(8) Official English Documentation


LOAD CSV is great for importing small – medium sized data, i.e. up to the 10M records range. For large data sets, i.e. in the 100B records range, we have access to a specialized bulk importer.

CSV Data Quality
Real World Data Considerations

Real world data is messy. Don't assume that what someone told you is in a CSV file is actually in there; don't rely on format descriptions, consistency, or correct quoting. Only trust data validity that you have checked yourself.
Common Pitfalls

BOM byte order mark (3 UTF-8 bytes) at the beginning of a file ← remove them
Binary zeros or other non-text characters dispersed throughout the file ← remove them
Inconsistent line breaks – mixed Windows and Unix linebreaks ← make sure they are consistent, best choose Unix style
Header inconsistent with data (missing, too many columns, different delimiter in header) ← fix headers
Special characters in non-quoted text ← make sure unusual text is always quoted
Unexpected newlines in quoted and unquoted text-fields ← either quote text or remove newlines
Stray quotes – standalone double or single quotes in the middle of non-quoted text, or non-escaped quotes in quoted text ← escape or remove stray quotes
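The first three pitfalls above can be fixed mechanically before import. A minimal cleanup sketch (assuming UTF-8 input; the function name is illustrative) that strips the BOM, removes binary zeros, and normalizes line breaks to Unix style:

```python
def clean_csv_bytes(data: bytes) -> bytes:
    """Strip a UTF-8 BOM, drop NUL bytes, and normalize CRLF to LF."""
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]                    # remove the byte order mark
    data = data.replace(b"\x00", b"")      # remove binary zeros
    data = data.replace(b"\r\n", b"\n")    # unify line breaks to Unix style
    return data
```

The remaining pitfalls (header mismatches, stray quotes) need the kind of manual inspection described above.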

// assert correct line count
LOAD CSV FROM "file-url" AS line
RETURN count(*);

// check first few raw lines
LOAD CSV FROM "file-url" AS line WITH line
RETURN line
LIMIT 5;

// check first 5 line-sample with header-mapping
LOAD CSV WITH HEADERS FROM "file-url" AS line WITH line
RETURN line
LIMIT 5;

LOAD CSV for Medium Sized Datasets
The real secret of LOAD CSV.

It is not just your basic data ingestion mechanism, but actually an ETL Power Tool. Why?

It combines multiple aspects in a single operation:

supports loading / ingesting CSV data from an URI
direct mapping of input data into complex graph/domain structure
data conversion
supports complex computations
create or merge data, relationships and structure

Important Tips for LOAD CSV

Always use the latest version of Neo4j; it will most probably be faster than earlier ones.

Data Quality and Conversion

See the data quality section above
Empty fields have to be skipped or replaced with default values during LOAD CSV
All data from the CSV file is read as a string, you have to use toInt, toFloat, split or similar functions to convert
Split arrays in a cell by delimiter using split (combine with extract for conversions)
Check your Cypher import statement for typos: labels, property names and relationship-types are case-sensitive
Conditional conversions can be achieved with CASE

Indexing and Performance

Make sure to have indexes and constraints declared and ONLINE for entities you want to MATCH or MERGE on
Always MATCH and MERGE on a single label and the indexed primary-key property
Prefix your load statements with USING PERIODIC COMMIT 10000
If possible, separate node creation from relationship creation into different statements
If your import is slow or runs into memory issues, see Mark’s blog post on Eager loading.

Memory Config

Make sure to have enough memory (at least 4G heap in neo4j-wrapper.conf) reserved for your Neo4j-Server or Neo4j-Shell (export JAVA_OPTS="-Xmx4G")
Configure the memory mapping settings (Neo4j 2.1) according to your expected file sizes
(Neo4j 2.1 neo4j.properties keep these ratios: nodestore=100M, relationshipstore=2G, propertystore=500M, stringstore=500M)

File-URLs and Neo4j-Shell

Make sure to use the right URLs, esp. file URLs. On OS X and Unix use file:///path/to/data.csv; on Windows, please use file:c:/path/to/data.csv
Use the bin/neo4j-shell instead of the browser for better error messages and control
by default it connects to a running server, but you can also use bin/neo4j-shell -path import.db -config conf/neo4j.properties for direct database directory access (when no server is running with that db).

Step by Step Example for LOAD CSV

In our guide on ETL import from a relational database we explain how to import CSV data step by step, from data modeling, creating indexes to writing the individual LOAD CSV statements.
Webinar “LOAD CSV in the Real World”

In this very hands-on webinar Nicole White, Neo Technology’s Data Scientist, shows how to use LOAD CSV to import a real world dataset (consumer complaints from consumerfinance.gov) into Neo4j.

After a quick modeling discussion she walks through the steps of preparing indexes and constraints and then imports one part of the dataset at a time into Neo4j.


References

[1] Neo4j official guide: importing CSV
[2] Neo4j Cypher: avoiding the Eager
[3] Using LOAD CSV to import Git history into Neo4j
[4] LOAD CSV into Neo4j quickly and successfully
[5] How to use a CSV field to define the node label in a LOAD statement
[6] How to load CSV files into SPSS (variable and value labels)