I want to import a list of international place names and their locations into a SQL Server database using ogr2ogr.
The source data is a CSV file encoded in UTF-8.
I've prepared a five-record sample file called london_points.csv
. It contains the records for London in Arabic, Czech, English. Russian, and Chinese. It looks like this:
LanguageCode,Name,Longitude,Latitude
"ar","لندن","-0.143555","51.577222"
"cz","Londýn","-0.143555","51.577222"
"en","London","-0.143555","51.577222"
"ru","Лондон","-0.143555","51.577222"
"zw","倫敦","-0.143555","51.577222"
To import the CSV file using ogr2ogr I created a VRT file called london_points.vrt
. It tells ogr2ogr to import the data and derive a point column from the Longitude and Latitude columns. It looks like this:
london_points.csv
wkbPoint
WGS84
ogr2ogr apparently successfully imports all five records into tempdb:
$ ogr2ogr --DEBUG ON -s_srs EPSG:4326 -lco GEOM_TYPE=geography -overwrite -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes" london_points.vrt
OGR: OGROpen(london_points.vrt/014AFF40) succeeded as VRT.
OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes")
ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes/0338CFF0) succeeded as MSSQLSpatial.
OGR: OGROpen(london_points.csv/033A41C8) succeeded as CSV.
MSSQLSpatial: DeleteLayer(london_points)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR2OGR: 5 features written in layer 'london_points'
ODBC: SQLDisconnect()
VRT: 5 features read on layer 'london_points'.
CSV: 5 features read on layer 'london_points'.
I use a query like this to inspect the results in the database:
SELECT languagecode, name, longitude, latitude
FROM tempdb.dbo.london_points;
The result set looks like this:
languagecode name longitude latitude
------------ ------------ --------- ---------
ar لندن -0.143555 51.577222
cz Londýn -0.143555 51.577222
en London -0.143555 51.577222
ru Лондон -0.143555 51.577222
zw 倫敦 -0.143555 51.577222
The result set contains the correct number of rows. The languagecode
, longitude
, and latitude
columns, which contain only ASCII characters, contain the correct values.
But all the non-ASCII characters in the name
column are mangled!
The main ogr2ogr documentation mentions no switch that controls character encoding.
The CSV driver documentation states only that "All CSV files are treated as UTF-8 encoded."
Is this a limitation of ogr2ogr, or am I missing some magic switch somewhere?
Is there a problem with the GDAL SQL Server driver configuration?
Andre Joost asked: "Maybe it is a problem at the GDAL SQL Server driver site. Have you looked at msdn.microsoft.com/en-us/library/ms130822.aspx, AutoTranslate section?"
The two documented values for AutoTranslate
are yes
and no
. It's not explict, but I think yes
is the default value.
SQL Server automatically maps the code points for extended characters between different ANSI (single-byte) character sets. If you disable the translation behaviour, character data could get mangled if the client and the server are using different code pages.
The stored data type is Unicode (nvarchar) rather than ANSI (varchar) so I don't think setting would have an effect here.
I'll import the data once with translation explicitly disabled and again with it explicitly enabled to see if there is a difference.
In this example, the connection string contains autotranslate=no
:
$ ogr2ogr --DEBUG ON -s_srs EPSG:4326 -lco GEOM_TYPE=geography -overwrite -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=no" london_points.vrt
OGR: OGROpen(london_points.vrt/012AFF40) succeeded as VRT.
OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes;autotranslate=no")
ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes;autotranslate=no)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=no/0335E048) succeeded as MSSQLSpatial.
OGR: OGROpen(london_points.csv/033732D8) succeeded as CSV.
MSSQLSpatial: DeleteLayer(london_points)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR2OGR: 5 features written in layer 'london_points'
ODBC: SQLDisconnect()
VRT: 5 features read on layer 'london_points'.
CSV: 5 features read on layer 'london_points'.
SELECT languagecode, name, longitude, latitude
FROM tempdb.dbo.london_points;
languagecode name longitude latitude
------------ ------------ --------- ---------
ar لندن -0.143555 51.577222
cz Londýn -0.143555 51.577222
en London -0.143555 51.577222
ru Лондон -0.143555 51.577222
zw 倫敦 -0.143555 51.577222
In this example the connection string contains autotranslate=yes
:
$ ogr2ogr --DEBUG ON -s_srs EPSG:4326 -lco GEOM_TYPE=geography -overwrite -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=yes" london_points.vrt
OGR: OGROpen(london_points.vrt/02DCFF40) succeeded as VRT.
OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes;autotranslate=yes")
ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes;autotranslate=yes)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes;autotranslate=yes/0344E048) succeeded as MSSQLSpatial.
OGR: OGROpen(london_points.csv/034632D8) succeeded as CSV.
MSSQLSpatial: DeleteLayer(london_points)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR2OGR: 5 features written in layer 'london_points'
ODBC: SQLDisconnect()
VRT: 5 features read on layer 'london_points'.
CSV: 5 features read on layer 'london_points'.
SELECT languagecode, name, longitude, latitude
FROM tempdb.dbo.london_points;
languagecode name longitude latitude
------------ ------------ --------- ---------
ar لندن -0.143555 51.577222
cz Londýn -0.143555 51.577222
en London -0.143555 51.577222
ru Лондон -0.143555 51.577222
zw 倫敦 -0.143555 51.577222
The results are identical.
What SQL statements does ogr2ogr send to server?
I traced the import using SQL Server Profiler to capture the SQL statements that ogr2ogr uses to insert data in to the london_points
table.
The trace captured these insert statements:
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ar', 'لندن', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'cz', 'Londýn', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'en', 'London', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ru', 'Лондон', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'zw', '倫敦', '-0.143555', '51.577222');
go
For clarity I split each INSERT VALUES statement across two lines.
Inspecting the SQL reveals two problems.
ogr2ogr mangles the values for the name
column before they reach the server. This proves the problem lies with ogr2ogr.
ogr2ogr passes the values as varchar literals ('abc'
) rather than nvarchar literals (N'abc'
). Even if the query text contained the correct characters, the server would replace with a question mark any character that don't exist in the default code page.
This query shows the difference:
SELECT 'لندن' AS varchar_value, N'لندن' AS nvarchar_value
UNION ALL
SELECT 'Londýn', N'Londýn'
UNION ALL
SELECT 'London', N'London'
UNION ALL
SELECT 'Лондон', N'Лондон'
UNION ALL
SELECT '倫敦', N'倫敦';
varchar_value nvarchar_value
------------- --------------
???? لندن
Londýn Londýn
London London
?????? Лондон
?? 倫敦
My default code page is Windows-1252. It contains characters for English and Czech, so varchar string literals are fine for these languages. Arabic, Russian, and Traditional Chinese supported only in Unicode, so all the characters are replaced by question marks.
I believe now that this is a fault in the OGR MSSQLSpatial driver.
Answer
I have same problem, my understanding is that the stored data is correct, if you consider it a binary utf-8 encoded data, to convert this data to nvarchar i wrote this (partially tested) UDF function in sql server:
CREATE FUNCTION varbinUTF8_to_nvarchar ( @data VARBINARY(MAX) )
RETURNS NVARCHAR(MAX)
AS
BEGIN
DECLARE @n INT
DECLARE @buffer NVARCHAR(MAX)
SET @buffer = N''
SET @n = DATALENGTH(@data)
DECLARE @i INT
SET @i = 0
DECLARE @byte INT
DECLARE @byte2 INT
DECLARE @byte3 INT
DECLARE @byte4 INT
DECLARE @decoded NCHAR
WHILE @i < @n
BEGIN
SET @i = @i + 1
SET @byte = SUBSTRING(@data, @i, 1)
IF ( @byte < 0x7f )
SET @decoded = NCHAR(@byte)
ELSE
IF @byte & 0xe0 = 0xc0
BEGIN
SET @i = @i + 1
SET @byte2 = SUBSTRING(@data, @i, 1)
SET @decoded = NCHAR(( ( @byte & 0x1F ) * 0x40 )
+ ( @byte2 & 0x3f ))
END
ELSE
IF @byte & 0xf0 = 0xe0
BEGIN
SET @i = @i + 1
SET @byte2 = SUBSTRING(@data, @i, 1)
SET @i = @i + 1
SET @byte3 = SUBSTRING(@data, @i, 1)
SET @decoded = NCHAR(( ( @byte & 0xF )
* 0x1000 ) + ( ( @byte2
& 0x3f ) * 0x40 )
+ ( (@byte3 & 0x3f) ))
END
ELSE
IF @byte & 0xf8 = 0xf0
BEGIN
SET @i = @i + 1
SET @byte2 = SUBSTRING(@data, @i, 1)
SET @i = @i + 1
SET @byte3 = SUBSTRING(@data, @i, 1)
SET @i = @i + 1
SET @byte4 = SUBSTRING(@data, @i, 1)
/*
EDIT 2014-11-27 THIS PART WERE WRONG
SET @decoded = NCHAR(( ( @byte & 0xF )
* 0x1000 )
+ ( ( @byte2 & 0x3f )
* 0x40 )
+ ( ( @byte3 & 0x3f )
* 0x40 )
+ ( (@byte4 & 0x3f) ))
FOLLOWS PATCH, but sql server does not support unicode chars > 16 bit
*/
SET @decoded = NCHAR(( ( @byte & 0x7 )
* 0x40000)
+ ( ( @byte2 & 0x3f )
* 0x1000 )
+ ( ( @byte3 & 0x3f )
* 0x40 )
+ ( (@byte4 & 0x3f) ))
END
ELSE
BEGIN
return N'BAD UTF-8 MESSAGE'
END
SET @buffer = @buffer + @decoded
END
return @buffer
END
Then i can convert the wrong formatted field using something like:
SELECT
dbo.varbinUTF8_to_nvarchar(CONVERT(VARBINARY(MAX), name_1) )
FROM
dbo.che_adm1
In your specific case you should be able to get nvarchar data doing :
SELECT languagecode, dbo.varbinUTF8_to_nvarchar(CONVERT(VARBINARY(MAX), name) as name, longitude, latitude
FROM tempdb.dbo.london_points;
Best Regards Andrea
No comments:
Post a Comment