Break up the data in a database column of a record into multiple records
Break up the data in a database column of a record into multiple records
Azure SQL Server - we have a table like this:
MyTable:
ID Source ArticleText
-- ------ -----------
1 100 <nvarchar(max) field with unstructured text from media articles>
2 145 "
3 866 "
4 232 "
ID
column is the primary key and auto-increments on INSERTS
.
ID
INSERTS
I run this query to find the records with the largest data size in the ArticleText
column:
ArticleText
SELECT TOP 500
ID, Source, DATALENGTH(ArticleText)/1048576 AS Size_in_MB
FROM
MyTable
ORDER BY
DATALENGTH(ArticleText) DESC
We are finding that for many reasons both technical and practical, the data in the ArticleText
column is just too big in certain records. The above query allows me to look at a range of sizes for our largest records, which I'll need to know for what I'm trying to formulate here.
ArticleText
The feat I need to accomplish is, for all existing records in this table, any record whose ArticleText
DATALENGTH
is greater than X, break that record into X amount of records where each record will then contain the same value in the Source
column, but have the data in the ArticleText
column split up across those records in smaller chunks.
ArticleText
DATALENGTH
Source
ArticleText
How would one achieve this if the exact requirement was say, take all records whose ArticleText
DATALENGTH
is greater than 10MB, and break each into 3 records where the resulting records' Source
column value is the same across the 3 records, but the ArticleText
data is separated into three chunks.
ArticleText
DATALENGTH
Source
ArticleText
In essence, we would need to divide the DATALENGTH
by 3 and apply the first 1/3 of the text data to the first record, 2nd 1/3 to the 2nd record, and the 3rd 1/3 to the third record.
DATALENGTH
Is this even possible in SQL Server?
ArticleText
@HABO that's a great point about breaking work boundaries. We wouldn't care about a sentence being split up, but we'd certainly not want any words being split. The order of the resulting records does not matter.
– Stpete111
Sep 9 '18 at 3:13
Try starting with a recursive CTE that uses
CharIndex()
to find a space after some starting position, e.g. CharIndex( ' ', ArticleText, 9999990 )
, splitting the string at the blank and repeating until the string is consumed. It's not perfect since it doesn't recognize punctuation, other whitespace, ..., but it's a start. (And don't let those pesky thirds of a character keep you up at night.)– HABO
Sep 9 '18 at 3:45
CharIndex()
CharIndex( ' ', ArticleText, 9999990 )
1 Answer
1
You can use the following code to create a side table with the needed data:
CREATE TABLE #mockup (ID INT IDENTITY, [Source] INT, ArticleText NVARCHAR(MAX));
INSERT INTO #mockup([Source],ArticleText) VALUES
(100,'This is a very long text with many many words and it is still longer and longer and longer, and even longer and longer and longer')
,(200,'A short text')
,(300,'A medium text, just long enough to need a second part');
DECLARE @partSize INT=50;
WITH recCTE AS
(
SELECT ID,[Source]
,1 AS FragmentIndex
,A.Pos
,CASE WHEN A.Pos>0 THEN LEFT(ArticleText,A.Pos) ELSE ArticleText END AS Fragment
,CASE WHEN A.Pos>0 THEN SUBSTRING(ArticleText,A.Pos+2,DATALENGTH(ArticleText)/2) END AS RestString
FROM #mockup
CROSS APPLY(SELECT CASE WHEN DATALENGTH(ArticleText)/2 > @partSize
THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(ArticleText,@partSize)))
ELSE -1 END AS Pos) A
UNION ALL
SELECT r.ID,r.[Source]
,r.FragmentIndex+1
,A.Pos
,CASE WHEN A.Pos>0 THEN LEFT(r.RestString,A.Pos) ELSE r.RestString END
,CASE WHEN A.Pos>0 THEN SUBSTRING(r.RestString,A.Pos+2,DATALENGTH(r.RestString)/2) END AS RestString
FROM recCTE r
CROSS APPLY(SELECT CASE WHEN DATALENGTH(r.RestString)/2 > @partSize
THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(r.RestString,@partSize)))
ELSE -1 END AS Pos) A
WHERE DATALENGTH(r.RestString)>0
)
SELECT ID,[Source],FragmentIndex,Fragment
FROM recCTE
ORDER BY [Source],FragmentIndex;
GO
DROP TABLE #mockup
The result
+----+--------+---------------+---------------------------------------------------+
| ID | Source | FragmentIndex | Fragment |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 1 | This is a very long text with many many words and |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 2 | it is still longer and longer and longer, and |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 3 | even longer and longer and longer |
+----+--------+---------------+---------------------------------------------------+
| 2 | 200 | 1 | A short text |
+----+--------+---------------+---------------------------------------------------+
| 3 | 300 | 1 | A medium text, just long enough to need a second |
+----+--------+---------------+---------------------------------------------------+
| 3 | 300 | 2 | part |
+----+--------+---------------+---------------------------------------------------+
Now you have to update the existing line with the value at FragmentIndex=1
, while you have to insert the values of FragmentIndex>1
. Do this sorted by FragmentIndex and your IDENTITY
ID-column will reflect the correct order.
FragmentIndex=1
FragmentIndex>1
IDENTITY
Will try this tomorrow and revert with any questions/issues. Thanks Shnugo!
– Stpete111
Sep 9 '18 at 18:06
Thanks for contributing an answer to Stack Overflow!
But avoid …
To learn more, see our tips on writing great answers.
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
So you don't care about breaking on word boundaries? Do you plan on adding a column to store the order of the
ArticleText
segments or don't you care about reassembling them in any particular order?– HABO
Sep 9 '18 at 3:02