Break up the data in a database column of a record into multiple records

Azure SQL Server - we have a table like this:

MyTable:

ID Source ArticleText -- ------ ----------- 1 100 <nvarchar(max) field with unstructured text from media articles> 2 145 " 3 866 " 4 232 "

ID column is the primary key and auto-increments on INSERTS.

ID

INSERTS

I run this query to find the records with the largest data size in the ArticleText column:

ArticleText

SELECT TOP 500 ID, Source, DATALENGTH(ArticleText)/1048576 AS Size_in_MB FROM MyTable ORDER BY DATALENGTH(ArticleText) DESC

We are finding that for many reasons both technical and practical, the data in the ArticleText column is just too big in certain records. The above query allows me to look at a range of sizes for our largest records, which I'll need to know for what I'm trying to formulate here.

ArticleText

The feat I need to accomplish is, for all existing records in this table, any record whose ArticleText DATALENGTH is greater than X, break that record into X amount of records where each record will then contain the same value in the Source column, but have the data in the ArticleText column split up across those records in smaller chunks.

ArticleText

DATALENGTH

Source

ArticleText

How would one achieve this if the exact requirement was say, take all records whose ArticleText DATALENGTH is greater than 10MB, and break each into 3 records where the resulting records' Source column value is the same across the 3 records, but the ArticleText data is separated into three chunks.

ArticleText

DATALENGTH

Source

ArticleText

In essence, we would need to divide the DATALENGTH by 3 and apply the first 1/3 of the text data to the first record, 2nd 1/3 to the 2nd record, and the 3rd 1/3 to the third record.

DATALENGTH

Is this even possible in SQL Server?

So you don't care about breaking on word boundaries? Do you plan on adding a column to store the order of the ArticleText segments or don't you care about reassembling them in any particular order?

– HABO
Sep 9 '18 at 3:02

ArticleText

@HABO that's a great point about breaking work boundaries. We wouldn't care about a sentence being split up, but we'd certainly not want any words being split. The order of the resulting records does not matter.

– Stpete111
Sep 9 '18 at 3:13

Try starting with a recursive CTE that uses CharIndex() to find a space after some starting position, e.g. CharIndex( ' ', ArticleText, 9999990 ), splitting the string at the blank and repeating until the string is consumed. It's not perfect since it doesn't recognize punctuation, other whitespace, ..., but it's a start. (And don't let those pesky thirds of a character keep you up at night.)

– HABO
Sep 9 '18 at 3:45

CharIndex()

CharIndex( ' ', ArticleText, 9999990 )

1 Answer
1

You can use the following code to create a side table with the needed data:

CREATE TABLE #mockup (ID INT IDENTITY, [Source] INT, ArticleText NVARCHAR(MAX)); INSERT INTO #mockup([Source],ArticleText) VALUES (100,'This is a very long text with many many words and it is still longer and longer and longer, and even longer and longer and longer') ,(200,'A short text') ,(300,'A medium text, just long enough to need a second part'); DECLARE @partSize INT=50; WITH recCTE AS ( SELECT ID,[Source] ,1 AS FragmentIndex ,A.Pos ,CASE WHEN A.Pos>0 THEN LEFT(ArticleText,A.Pos) ELSE ArticleText END AS Fragment ,CASE WHEN A.Pos>0 THEN SUBSTRING(ArticleText,A.Pos+2,DATALENGTH(ArticleText)/2) END AS RestString FROM #mockup CROSS APPLY(SELECT CASE WHEN DATALENGTH(ArticleText)/2 > @partSize THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(ArticleText,@partSize))) ELSE -1 END AS Pos) A UNION ALL SELECT r.ID,r.[Source] ,r.FragmentIndex+1 ,A.Pos ,CASE WHEN A.Pos>0 THEN LEFT(r.RestString,A.Pos) ELSE r.RestString END ,CASE WHEN A.Pos>0 THEN SUBSTRING(r.RestString,A.Pos+2,DATALENGTH(r.RestString)/2) END AS RestString FROM recCTE r CROSS APPLY(SELECT CASE WHEN DATALENGTH(r.RestString)/2 > @partSize THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(r.RestString,@partSize))) ELSE -1 END AS Pos) A WHERE DATALENGTH(r.RestString)>0 ) SELECT ID,[Source],FragmentIndex,Fragment FROM recCTE ORDER BY [Source],FragmentIndex; GO DROP TABLE #mockup

The result

+----+--------+---------------+---------------------------------------------------+ | ID | Source | FragmentIndex | Fragment | +----+--------+---------------+---------------------------------------------------+ | 1 | 100 | 1 | This is a very long text with many many words and | +----+--------+---------------+---------------------------------------------------+ | 1 | 100 | 2 | it is still longer and longer and longer, and | +----+--------+---------------+---------------------------------------------------+ | 1 | 100 | 3 | even longer and longer and longer | +----+--------+---------------+---------------------------------------------------+ | 2 | 200 | 1 | A short text | +----+--------+---------------+---------------------------------------------------+ | 3 | 300 | 1 | A medium text, just long enough to need a second | +----+--------+---------------+---------------------------------------------------+ | 3 | 300 | 2 | part | +----+--------+---------------+---------------------------------------------------+

Now you have to update the existing line with the value at FragmentIndex=1, while you have to insert the values of FragmentIndex>1. Do this sorted by FragmentIndex and your IDENTITY ID-column will reflect the correct order.

FragmentIndex=1

FragmentIndex>1

IDENTITY

Will try this tomorrow and revert with any questions/issues. Thanks Shnugo!

– Stpete111
Sep 9 '18 at 18:06

Thanks for contributing an answer to Stack Overflow!

But avoid …

To learn more, see our tips on writing great answers.

Required, but never shown

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt