Spark - Reading partitioned data from S3 - how does partitioning happen?

























When I use Spark to read multiple files from S3 (e.g. a directory with many Parquet files) -

Does the logical partitioning happen at the beginning, then each executor downloads the data directly (on the worker node)?

Or does the driver download the data (partially or fully) and only then partitions and sends the data to the executors?



Also, will the partitioning default to the same partitions that were used for write (i.e. each file = 1 partition)?










      apache-spark amazon-s3






      edited Nov 12 '18 at 15:53









      thebluephantom











      asked Nov 11 '18 at 9:24









      user976850






















          1 Answer






































          Data on S3 is, of course, external to HDFS.



          You can read from S3 by providing a path (or several paths), or via the Hive Metastore: register an external table over the S3 location via DDL, then update its partitions with MSCK for partitions, or ALTER TABLE table_name RECOVER PARTITIONS for Hive on EMR.
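          As a sketch of the Metastore route (the table name, columns, and bucket here are hypothetical, not from the question), registering partitioned S3 data might look like:

          ```sql
          -- Hypothetical external table over partitioned Parquet data on S3
          CREATE EXTERNAL TABLE events (
            user_id BIGINT,
            payload STRING
          )
          PARTITIONED BY (dt STRING)
          STORED AS PARQUET
          LOCATION 's3://my-bucket/events/';

          -- Scan the S3 location for partition directories (dt=...) and register them
          MSCK REPAIR TABLE events;
          ```

          After this, queries against the table can prune partitions using the metadata alone, without listing the whole bucket.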



          If you use:



          val df = spark.read.parquet("/path/to/parquet/file.../...")


          then there is no guarantee on partitioning; it depends on various settings. See Does Spark maintain parquet partitioning on read?, noting that these APIs evolve and improve over time.
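          To make "various settings" concrete: the number of read partitions is driven mainly by spark.sql.files.maxPartitionBytes, spark.sql.files.openCostInBytes, and the default parallelism. The following is a simplified model of Spark's file-split packing in plain Python, not Spark's exact code, with made-up file sizes:

          ```python
          def plan_partitions(file_sizes,
                              max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                              open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes
                              default_parallelism=8):
              """Simplified model of how Spark sizes read partitions for file sources."""
              total = sum(size + open_cost for size in file_sizes)
              bytes_per_core = total // default_parallelism
              max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))

              # Split each (splittable) file into chunks of at most max_split bytes
              splits = []
              for size in file_sizes:
                  offset = 0
                  while offset < size:
                      splits.append(min(max_split, size - offset))
                      offset += max_split

              # Pack splits into partitions, largest first, closing a partition
              # when adding the next split would overflow max_split
              partitions, current, current_size = [], [], 0
              for s in sorted(splits, reverse=True):
                  if current and current_size + s > max_split:
                      partitions.append(current)
                      current, current_size = [], 0
                  current.append(s)
                  current_size += s + open_cost
              if current:
                  partitions.append(current)
              return max_split, partitions

          # Four 200 MB Parquet files with default settings
          max_split, parts = plan_partitions([200 * 1024 * 1024] * 4)
          # → 8 partitions of roughly 100 MB each, not 4 (one per file)
          ```

          The point of the sketch: the one-file-one-partition mapping is not guaranteed; large files get split and small files get coalesced according to these settings.
          
          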



          But, this:



          val df = spark.read.parquet("/path/to/parquet/file.../.../partitioncolumn=*")


          will distribute partitions across executors according to your saved partition structure, somewhat like Spark's bucketBy.
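          For context (bucket, column, and file names below are hypothetical), a partitioned write produces a directory layout like this, which is what the partitioncolumn=* glob matches:

          ```
          s3://my-bucket/events/
          ├── dt=2018-11-10/
          │   ├── part-00000-....parquet
          │   └── part-00001-....parquet
          └── dt=2018-11-11/
              └── part-00000-....parquet
          ```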



          When you point Spark at S3 directly, the Driver fetches only the metadata, not the data.



          In your terms:



          • "... each executor downloads the data directly (on the worker node)?" YES.

          • The Driver, coordinating with other system components, obtains metadata for the file and directory locations on S3, but the data is not first downloaded to the Driver; that would be a serious design flaw. How the APIs behave also depends on the form of the statement.





                edited Nov 12 '18 at 13:55

























                answered Nov 12 '18 at 13:42









                thebluephantom


























