I have a dataset 10GB in size (for example, Test.txt).

I wrote my pyspark script as below (Test.py):

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = SparkSession.builder.appName("FilterProduct").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# read the file as a DataFrame of lines and convert it to an RDD
lines = spark.read.text("C:/Users/test/Desktop/Test.txt").rdd
# bring every row back to the driver
lines.collect()

Then I execute the above script using the command below:

spark-submit Test.py --executor-memory  12G 

Then I get the error below:

17/12/29 13:27:18 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/Test.txt, range: 402653184-536870912, partition values: [empty row]
17/12/29 13:27:18 INFO CodeGenerator: Code generated in 22.743725 ms
17/12/29 13:27:44 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3230)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
17/12/29 13:27:44 ERROR Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3230)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93

Please tell me how to resolve this.

14
Sai 29 Dec 2017, 11:25
5
When you use collect, the entire file is pulled to the driver node for the results, so it does not matter if you set the executor memory to 50GB. Remove it and your code will work.
 – 
philantrovert
29 Dec 2017, 11:38
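
For illustration, a minimal sketch of driver-safe alternatives to collect() (it reuses the path from the question; the output directory is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterProduct").getOrCreate()
lines = spark.read.text("C:/Users/test/Desktop/Test.txt").rdd

# bring only a small sample to the driver instead of the whole dataset
print(lines.take(10))

# aggregations run on the executors; only the final number reaches the driver
print(lines.count())

# full results belong on disk, not in driver memory
lines.map(lambda row: row.value).saveAsTextFile("C:/Users/test/Desktop/output")
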
I removed it, but I am still getting the same error.
 – 
Sai
29 Dec 2017, 12:27
Did you try reading the file by specifying partitions? Something like sc.textFile(file, numPartitions)
 – 
Usman Azhar
30 Dec 2017, 19:18
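
A slightly fuller version of that suggestion, assuming the SparkSession spark from the question (minPartitions is only a lower bound; Spark may create more partitions):

sc = spark.sparkContext
# ask Spark to split the file into at least 100 partitions
rdd = sc.textFile("C:/Users/test/Desktop/Test.txt", 100)
print(rdd.getNumPartitions())
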
Yes, I tried that, but I am still getting the same error.
 – 
Sai
1 Jan 2018, 10:34
I have also tried a 200MB file and I get the same error. Please help me with this. I am using pyspark on Windows 10 with 16GB RAM and an i3 processor. I have also tried increasing the driver and executor memory, but I still face the same problem.
 – 
Sai
4 Jan 2018, 13:07

3 Answers

You can try --conf "spark.driver.maxResultSize=20g". You should check the configurations on the Spark configuration page: spark.apache.org/docs/latest/configuration.html

In addition to this answer, I would suggest that you reduce the result of your tasks; otherwise you may run into trouble with serialization.
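
For example, a sketch (the sizes here are placeholders; tune them to your machine):

spark-submit --driver-memory 10g --conf "spark.driver.maxResultSize=20g" Test.py

and, inside the script, shrink what is sent back before collecting; the filter condition below is just an illustration, using the lines RDD from the question:

# only the filtered rows are serialized back to the driver
small = lines.filter(lambda row: "PRODUCT123" in row.value)
result = small.collect()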

2
Francesco Boi 9 Jan 2019, 15:59

Check in your apache-spark directory that you have the file apache-spark/2.4.0/libexec/conf/spark-defaults.conf, where 2.4.0 corresponds to your apache-spark version.

If this file does not exist, create it.

Then add at the end of the file: spark.driver.memory 12g

This should solve it without needing --executor-memory 12G: just run spark-submit Test.py.
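
For example, the file could look like this (the maxResultSize line is an optional extra, suggested here because of the error above):

spark.driver.memory          12g
spark.driver.maxResultSize   4g

You can verify that the value was picked up from inside PySpark:

print(spark.sparkContext.getConf().get("spark.driver.memory"))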

1
Francesco Boi 28 Nov 2019, 18:36

Did you check the JVM's maximum heap size while doing spark-submit? If the value you passed to spark-submit shows up there, it means the maximum heap size has been set correctly.

For example, if your setting is 4G, as in ./spark-submit --driver-memory 4G Test123.py, you should see -Xmx4G on the jvisualvm screen (the screenshot is not reproduced here).
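
If you want to check the same thing from inside the script rather than in jvisualvm, here is a sketch using PySpark's py4j gateway (note that _jvm is an internal PySpark handle, so treat this only as a debugging aid):

# max heap of the driver JVM, in bytes
max_heap = spark.sparkContext._jvm.java.lang.Runtime.getRuntime().maxMemory()
print("Driver JVM max heap: %.1f GB" % (max_heap / 1024.0 ** 3))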

Even if you set the maximum heap size correctly, you may then see a new error: Total size of serialized results of 7 tasks (1158.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

The stack trace is here:

C:\Users\test\Desktop>spark-submit Test123.py -v --executor-memory  4G --driver-memory 10G  --conf spark.driver.maxResultSize=2g
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/01/12 11:00:58 INFO SparkContext: Running Spark version 2.2.0
18/01/12 11:00:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/01/12 11:00:58 INFO SparkContext: Submitted application: FilterProduct
18/01/12 11:00:58 INFO SecurityManager: Changing view acls to: test
18/01/12 11:00:58 INFO SecurityManager: Changing modify acls to: test
18/01/12 11:00:58 INFO SecurityManager: Changing view acls groups to:
18/01/12 11:00:58 INFO SecurityManager: Changing modify acls groups to:
18/01/12 11:00:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(test); groups with view permissions: Set(); users  with modify permissions: Set(test); groups with modify permissions: Set()
18/01/12 11:00:59 INFO Utils: Successfully started service 'sparkDriver' on port 63460.
18/01/12 11:00:59 INFO SparkEnv: Registering MapOutputTracker
18/01/12 11:00:59 INFO SparkEnv: Registering BlockManagerMaster
18/01/12 11:00:59 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/01/12 11:00:59 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/01/12 11:00:59 INFO DiskBlockManager: Created local directory at C:\Users\test\AppData\Local\Temp\blockmgr-49e8e874-1361-4fe3-a8f5-02beca717299
18/01/12 11:00:59 INFO MemoryStore: MemoryStore started with capacity 4.1 GB
18/01/12 11:00:59 INFO SparkEnv: Registering OutputCommitCoordinator
18/01/12 11:01:00 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/01/12 11:01:00 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.179.1:4040
18/01/12 11:01:00 INFO SparkContext: Added file file:/C:/Users/test/Desktop/Test123.py at file:/C:/Users/test/Desktop/Test123.py with timestamp 1515735060704
18/01/12 11:01:00 INFO Utils: Copying C:\Users\test\Desktop\Test123.py to C:\Users\test\AppData\Local\Temp\spark-0c5cf93c-e443-4ab1-b2ea-58e1b91fa310\userFiles-c067f30d-21b5-4f60-ab0d-7df3081b1d2a\Test123.py
18/01/12 11:01:01 INFO Executor: Starting executor ID driver on host localhost
18/01/12 11:01:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63481.
18/01/12 11:01:01 INFO NettyBlockTransferService: Server created on 192.168.179.1:63481
18/01/12 11:01:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/01/12 11:01:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.179.1, 63481, None)
18/01/12 11:01:01 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.179.1:63481 with 4.1 GB RAM, BlockManagerId(driver, 192.168.179.1, 63481, None)
18/01/12 11:01:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.179.1, 63481, None)
18/01/12 11:01:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.179.1, 63481, None)
18/01/12 11:01:01 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/Users/test/Desktop/spark-warehouse/').
18/01/12 11:01:01 INFO SharedState: Warehouse path is 'file:/C:/Users/test/Desktop/spark-warehouse/'.
18/01/12 11:01:02 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
18/01/12 11:01:06 INFO FileSourceStrategy: Pruning directories with:
18/01/12 11:01:06 INFO FileSourceStrategy: Post-Scan Filters:
18/01/12 11:01:06 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
18/01/12 11:01:06 INFO FileSourceScanExec: Pushed Filters:
18/01/12 11:01:07 INFO CodeGenerator: Code generated in 385.116819 ms
18/01/12 11:01:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 277.7 KB, free 4.1 GB)
18/01/12 11:01:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KB, free 4.1 GB)
18/01/12 11:01:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.179.1:63481 (size: 23.4 KB, free: 4.1 GB)
18/01/12 11:01:08 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
18/01/12 11:01:08 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
18/01/12 11:01:08 INFO SparkContext: Starting job: collect at C:/Users/test/Desktop/Test123.py:15
18/01/12 11:01:09 INFO DAGScheduler: Got job 0 (collect at C:/Users/test/Desktop/Test123.py:15) with 72 output partitions
18/01/12 11:01:09 INFO DAGScheduler: Final stage: ResultStage 0 (collect at C:/Users/test/Desktop/Test123.py:15)
18/01/12 11:01:09 INFO DAGScheduler: Parents of final stage: List()
18/01/12 11:01:09 INFO DAGScheduler: Missing parents: List()
18/01/12 11:01:09 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at javaToPython at NativeMethodAccessorImpl.java:0), which has no missing parents
18/01/12 11:01:09 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.2 KB, free 4.1 GB)
18/01/12 11:01:09 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.9 KB, free 4.1 GB)
18/01/12 11:01:09 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.179.1:63481 (size: 3.9 KB, free: 4.1 GB)
18/01/12 11:01:09 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
18/01/12 11:01:09 INFO DAGScheduler: Submitting 72 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at javaToPython at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
18/01/12 11:01:09 INFO TaskSchedulerImpl: Adding task set 0.0 with 72 tasks
18/01/12 11:01:09 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:09 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:09 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:09 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:09 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/01/12 11:01:09 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
18/01/12 11:01:09 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
18/01/12 11:01:09 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/01/12 11:01:09 INFO Executor: Fetching file:/C:/Users/test/Desktop/Test123.py with timestamp 1515735060704
18/01/12 11:01:09 INFO Utils: C:\Users\test\Desktop\Test123.py has been previously copied to C:\Users\test\AppData\Local\Temp\spark-0c5cf93c-e443-4ab1-b2ea-58e1b91fa310\userFiles-c067f30d-21b5-4f60-ab0d-7df3081b1d2a\Test123.py
18/01/12 11:01:09 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 134217728-268435456, partition values: [empty row]
18/01/12 11:01:09 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 402653184-536870912, partition values: [empty row]
18/01/12 11:01:09 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 0-134217728, partition values: [empty row]
18/01/12 11:01:09 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 268435456-402653184, partition values: [empty row]
18/01/12 11:01:09 INFO CodeGenerator: Code generated in 19.618119 ms
18/01/12 11:01:33 INFO MemoryStore: Block taskresult_1 stored as bytes in memory (estimated size 198.1 MB, free 3.9 GB)
18/01/12 11:01:33 INFO BlockManagerInfo: Added taskresult_1 in memory on 192.168.179.1:63481 (size: 198.1 MB, free: 3.9 GB)
18/01/12 11:01:33 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 207722505 bytes result sent via BlockManager)
18/01/12 11:01:33 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:34 INFO MemoryStore: Block taskresult_0 stored as bytes in memory (estimated size 198.0 MB, free 3.7 GB)
18/01/12 11:01:34 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
18/01/12 11:01:34 INFO BlockManagerInfo: Added taskresult_0 in memory on 192.168.179.1:63481 (size: 198.0 MB, free: 3.7 GB)
18/01/12 11:01:34 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 207614031 bytes result sent via BlockManager)
18/01/12 11:01:34 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 536870912-671088640, partition values: [empty row]
18/01/12 11:01:34 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:34 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
18/01/12 11:01:34 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 671088640-805306368, partition values: [empty row]
18/01/12 11:01:34 INFO MemoryStore: Block taskresult_2 stored as bytes in memory (estimated size 198.1 MB, free 3.5 GB)
18/01/12 11:01:35 INFO BlockManagerInfo: Added taskresult_2 in memory on 192.168.179.1:63481 (size: 198.1 MB, free: 3.5 GB)
18/01/12 11:01:35 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 207773776 bytes result sent via BlockManager)
18/01/12 11:01:35 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:35 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
18/01/12 11:01:35 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 805306368-939524096, partition values: [empty row]
18/01/12 11:01:35 INFO MemoryStore: Block taskresult_3 stored as bytes in memory (estimated size 198.0 MB, free 3.3 GB)
18/01/12 11:01:35 INFO BlockManagerInfo: Added taskresult_3 in memory on 192.168.179.1:63481 (size: 198.0 MB, free: 3.3 GB)
18/01/12 11:01:35 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 207626541 bytes result sent via BlockManager)
18/01/12 11:01:35 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:35 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
18/01/12 11:01:35 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 939524096-1073741824, partition values: [empty row]
18/01/12 11:01:35 INFO TransportClientFactory: Successfully created connection to /192.168.179.1:63481 after 345 ms (0 ms spent in bootstraps)
18/01/12 11:01:40 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 30921 ms on localhost (executor driver) (1/72)
18/01/12 11:01:40 INFO BlockManagerInfo: Removed taskresult_0 on 192.168.179.1:63481 in memory (size: 198.0 MB, free: 3.5 GB)
18/01/12 11:01:42 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 33147 ms on localhost (executor driver) (2/72)
18/01/12 11:01:42 INFO BlockManagerInfo: Removed taskresult_1 on 192.168.179.1:63481 in memory (size: 198.1 MB, free: 3.7 GB)
18/01/12 11:01:44 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 34693 ms on localhost (executor driver) (3/72)
18/01/12 11:01:44 INFO BlockManagerInfo: Removed taskresult_3 on 192.168.179.1:63481 in memory (size: 198.0 MB, free: 3.9 GB)
18/01/12 11:01:46 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 36860 ms on localhost (executor driver) (4/72)
18/01/12 11:01:46 INFO BlockManagerInfo: Removed taskresult_2 on 192.168.179.1:63481 in memory (size: 198.1 MB, free: 4.1 GB)
18/01/12 11:01:55 INFO MemoryStore: Block taskresult_4 stored as bytes in memory (estimated size 198.1 MB, free 3.9 GB)
18/01/12 11:01:55 INFO BlockManagerInfo: Added taskresult_4 in memory on 192.168.179.1:63481 (size: 198.1 MB, free: 3.9 GB)
18/01/12 11:01:55 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 207685862 bytes result sent via BlockManager)
18/01/12 11:01:55 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:55 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
18/01/12 11:01:55 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 1073741824-1207959552, partition values: [empty row]
18/01/12 11:01:56 INFO MemoryStore: Block taskresult_5 stored as bytes in memory (estimated size 198.1 MB, free 3.7 GB)
18/01/12 11:01:56 INFO BlockManagerInfo: Added taskresult_5 in memory on 192.168.179.1:63481 (size: 198.1 MB, free: 3.7 GB)
18/01/12 11:01:56 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 207774348 bytes result sent via BlockManager)
18/01/12 11:01:57 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 5290 bytes)
18/01/12 11:01:57 ERROR TaskSetManager: Total size of serialized results of 6 tasks (1188.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
18/01/12 11:01:57 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
18/01/12 11:01:57 INFO BlockManagerInfo: Removed taskresult_5 on 192.168.179.1:63481 in memory (size: 198.1 MB, free: 3.9 GB)
18/01/12 11:01:57 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/copy/SaleLine/SaleLine, range: 1207959552-1342177280, partition values: [empty row]
18/01/12 11:01:57 INFO TaskSchedulerImpl: Cancelling stage 0
18/01/12 11:01:57 INFO TaskSchedulerImpl: Stage 0 was cancelled
18/01/12 11:01:57 INFO Executor: Executor is trying to kill task 9.0 in stage 0.0 (TID 9), reason: stage cancelled
18/01/12 11:01:57 INFO Executor: Executor is trying to kill task 6.0 in stage 0.0 (TID 6), reason: stage cancelled
18/01/12 11:01:57 INFO Executor: Executor killed task 9.0 in stage 0.0 (TID 9), reason: stage cancelled
18/01/12 11:01:57 INFO DAGScheduler: ResultStage 0 (collect at C:/Users/test/Desktop/Test123.py:15) failed in 47.966 s due to Job aborted due to stage failure: Total size of serialized results of 6 tasks (1188.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
18/01/12 11:01:57 INFO MemoryStore: Block taskresult_7 stored as bytes in memory (estimated size 198.1 MB, free 3.7 GB)
18/01/12 11:01:57 INFO Executor: Executor is trying to kill task 7.0 in stage 0.0 (TID 7), reason: stage cancelled
18/01/12 11:01:57 INFO BlockManagerInfo: Added taskresult_7 in memory on 192.168.179.1:63481 (size: 198.1 MB, free: 3.7 GB)
18/01/12 11:01:57 INFO Executor: Executor is trying to kill task 8.0 in stage 0.0 (TID 8), reason: stage cancelled
18/01/12 11:01:57 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 207695395 bytes result sent via BlockManager)
18/01/12 11:01:57 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, localhost, executor driver): TaskKilled (stage cancelled)
18/01/12 11:01:57 INFO DAGScheduler: Job 0 failed: collect at C:/Users/test/Desktop/Test123.py:15, took 48.556068 s
18/01/12 11:01:57 INFO Executor: Executor killed task 8.0 in stage 0.0 (TID 8), reason: stage cancelled
18/01/12 11:01:57 ERROR TaskSetManager: Total size of serialized results of 7 tasks (1386.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
18/01/12 11:01:57 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, localhost, executor driver): TaskKilled (stage cancelled)
18/01/12 11:01:57 INFO BlockManagerInfo: Removed taskresult_7 on 192.168.179.1:63481 in memory (size: 198.1 MB, free: 3.9 GB)
Traceback (most recent call last):
  File "C:/Users/test/Desktop/Test123.py", line 15, in <module>
    lines.collect()
  File "D:\workspace\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 809, in collect
  File "D:\workspace\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "D:\workspace\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
  File "D:\workspace\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 6 tasks (1188.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
18/01/12 11:01:57 INFO MemoryStore: Block taskresult_6 stored as bytes in memory (estimated size 198.1 MB, free 3.5 GB)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
18/01/12 11:01:57 INFO BlockManagerInfo: Added taskresult_6 in memory on 192.168.179.1:63481 (size: 198.1 MB, free: 3.7 GB)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)

18/01/12 11:01:57 INFO SparkContext: Invoking stop() from shutdown hook
18/01/12 11:01:57 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 207681096 bytes result sent via BlockManager)
18/01/12 11:01:57 ERROR TaskSetManager: Total size of serialized results of 8 tasks (1584.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
18/01/12 11:01:57 INFO SparkUI: Stopped Spark web UI at http://192.168.179.1:4040
18/01/12 11:01:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/01/12 11:01:57 INFO BlockManagerInfo: Removed taskresult_6 on 192.168.179.1:63481 in memory (size: 198.1 MB, free: 3.9 GB)
18/01/12 11:01:57 ERROR Utils: Uncaught exception in thread task-result-getter-1
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
        at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:105)
        at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:642)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Exception in thread "task-result-getter-1" java.lang.Error: java.lang.InterruptedException
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
        at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:105)
        at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:642)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        ... 2 more
18/01/12 11:01:57 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 192.168.179.1, 63481, None),taskresult_6,StorageLevel(1 replicas),0,0))
18/01/12 11:01:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/12 11:01:58 ERROR TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=882113328004, chunkIndex=0}, buffer=org.apache.spark.storage.BlockManagerManagedBuffer@f00f35c} to /192.168.179.1:63484; closing connection
java.nio.channels.ClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.close(...)(Unknown Source)
18/01/12 11:01:58 INFO MemoryStore: MemoryStore cleared
18/01/12 11:01:58 INFO BlockManager: BlockManager stopped
18/01/12 11:01:58 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/12 11:01:58 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /192.168.179.1:63481 is closed
18/01/12 11:01:58 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/01/12 11:01:58 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
18/01/12 11:01:58 INFO SparkContext: Successfully stopped SparkContext
18/01/12 11:01:58 INFO ShutdownHookManager: Shutdown hook called
18/01/12 11:01:58 INFO ShutdownHookManager: Deleting directory C:\Users\test\AppData\Local\Temp\spark-0c5cf93c-e443-4ab1-b2ea-58e1b91fa310\pyspark-90628145-a2e4-437f-bbe5-ef41ed4ba64f
18/01/12 11:01:58 INFO ShutdownHookManager: Deleting directory C:\Users\test\AppData\Local\Temp\spark-0c5cf93c-e443-4ab1-b2ea-58e1b91fa310

You can adjust the spark.driver.maxResultSize value like this: spark-submit --driver-memory 2g --conf "spark.driver.maxResultSize=2g" Test.py

If you still get an error, you can share the verbose output from this command: spark-submit --driver-memory 10g --conf "spark.driver.maxResultSize=2g" Test.py so that we can see the specific conditions of your platform.
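
Alternatively, spark.driver.maxResultSize can be set in the script itself, as in this sketch, provided it is set before the SparkSession is created. Note that spark.driver.memory cannot be raised this way in client mode, because the driver JVM has already started by then; that one has to stay on the command line or in spark-defaults.conf:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("FilterProduct")
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())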

-2
Sai 12 Jan 2018, 10:38
Thanks for the answer. I am using pyspark with spark-2.2.0, and I have tried all the options, increasing both driver and executor memory, but I am still getting the error.
 – 
Sai
10 Jan 2018, 09:34
I have revised the answer. You can check it.
 – 
Mehmet Sunkur
10 Jan 2018, 19:11
Could you add the parameter -v (for verbose output) when starting spark-submit and share the full output?
 – 
Mehmet Sunkur
11 Jan 2018, 13:38
I have added the verbose output, please take a look.
 – 
Sai
12 Jan 2018, 08:44